huggingface / dataset-viewer

Lightweight web API for visualizing and exploring any dataset - computer vision, speech, text, and tabular - stored on the Hugging Face Hub
https://huggingface.co/docs/datasets-server
Apache License 2.0

Remove or increase the 5GB limit? #2878

Open severo opened 1 month ago

severo commented 1 month ago

The dataset viewer shows statistics and provides filter + sort + search only for the first 5GB of each split. We are also unable to provide the exact number of rows for bigger splits.

Note that we do "show" all the rows for Parquet-native datasets: the rows can be accessed randomly, so pagination works over the whole split.

Should we provide a way to increase or remove this limit?

kargaranamir commented 1 month ago

Please do this. If that is not possible, at least do it for special datasets on request. For example, some special datasets can currently have a Python loader script and a dataset viewer at the same time. Raising the limit could be handled through a similar on-request process.

julien-c commented 1 month ago

only for Parquet-native datasets maybe?

kargaranamir commented 1 month ago

That works.

severo commented 1 month ago

> only for Parquet-native datasets maybe?

I improved the description, because we already allow "viewing" all the rows of Parquet-native datasets. What is missing is the rest: statistics, filter, sort, search, and SQL queries in general, because they run on the DuckDB export, which is limited to the first 5GB.
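
To make the distinction concrete, here is a minimal sketch of the requests involved. It assumes the public dataset viewer API host and its `/rows`, `/filter`, and `/search` endpoints with the parameters shown. The dataset name, the `label` column, and the `viewer_url` helper are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed public host of the dataset viewer API.
BASE_URL = "https://datasets-server.huggingface.co"


def viewer_url(endpoint: str, **params) -> str:
    """Build a dataset viewer API URL (hypothetical helper; sends no request)."""
    return f"{BASE_URL}/{endpoint}?{urlencode(params)}"


# Hypothetical dataset coordinates, for illustration only.
split = {"dataset": "user/big-dataset", "config": "default", "split": "train"}

# Pagination: per this thread, every row of a Parquet-native split can be viewed.
rows_url = viewer_url("rows", **split, offset=0, length=100)

# Filter and search: per this thread, these run on the DuckDB export,
# so they only cover the first 5GB of the split.
filter_url = viewer_url("filter", **split, where='"label"=1', offset=0, length=100)
search_url = viewer_url("search", **split, query="hello", offset=0, length=100)

print(rows_url)
print(filter_url)
print(search_url)
```

On a split larger than 5GB, the first URL can page through every row, while the other two only see the part of the split that was exported to DuckDB.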

github-actions[bot] commented 2 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.