huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Remove or increase the 5GB limit? #2878

Closed (severo closed this 4 months ago)

severo commented 5 months ago

The dataset viewer shows statistics and provides filter + sort + search only for the first 5GB of each split. We are also unable to provide the exact number of rows for bigger splits.

Note that we "show" all the rows for Parquet-native datasets (that is, we can access the rows randomly, which gives us pagination).

Should we provide a way to increase or remove this limit?

kargaranamir commented 5 months ago

Please do this. If it is not possible in general, at least allow it for special datasets on request. For example, special datasets can currently have a Python file loader and a data viewer at the same time. A similar request process could apply here for some datasets.

julien-c commented 5 months ago

only for Parquet-native datasets maybe?

kargaranamir commented 5 months ago

That works.

severo commented 5 months ago

> only for Parquet-native datasets maybe?

I improved the description, because we already allow "viewing" all the rows for Parquet-native datasets. What we miss is the rest: stats, filter, sort, search, and SQL queries in general, because they are run on the 5GB DuckDB export.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

kargaranamir commented 4 months ago

> only for Parquet-native datasets maybe?

> I improved the description, because we already allow "viewing" all the rows for Parquet-native datasets. What we miss is the rest: stats, filter, sort, search, and SQL queries in general, because they are run on the 5GB DuckDB export.

Can we open this issue again? HF already allows "viewing" all the rows for Parquet-native datasets, but the stats are wrong because they are computed only on the 5GB DuckDB export.

severo commented 4 months ago

We already have an (internal) issue to fix the display (show in the UI if the stats are partial).

I don't think we plan to ever compute the stats on the complete data instead of the first 5GB. wdyt @huggingface/dataset-viewer?

lhoestq commented 4 months ago

computing on full datasets is too expensive, we should focus on making the UI clearer IMO
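
Until the UI makes this clearer, one way to check whether a split's statistics are affected is to look at the API response directly. The following is a minimal sketch, not the viewer's own code. It assumes the public API lives at `https://datasets-server.huggingface.co`, that it has a `/statistics` endpoint taking `dataset`, `config` and `split`, and that the JSON response carries a boolean `partial` field and a `num_examples` count. The helper names and the sample responses are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed base URL of the public dataset viewer API.
API_BASE = "https://datasets-server.huggingface.co"


def statistics_url(dataset: str, config: str, split: str) -> str:
    """Build the URL of the (assumed) /statistics endpoint for one split."""
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{API_BASE}/statistics?{query}"


def describe_coverage(response: dict) -> str:
    """Say whether the statistics cover the whole split or only part of it.

    `partial` and `num_examples` are assumed field names. If `partial` is
    missing, the coverage is reported as unknown instead of being guessed.
    """
    partial = response.get("partial")
    rows = response.get("num_examples")
    if partial is None:
        return "coverage unknown"
    if partial:
        return f"partial: statistics cover only {rows} rows of the split"
    return f"complete: statistics cover all {rows} rows"


# Hypothetical responses, hard-coded so the sketch runs without network access.
sample_partial = {"num_examples": 1000, "partial": True, "statistics": []}
sample_complete = {"num_examples": 250, "partial": False, "statistics": []}

print(statistics_url("user/dataset", "default", "train"))
print(describe_coverage(sample_partial))
print(describe_coverage(sample_complete))
```

To use it against a real dataset, fetch the URL returned by `statistics_url` with any HTTP client and pass the decoded JSON to `describe_coverage`.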