Closed severo closed 4 months ago
Please do this. If it's not possible in general, at least do it for special datasets on request. For example, special datasets can now have a Python file loader and a data viewer at the same time. This could be handled through a similar process for some datasets.
only for Parquet-native datasets maybe?
That works.
I improved the description, because we already allow "viewing" all the rows for Parquet-native datasets. What's missing is the rest: stats, filter, sort, search, and SQL queries in general, because they are run on the 5GB DuckDB export.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Can we reopen this issue? HF already allows "viewing" all the rows for Parquet-native datasets, but the stats are wrong because they are run on the 5GB DuckDB export.
We already have an (internal) issue to fix the display (show in the UI if the stats are partial).
I don't think we plan to ever compute the stats on the complete data instead of the first 5GB. wdyt @huggingface/dataset-viewer?
computing on full datasets is too expensive, we should focus on making the UI clearer IMO
The dataset viewer shows statistics and provides filter + sort + search only for the first 5GB of each split. We are also unable to provide the exact number of rows for bigger splits.
Note that we "show" all the rows for parquet-native datasets (i.e., we can access the rows randomly, i.e., we have pagination).
Should we provide a way to increase or remove this limit?
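To make the "partial stats" point concrete, here is a rough sketch of what a size-capped computation could look like. This is not the dataset viewer's actual code: `Shard`, `select_shards`, `column_stats`, and the rule of cutting at whole-file boundaries are all assumptions made up for illustration. Only the 5GB figure comes from this thread.

```python
from dataclasses import dataclass
from statistics import mean

# The thread mentions a 5GB limit per split; the exact byte count is an assumption.
MAX_EXPORT_BYTES = 5 * 1024**3


@dataclass
class Shard:
    """Hypothetical stand-in for one Parquet file of a split."""
    num_bytes: int
    values: list[float]


def select_shards(shards, max_bytes=MAX_EXPORT_BYTES):
    """Keep the leading shards whose cumulative size fits in the budget.

    Returns the kept shards and a flag telling whether anything was left out.
    """
    selected, total = [], 0
    for shard in shards:
        if total + shard.num_bytes > max_bytes:
            break
        selected.append(shard)
        total += shard.num_bytes
    return selected, len(selected) < len(shards)


def column_stats(shards, max_bytes=MAX_EXPORT_BYTES):
    """Compute stats on the capped subset and report whether they are partial."""
    selected, partial = select_shards(shards, max_bytes)
    values = [v for shard in selected for v in shard.values]
    return {
        "partial": partial,
        "num_rows": len(values),
        "mean": mean(values) if values else None,
    }
```

With three toy shards of 3 bytes each and a 6-byte budget, the third shard is dropped, so `num_rows` and `mean` only describe the first two shards and `partial` is `True`. That flag is the kind of information the UI could surface next to the stats.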