severo opened this issue 7 months ago
Interestingly, `TooBigContentError` is not only used for first-rows steps:
```js
db.cachedResponsesBlue.aggregate([
  { $match: { error_code: "TooBigContentError", "details.copied_from_artifact": { $exists: false } } },
  {
    $group: {
      _id: { kind: "$kind", dataset: "$dataset" },
      count: { $sum: 1 },
    },
  },
  { $group: { _id: { kind: "$_id.kind" }, count: { $sum: 1 } } },
  { $sort: { "_id.kind": 1, count: -1 } },
  { $project: { _id: 0, kind: "$_id.kind", num_datasets: "$count" } }
]);
```

Output:

```js
{ kind: 'config-parquet-and-info', num_datasets: 11 }
{ kind: 'split-descriptive-statistics', num_datasets: 28 }
{ kind: 'split-first-rows-from-parquet', num_datasets: 670 }
{ kind: 'split-first-rows-from-streaming', num_datasets: 67 }
{ kind: 'split-opt-in-out-urls-scan', num_datasets: 1 }
```
Analysis of https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae: it contains two columns, one of which is a ClassLabel with 4270 labels that have very long names, such as `05730_Plantae_Bryophyta_Bryopsida_Bryales_Bryaceae_Rhodobryum_ontariense`
(see https://datasets-server.huggingface.co/info?dataset=mikehemberger/inat_2021_train_mini_plantae)
So: the `features` field is too big, not the `rows` field.
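A minimal sketch of why the `features` field alone can blow past a size limit (the feature dict shape below is illustrative, not the exact structure stored by the server): with 4270 labels of ~70+ characters each, the serialized features already weigh hundreds of KiB before a single row is counted.

```python
import json

# Hypothetical reconstruction of the features metadata for a dataset like
# inat_2021_train_mini_plantae: one image column plus one ClassLabel column
# whose label names are very long strings.
label_names = [
    f"{i:05d}_Plantae_Bryophyta_Bryopsida_Bryales_Bryaceae_Rhodobryum_ontariense"
    for i in range(4270)
]
features = {
    "image": {"_type": "Image"},
    "label": {"_type": "ClassLabel", "names": label_names},
}

# Measure the JSON payload size of the features field alone.
features_bytes = len(json.dumps(features).encode("utf-8"))
print(f"serialized features: {features_bytes / 1024:.0f} KiB")
```

Even this toy version lands well over 200 KiB, so any response that embeds the full features list pays that cost regardless of how few rows are returned.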
See https://huggingface.co/datasets/Cnam-LMSSC/vibravox/discussions/3. It's a dataset with 6 audio columns.
A detail: depending on the config, the first page (from /first-rows) sometimes exists while the other pages (from /rows) always fail, even though the data size should be about the same. This seems inconsistent.
Another example: https://huggingface.co/datasets/UCSC-VLAA/HQ-Edit/discussions/2
The issue is with the image columns.
Would it be feasible to make the default number of rows displayed in the dataset viewer configurable?
For instance, the dataset maintainer could set the number of rows for datasets with large rows. This adjustment could mitigate performance issues and avoid dataset-viewer errors.
The number of rows could even be calculated dynamically, based on the largest row group size (e.g., max = 286.10 MiB)?
Or is it a bad idea?
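The dynamic variant could be sketched like this (all names and limits here are hypothetical, not the server's actual configuration): shrink the number of displayed rows so the response stays under a fixed byte budget.

```python
def dynamic_row_count(avg_row_bytes: int,
                      budget_bytes: int = 200 * 1024,
                      default_rows: int = 100,
                      min_rows: int = 1) -> int:
    """Return how many rows fit in budget_bytes, capped at default_rows.

    All parameters are illustrative defaults, not the viewer's real limits.
    """
    if avg_row_bytes <= 0:
        # No size information: fall back to the default page size.
        return default_rows
    return max(min_rows, min(default_rows, budget_bytes // avg_row_bytes))

# A dataset with ~50 KiB rows (e.g. several audio or image columns)
# would show only a handful of rows instead of the full default page.
print(dynamic_row_count(50 * 1024))
```

The trade-off is that very heavy rows would reduce the page to a few rows (or even one), which is still arguably better than an error page.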
Hi, I think we just hit the same issue mentioned here: https://huggingface.co/datasets/CATMuS/medieval-segmentation/viewer/default/train
See https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae
Initially reported in https://github.com/huggingface/datasets-server/issues/1957, but the issue is somewhat different.