huggingface / dataset-viewer

Lightweight web API for visualizing and exploring any dataset - computer vision, speech, text, and tabular - stored on the Hugging Face Hub
https://huggingface.co/docs/datasets-server
Apache License 2.0

Ensure /first-rows can always be created when truncating (i.e., avoid TooBigContentError) #2215

Open severo opened 7 months ago

severo commented 7 months ago

See https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae

The size of the content of the first rows (358268 B) exceeds the maximum supported size (200000 B) even after truncation. Please report the issue.

Error code:   TooBigContentError

Initially reported in https://github.com/huggingface/datasets-server/issues/1957, but the issue is somewhat different.
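To make the failure mode concrete, here is a minimal sketch of row-level truncation under a byte budget. It is not the project's actual implementation: the helper names `json_size` and `truncate_rows`, the payload shape, and the budget value are all assumptions made for illustration, and the size estimate ignores the few bytes of JSON wrapping around the fields. The point it shows is that dropping rows can only help while the fixed part of the response (here, `features`) fits in the budget on its own.

```python
import json


def json_size(obj) -> int:
    """Size in bytes of the UTF-8 encoded JSON serialization of obj."""
    return len(json.dumps(obj, ensure_ascii=False).encode("utf-8"))


def truncate_rows(rows, features, max_bytes):
    """Keep the longest prefix of rows such that features + rows fit in max_bytes.

    Hypothetical helper: returns (kept_rows, truncated_flag).
    Raises ValueError when the features alone exceed the budget,
    since no amount of row truncation can fix that case.
    """
    used = json_size(features)
    if used > max_bytes:
        raise ValueError("features alone exceed the size budget")
    kept = []
    for row in rows:
        size = json_size(row)
        if used + size > max_bytes:
            break
        kept.append(row)
        used += size
    return kept, len(kept) < len(rows)


# Hypothetical data: ten rows of about 1 kB each and a 5 kB budget.
example_features = [{"name": "text"}]
example_rows = [{"text": "x" * 1000} for _ in range(10)]
kept_rows, was_truncated = truncate_rows(example_rows, example_features, max_bytes=5000)
```

With these made-up numbers, only a prefix of the rows is kept and the truncation flag is set. When `features` is itself larger than the budget, the function raises instead of returning an empty page.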

severo commented 5 months ago

Interestingly, TooBigContentError is not only used for first-rows steps:

db.cachedResponsesBlue.aggregate([
  // Keep only original TooBigContentError entries (not copied from another artifact).
  { $match: { error_code: "TooBigContentError", "details.copied_from_artifact": { $exists: false } } },
  // Collapse to one document per (kind, dataset) pair.
  {
    $group: {
      _id: { kind: "$kind", dataset: "$dataset" },
      count: { $sum: 1 },
    },
  },
  // Count the distinct datasets for each kind.
  { $group: { _id: { kind: "$_id.kind" }, count: { $sum: 1 } } },
  { $sort: { "_id.kind": 1, count: -1 } },
  { $project: { _id: 0, kind: "$_id.kind", num_datasets: "$count" } }
]);
{ kind: 'config-parquet-and-info', num_datasets: 11 }
{ kind: 'split-descriptive-statistics', num_datasets: 28 }
{ kind: 'split-first-rows-from-parquet', num_datasets: 670 }
{ kind: 'split-first-rows-from-streaming', num_datasets: 67 }
{ kind: 'split-opt-in-out-urls-scan', num_datasets: 1 }

severo commented 5 months ago

Analysis of https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae: it contains two columns. One of them is a ClassLabel with 4270 labels that have very long names, such as 05730_Plantae_Bryophyta_Bryopsida_Bryales_Bryaceae_Rhodobryum_ontariense (see https://datasets-server.huggingface.co/info?dataset=mikehemberger/inat_2021_train_mini_plantae).

So: the features field is too big, not the rows one.
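A quick way to check which part of a response dominates is to measure the serialized size of each top-level field. The sketch below does this on a synthetic payload. The payload shape is an assumption modeled on a first-rows style response, the helper `field_sizes` is hypothetical, and the 4270 label names are generated by repeating the one example name above with a different numeric prefix, so the real dataset's sizes will differ.

```python
import json


def field_sizes(response: dict) -> dict:
    """Return the UTF-8 JSON size in bytes of each top-level field of response."""
    return {
        key: len(json.dumps(value, ensure_ascii=False).encode("utf-8"))
        for key, value in response.items()
    }


# Synthetic stand-in for the ClassLabel: 4270 long names (assumed, not the real list).
label_names = [
    f"{i:05d}_Plantae_Bryophyta_Bryopsida_Bryales_Bryaceae_Rhodobryum_ontariense"
    for i in range(4270)
]

# Assumed payload shape; the rows carry only a small integer label each.
synthetic_response = {
    "features": [
        {"feature_idx": 0, "name": "label", "type": {"_type": "ClassLabel", "names": label_names}}
    ],
    "rows": [
        {"row_idx": i, "row": {"label": i}, "truncated_cells": []} for i in range(100)
    ],
}

sizes = field_sizes(synthetic_response)
for name, size in sizes.items():
    print(f"{name}: {size} B")
```

On this synthetic payload the `features` entry alone is larger than the 200000 B limit quoted in the error message, while the `rows` entry stays small, which matches the diagnosis above.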

severo commented 3 months ago

See https://huggingface.co/datasets/Cnam-LMSSC/vibravox/discussions/3. It's a dataset with 6 audio columns.

One detail: the first page (from /first-rows) sometimes exists, depending on the config, whereas the following pages (from /rows) always fail, even though the data size should be about the same. This seems inconsistent.

severo commented 3 months ago

Another example: https://huggingface.co/datasets/UCSC-VLAA/HQ-Edit/discussions/2

The issue is with the image columns.

severo commented 2 months ago

image column:

zinc75 commented 1 month ago

Would it be feasible to make the default number of rows displayed in the dataset viewer configurable?

For instance, the dataset maintainer could set the number of rows for datasets whose rows hold large data. This adjustment could mitigate performance issues and avoid dataset-viewer errors.

The number of rows could even be calculated dynamically, based on the largest row group size (e.g., max = 286.10 MiB)?

Or is it a bad idea?
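As a rough illustration of the dynamic calculation suggested above, the following sketch derives a page size from row group statistics. Everything in it is an assumption: the function `rows_per_page` is hypothetical, the byte budget and the cap of 100 rows are placeholder values, and the row group sizes and row counts are made up. Row group sizes on disk are also not the same as the size of the serialized response, so a real implementation would need a conversion factor or a measurement of the encoded cells.

```python
def rows_per_page(row_group_bytes, row_group_rows, max_bytes, max_rows=100):
    """Estimate how many rows fit in max_bytes, using the worst-case average row size.

    row_group_bytes: byte size of each row group (hypothetical input)
    row_group_rows:  number of rows in each row group, same order
    max_bytes:       byte budget for one page (assumed value)
    max_rows:        upper cap on the page size (assumed default)
    """
    if len(row_group_bytes) != len(row_group_rows):
        raise ValueError("row_group_bytes and row_group_rows must have the same length")
    averages = [b / r for b, r in zip(row_group_bytes, row_group_rows) if r > 0]
    if not averages:
        raise ValueError("no non-empty row group")
    worst_average = max(averages)
    # Always return at least one row, and never more than the cap.
    return max(1, min(max_rows, int(max_bytes // worst_average)))


# Made-up example: two row groups of 1000 rows each, the largest about 300 MB.
page_size = rows_per_page(
    row_group_bytes=[300_000_000, 120_000_000],
    row_group_rows=[1000, 1000],
    max_bytes=5_000_000,
)
```

Using the largest average row size is a conservative choice: pages drawn from lighter row groups will be smaller than they could be, but no page should exceed the budget as long as individual rows are close to the average.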

PonteIneptique commented 5 days ago

Hi, I think we just hit the same issue mentioned here: https://huggingface.co/datasets/CATMuS/medieval-segmentation/viewer/default/train