huggingface / dataset-viewer

Lightweight web API for visualizing and exploring any dataset - computer vision, speech, text, and tabular - stored on the Hugging Face Hub
https://huggingface.co/docs/datasets-server
Apache License 2.0
autoconverted parquet file has too big cells #1957

Open severo opened 9 months ago

severo commented 9 months ago

See https://huggingface.co/datasets/imvladikon/hebrew_speech_coursera/discussions/1#6523d448b623a04e6c2f118a

From the logs, I see this error:

TooBigRows: Rows from parquet row groups are too big to be read: 313.33 MiB (max=286.10 MiB)

It looks like an issue on our side: the row groups in the parquet files at https://huggingface.co/datasets/imvladikon/hebrew_speech_coursera/tree/refs%2Fconvert%2Fparquet/default/train are too big to be read by the API. We'll investigate this. Thanks for reporting.
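For readers trying to interpret the numbers in that message, here is a minimal sketch of a byte-budget check that produces an error in the same format. It is not the libcommon implementation: the `TooBigRows` class below is a stand-in, `check_readable` is a hypothetical helper, and the 300,000,000-byte limit is an assumption inferred only from the fact that 300,000,000 bytes is about 286.10 MiB.

```python
# Assumption: the limit is 300,000,000 bytes, which is about 286.10 MiB.
# This value is inferred from the log message, not read from the service config.
MAX_BYTES = 300_000_000


class TooBigRows(Exception):
    """Stand-in for the error named in the logs (not the libcommon class)."""


def check_readable(num_bytes: int, max_bytes: int = MAX_BYTES) -> None:
    """Raise TooBigRows when the data to read exceeds the byte budget."""
    if num_bytes > max_bytes:
        mib = 1024 * 1024
        raise TooBigRows(
            "Rows from parquet row groups are too big to be read: "
            f"{num_bytes / mib:.2f} MiB (max={max_bytes / mib:.2f} MiB)"
        )
```

Calling `check_readable` with a size under the budget returns silently; a size over the budget raises with both values reported in MiB.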

severo commented 8 months ago

Launched the recreation of imvladikon/hebrew_speech_coursera.

severo commented 8 months ago

-> JobManagerCrashedError 😮

lhoestq commented 8 months ago

UnexpectedApiError for https://huggingface.co/datasets/danielz01/landmarks

libcommon.parquet_utils.TooBigRows: Rows from parquet row groups are too big to be read: 958.13 MiB (max=286.10 MiB)

severo commented 8 months ago

Note that the issue is that the cells are too big (in bytes) and it's not related to the row groups (I was mistaken in the title)

lhoestq commented 8 months ago

Same UnexpectedApiError for https://huggingface.co/datasets/osunlp/Mind2Web, row group is 564MB for 100 rows

severo commented 8 months ago

> row group is 564MB for 100 rows

The issue is that we don't allow big "cells". What should we do? Improve the error message? Allow big cells? Truncate?

lhoestq commented 7 months ago

For the UI the best is to truncate, and a bonus would be to let the user click to expand a row

severo commented 7 months ago

So I think we should add a query parameter, like "full: boolean" or "truncate: boolean", to /rows, /search, and /filter.
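A rough sketch of what such a flag could do on the server side, assuming cells are truncated by byte size and the response reports which columns were cut. The function name `truncate_row`, the `max_cell_bytes` parameter, and the returned list of truncated column names are all hypothetical and only illustrate the idea; nothing here is the existing API.

```python
from typing import Any


def truncate_row(
    row: dict[str, Any], max_cell_bytes: int, truncate: bool = True
) -> tuple[dict[str, Any], list[str]]:
    """Return a copy of the row and the names of the cells that were cut.

    Hypothetical helper: only str and bytes cells are truncated here,
    other cell types are passed through unchanged.
    """
    if not truncate:
        return dict(row), []
    out: dict[str, Any] = {}
    truncated: list[str] = []
    for name, value in row.items():
        if isinstance(value, str):
            data = value.encode("utf-8")
            if len(data) > max_cell_bytes:
                # Drop a partial multi-byte character left at the cut point.
                value = data[:max_cell_bytes].decode("utf-8", errors="ignore")
                truncated.append(name)
        elif isinstance(value, bytes) and len(value) > max_cell_bytes:
            value = value[:max_cell_bytes]
            truncated.append(name)
        out[name] = value
    return out, truncated
```

With `truncate=False` the row comes back untouched, which would correspond to the "full" behaviour; the list of truncated column names is what a UI could use to offer a "click to expand" on a cell.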

severo commented 7 months ago

Also reported here: https://huggingface.co/datasets/UmaDiffusion/ULTIMA/discussions/1

severo commented 7 months ago

Somewhat related: https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae

We should truncate more aggressively, even for /first-rows

mikehemberger commented 7 months ago

Hi, thanks for bringing this up. You are probably aware of this, but once I click on the "Viewer", the data is visible there. Best,

mikehemberger commented 7 months ago

Here is another "raw" image dataset that I've uploaded via the web interface (assuming it was faster than pushing it from a notebook). Hope this helps. Best, M https://huggingface.co/datasets/mikehemberger/medicinal-plants/discussions/2#657c317f1953a4194ad0952d

lhoestq commented 7 months ago

The issue for https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae is about first_rows truncation, not about autoconverted parquet files, no?

maybe open a separate issue

severo commented 7 months ago

yes, I brought the discussion here, but you're right, the issue is somewhat related. Maybe we can fix both at the same time though.

severo commented 7 months ago

Created https://github.com/huggingface/datasets-server/issues/2215 for https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae

severo commented 5 months ago

See here too: https://huggingface.co/datasets/ideepankarsharma2003/MidjourneV6_Image_small/discussions/1

severo commented 5 months ago

Another one: https://huggingface.co/datasets/Libertify/stock-sight/discussions/3

twobob commented 1 week ago

was there any action on this?