huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

FineWeb: Unexpected end of stream: Page was smaller (1862094) than expected (2055611) #2768

Closed lhoestq closed 6 months ago

lhoestq commented 6 months ago

The config-parquet-metadata job succeeds but the split-first-rows job fails when using compute_first_rows_from_parquet_response.

In the meantime, I set the error code of the config-parquet-metadata response to CachedResponseNotFound so that the split-first-rows job succeeds.

Unfortunately, this workaround causes a ResponseNotFound error when opening page 2 in the viewer, because random access in the Parquet data is not possible without a valid config-parquet-metadata response.
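
To illustrate why random access depends on the metadata: serving an arbitrary page (say rows 100 to 199) without scanning the whole split requires knowing how many rows each row group holds, and that information lives in the Parquet metadata. The sketch below is a simplified illustration, not the dataset-viewer implementation; `locate_row_groups` and its arguments are hypothetical names, and the row-group sizes are assumed to have been read from the metadata beforehand.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_row_groups(row_group_sizes, offset, length):
    """Map a row range onto row groups (hypothetical helper).

    Returns a list of (row_group_index, first_row_in_group, n_rows) tuples
    covering rows [offset, offset + length), clipped to the available rows.
    """
    # Cumulative row counts: ends[i] is the exclusive end row of group i.
    ends = list(accumulate(row_group_sizes))
    total = ends[-1] if ends else 0
    stop = min(offset + length, total)
    if offset >= stop:
        return []
    # First row group whose end lies strictly after the requested offset.
    first = bisect_right(ends, offset)
    slices = []
    for i in range(first, len(ends)):
        group_start = ends[i] - row_group_sizes[i]
        if group_start >= stop:
            break
        lo = max(offset, group_start) - group_start
        hi = min(stop, ends[i]) - group_start
        slices.append((i, lo, hi - lo))
    return slices
```

With row groups of 100, 100 and 50 rows, `locate_row_groups([100, 100, 50], offset=100, length=100)` selects only the second row group, so only that group needs to be fetched. Without the per-group row counts, there is no way to know which byte ranges to read.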

severo commented 6 months ago

The same issue occurs for https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/discussions/4

OSError: Unexpected end of stream: Page was smaller (37332) than expected (542633)

severo commented 6 months ago

The traceback (not necessarily in the right order!)

INFO: 2024-05-03 11:34:08,433 - root - /rows, dataset='prometheus-eval/Feedback-Collection', config='default', split='train', offset=0, length=100
INFO: 2024-05-03 11:34:08,436 - root - Create ParquetIndexWithMetadata for dataset=prometheus-eval/Feedback-Collection, config=default, split=train
INFO: 2024-05-03 11:34:08,436 - root - Query ParquetIndexWithMetadata for dataset=prometheus-eval/Feedback-Collection, config=default, split=train, offset=0, length=100
  File "/src/services/rows/src/rows/routes/rows.py", line 101, in rows_endpoint
  File "pyarrow/_parquet.pyx", line 1388, in pyarrow._parquet.ParquetReader.read_row_group
  File "pyarrow/_parquet.pyx", line 1418, in pyarrow._parquet.ParquetReader.read_row_groups
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
ERROR: 2024-05-03 11:34:08,590 - root - Unexpected error.
  File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 577, in query
  File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 429, in query
    return self.parquet_file.read_row_group(i=self.group_id, columns=columns)
  File "/src/services/rows/.venv/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 490, in read_row_group
Traceback (most recent call last):
    pa_table = rows_index.query(offset=offset, length=length)
    return self.parquet_index.query(offset=offset, length=length)
    [
        row_group_readers[i].read(self.supported_columns)
  File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 140, in read
OSError: Unexpected end of stream: Page was smaller (37332) than expected (542633)
  File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 430, in <listcomp>
    return self.reader.read_row_group(i, column_indices=column_indices,
INFO:     10.0.26.51:34806 - "GET /rows?dataset=prometheus-eval%2FFeedback-Collection&config=default&split=train&offset=0&length=100 HTTP/1.1" 500 Internal Server Error

severo commented 6 months ago

I recreated the dataset viewer for https://huggingface.co/datasets/prometheus-eval/Feedback-Collection, and it works well now. So, I imagine the issue is in the created Parquet files (or in the metadata file).

Note that before recreating, all the cache entries were 200 - OK

severo commented 6 months ago

For FineWeb, the issue only affects the default config (the biggest one): I was able to browse the pages of every other config.

I just tried to recompute config-parquet-metadata for this config, but it crashed. Trying again.

[Screenshot, 2024-05-03 at 14:14:28]

severo commented 6 months ago

I think the issue is that we create more than 10k Parquet (metadata) files, so Parquet file number 10000 is named 0000.parquet and overwrites the existing one.
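
The suspected collision can be sketched as follows. This is an illustration under assumptions, not the project's code: `four_digit_name` is a hypothetical naming scheme that keeps only four digits of the shard index (the exact mechanism that produces `0000.parquet` for file 10000 is not shown in this thread), and `padded_name` is just one possible way to avoid the clash, not necessarily the one the project uses.

```python
from collections import Counter


def four_digit_name(shard_index: int) -> str:
    # Hypothetical scheme: only the last four digits of the index survive,
    # so indices 0 and 10000 map to the same file name.
    return f"{shard_index % 10_000:04d}.parquet"


def padded_name(shard_index: int, num_shards: int) -> str:
    # One possible alternative (an assumption, not the project's fix):
    # widen the zero padding so the largest index still fits.
    width = max(4, len(str(num_shards - 1)))
    return f"{shard_index:0{width}d}.parquet"


def collisions(names):
    # Names produced more than once, i.e. files that would overwrite each other.
    return {name: count for name, count in Counter(names).items() if count > 1}


num_shards = 20_000
clashing = collisions(four_digit_name(i) for i in range(num_shards))
safe = collisions(padded_name(i, num_shards) for i in range(num_shards))
```

Under this assumed scheme, every one of the 10,000 four-digit names is produced twice for 20,000 shards, whereas the wider padding yields no duplicates. An overwritten metadata file would then describe a different Parquet file than the one being read, which is consistent with a page-size mismatch.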

[Screenshot, 2024-05-03 at 14:21:22]

severo commented 6 months ago

We may have to apply the same technique as in https://github.com/huggingface/dataset-viewer/pull/2503 to the Parquet metadata files.

severo commented 6 months ago

FineWeb has more than 20k Parquet files: https://huggingface.co/datasets/HuggingFaceFW/fineweb/tree/refs%2Fconvert%2Fparquet/default

[Screenshot, 2024-05-03 at 14:29:23]

severo commented 6 months ago

Preparing a PR.