lhoestq closed 6 months ago
The same issue occurs for https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/discussions/4
OSError: Unexpected end of stream: Page was smaller (37332) than expected (542633)
The traceback (not necessarily in the right order!)
INFO: 2024-05-03 11:34:08,433 - root - /rows, dataset='prometheus-eval/Feedback-Collection', config='default', split='train', offset=0, length=100
INFO: 2024-05-03 11:34:08,436 - root - Create ParquetIndexWithMetadata for dataset=prometheus-eval/Feedback-Collection, config=default, split=train
INFO: 2024-05-03 11:34:08,436 - root - Query ParquetIndexWithMetadata for dataset=prometheus-eval/Feedback-Collection, config=default, split=train, offset=0, length=100
File "/src/services/rows/src/rows/routes/rows.py", line 101, in rows_endpoint
File "pyarrow/_parquet.pyx", line 1388, in pyarrow._parquet.ParquetReader.read_row_group
File "pyarrow/_parquet.pyx", line 1418, in pyarrow._parquet.ParquetReader.read_row_groups
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
ERROR: 2024-05-03 11:34:08,590 - root - Unexpected error.
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 577, in query
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 429, in query
return self.parquet_file.read_row_group(i=self.group_id, columns=columns)
File "/src/services/rows/.venv/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 490, in read_row_group
Traceback (most recent call last):
pa_table = rows_index.query(offset=offset, length=length)
return self.parquet_index.query(offset=offset, length=length)
[
row_group_readers[i].read(self.supported_columns)
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 140, in read
OSError: Unexpected end of stream: Page was smaller (37332) than expected (542633)
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 430, in <listcomp>
return self.reader.read_row_group(i, column_indices=column_indices,
INFO: 10.0.26.51:34806 - "GET /rows?dataset=prometheus-eval%2FFeedback-Collection&config=default&split=train&offset=0&length=100 HTTP/1.1" 500 Internal Server Error
I recreated the dataset viewer for https://huggingface.co/datasets/prometheus-eval/Feedback-Collection, and it works well now. So, I imagine the issue is in the created Parquet files (or in the metadata file).
Note that before recreating, all the cache entries were 200 - OK
For fineweb, the issue only affects the `default` config (the biggest one): I was able to browse the pages of every other config.
I just tried to recompute `config-parquet-metadata` for this config, but it crashed. Trying again.
I think the issue is that we create more than 10k parquet (metadata) files, and thus the parquet file number 10000 is named 0000.parquet, and overwrites the existing one.
Possibly we have to apply the same technique as https://github.com/huggingface/dataset-viewer/pull/2503 for the parquet metadata files
Fineweb has more than 20k parquet files: https://huggingface.co/datasets/HuggingFaceFW/fineweb/tree/refs%2Fconvert%2Fparquet/default
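A minimal sketch of the suspected collision, for illustration only. It assumes the file name keeps just four digits of the shard index; `four_digit_name`, `wide_name` and `find_collisions` are hypothetical helpers, not the dataset-viewer code, and the widened padding is only one possible fix, not necessarily what the linked PR does.

```python
from typing import Callable, List, Tuple


def four_digit_name(index: int) -> str:
    # Hypothetical scheme: only four digits are kept, so the index wraps at 10,000.
    return f"{index % 10_000:04d}.parquet"


def wide_name(index: int, num_files: int) -> str:
    # Hypothetical alternative: pad to as many digits as the largest index needs.
    width = max(4, len(str(num_files - 1)))
    return f"{index:0{width}d}.parquet"


def find_collisions(
    num_files: int, namer: Callable[[int], str]
) -> List[Tuple[int, int, str]]:
    # Return (first_index, clashing_index, name) for every name that is reused.
    seen = {}
    collisions = []
    for index in range(num_files):
        name = namer(index)
        if name in seen:
            collisions.append((seen[name], index, name))
        else:
            seen[name] = index
    return collisions


if __name__ == "__main__":
    num_files = 20_000  # fineweb's default config has more than 20k files
    print(find_collisions(num_files, four_digit_name)[0])
    print(len(find_collisions(num_files, lambda i: wide_name(i, num_files))))
```

Under that assumption, shard 10000 reuses the name of shard 0, so the metadata stored for the first file would describe a different file than the one being read, which would be consistent with the page size mismatch in the traceback.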
preparing a PR
The `config-parquet-metadata` job succeeds, but the `split-first-rows` job fails when using `compute_first_rows_from_parquet_response`. In the meantime, I set the error code in the `config-parquet-metadata` response to `CachedResponseNotFound` to make `split-first-rows` succeed.
Unfortunately, this workaround causes `ResponseNotFound` when opening page 2 in the viewer (random access into the Parquet data is not possible without a valid `config-parquet-metadata` response).
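To illustrate why paging depends on the metadata, here is a rough sketch of how an `offset`/`length` request can be mapped to row groups once the per-row-group row counts are known. `locate_rows` and the sizes used below are made up for the example; this is not the `libcommon` implementation.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def locate_rows(
    row_group_sizes: List[int], offset: int, length: int
) -> Tuple[int, int, int]:
    """Return (first_group, last_group, rows_to_skip) for rows [offset, offset + length)."""
    # Cumulative row counts: ends[i] is the number of rows up to and including group i.
    ends = list(accumulate(row_group_sizes))
    total = ends[-1] if ends else 0
    if offset < 0 or length <= 0 or offset >= total:
        raise IndexError(f"rows [{offset}, {offset + length}) are out of range (total={total})")
    last_row = min(offset + length, total) - 1
    # bisect_right finds the first group whose cumulative end is strictly greater than the row index.
    first_group = bisect_right(ends, offset)
    last_group = bisect_right(ends, last_row)
    start_of_first = ends[first_group] - row_group_sizes[first_group]
    return first_group, last_group, offset - start_of_first


if __name__ == "__main__":
    sizes = [1000, 1000, 500]  # made-up row counts, normally read from the Parquet metadata
    print(locate_rows(sizes, offset=100, length=100))  # page 2 of a 100-row viewer
    print(locate_rows(sizes, offset=950, length=100))  # spans two row groups
```

Without the row counts there is nothing to search, so only the first page (served from the first rows) can be displayed.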