huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Use `HfFileSystem` in config-parquet-metadata step instead of `HttpFileSystem` #2897

Closed by polinaeterna 3 months ago

polinaeterna commented 3 months ago

The `config-parquet-metadata` step is failing again for FineWeb, with errors like:

"Could not read the parquet files: 504, message='Gateway Time-out', url=URL('https://huggingface.co/datasets/HuggingFaceFW/fineweb/resolve/refs%2Fconvert%2Fparquet/default/train-part1/4089.parquet')"

Maybe switching from `HttpFileSystem` to `HfFileSystem` would help: the `config-parquet-and-info` step already uses `HfFileSystem`, and that step works.
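For illustration only, here is a minimal sketch of what reading a Parquet footer through `HfFileSystem` could look like. It is not the dataset-viewer implementation: the helper names `split_resolve_url` and `read_parquet_metadata` are hypothetical, and passing `revision=` through `fs.open` is an assumption about how the converted-Parquet ref would be addressed.

```python
from typing import Optional, Tuple
from urllib.parse import unquote, urlparse


def split_resolve_url(url: str) -> Tuple[str, str, str]:
    """Split a Hub "resolve" URL into (repo path, revision, path in repo).

    Hypothetical helper, not part of dataset-viewer or huggingface_hub.
    """
    path = urlparse(url).path
    repo_part, _, rest = path.partition("/resolve/")
    if not rest:
        raise ValueError(f"not a 'resolve' URL: {url}")
    # The revision is percent-encoded in the URL (refs%2Fconvert%2Fparquet).
    encoded_revision, _, path_in_repo = rest.partition("/")
    return repo_part.lstrip("/"), unquote(encoded_revision), path_in_repo


def read_parquet_metadata(url: str, token: Optional[str] = None):
    """Read only the Parquet footer of a Hub file through HfFileSystem."""
    # Imported here so the URL helper above works without these packages.
    import pyarrow.parquet as pq
    from huggingface_hub import HfFileSystem

    repo, revision, path_in_repo = split_resolve_url(url)
    fs = HfFileSystem(token=token)
    # Assumption: the revision is passed as a keyword instead of being
    # embedded in the path with the "@revision" syntax.
    with fs.open(f"{repo}/{path_in_repo}", "rb", revision=revision) as f:
        return pq.read_metadata(f)
```

For the failing file above, `split_resolve_url` yields the repo path `datasets/HuggingFaceFW/fineweb`, the revision `refs/convert/parquet`, and the file path `default/train-part1/4089.parquet`. Whether this avoids the 504 errors in practice would still need to be checked against the real dataset.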

The previous fix was https://github.com/huggingface/dataset-viewer/pull/2884