huggingface / dataset-viewer

Lightweight web API for visualizing and exploring any dataset - computer vision, speech, text, and tabular - stored on the Hugging Face Hub
https://huggingface.co/docs/datasets-server
Apache License 2.0
640 stars 65 forks

Add retry mechanism to get_parquet_file in parquet metadata step #2884

Closed polinaeterna closed 1 month ago

polinaeterna commented 1 month ago

...to see if it helps with the FineWeb config-parquet-metadata issue.

Currently the error just says "Server disconnected", which seems to be an aiohttp.ServerDisconnectedError.

If that works, a more fundamental solution would be to switch completely to HfFileSystem instead of HTTPFileSystem and remove the retries.
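
For readers following along, here is a minimal sketch of the kind of retry wrapper discussed above. It is not the code from this PR: `retry_call` and its parameters are hypothetical names chosen for illustration, and the built-in `ConnectionError` stands in for aiohttp.ServerDisconnectedError so the example has no third-party dependencies.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_call(
    func: Callable[[], T],
    exceptions: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on the given exceptions with exponential backoff.

    Hypothetical helper for illustration only; in the real step the retried
    call would be the one that opens a parquet file over HTTP.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except exceptions:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A caller would wrap the flaky call in a zero-argument function, for example `retry_call(lambda: open_file(url))`, where `open_file` is whatever function may raise the disconnect error. With many files, each retry adds waiting time, so the total run can become noticeably slower.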

polinaeterna commented 1 month ago

seems that it helped! but it was slow indeed :( almost half an hour

lhoestq commented 1 month ago

Great ! Given the number of files it's not absurd ^^