huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.18k stars, 2.67k forks

Loading dataset from large GCS bucket very slow since 2.14 #6323

Open jbcdnr opened 1 year ago

jbcdnr commented 1 year ago

Describe the bug

Since updating to datasets >2.14, access to our parquet files on GCS has become very slow when loading a dataset (>30 min vs. 3 s). Our GCS bucket has many objects, and resolving globs is very slow. I tracked the problem down to this change: https://github.com/huggingface/datasets/blame/bade7af74437347a760830466eb74f7a8ce0d799/src/datasets/data_files.py#L348 The underlying implementation with gcsfs is really slow. Could you restore the previous behavior when we pass only explicit parquet file paths and no glob pattern?

Thank you.

Steps to reproduce the bug

Load a dataset from a GCS bucket that has many files.
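A minimal reproduction sketch, assuming `gcsfs` is installed and credentials are configured. The bucket and object names in the commented usage are hypothetical placeholders, and `time_call` and `load_parquet_from_gcs` are illustrative helpers, not part of `datasets`:

```python
import time
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def time_call(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def load_parquet_from_gcs(paths: List[str]):
    """Load explicit parquet files (no glob pattern) with the parquet builder."""
    # Imported lazily so the timing helper works without datasets installed.
    from datasets import load_dataset

    return load_dataset("parquet", data_files=paths, split="train")


# Usage (hypothetical paths; needs network access to a large bucket):
# files = [
#     "gs://my-bucket/data/part-00000.parquet",
#     "gs://my-bucket/data/part-00001.parquet",
# ]
# ds, seconds = time_call(lambda: load_parquet_from_gcs(files))
# print(f"loaded {ds.num_rows} rows in {seconds:.1f}s")
```

Running the commented usage under 2.13 and again under 2.14.5 against the same bucket should make the difference in resolution time visible.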

Expected behavior

Loading should be fast, as it was in 2.13 (about 3 s).
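The fast path requested in the description could look roughly like the sketch below. This is not the `datasets` implementation: `has_glob_magic` and `resolve_data_files` are hypothetical names, and `glob_fn` stands in for whatever filesystem glob is in use (for example the `glob` method of an fsspec filesystem). The idea is that only entries containing glob characters trigger a bucket listing, while explicit paths pass through untouched.

```python
import re
from typing import Callable, Iterable, List

# Characters that make a path a glob pattern rather than a literal path.
_GLOB_MAGIC = re.compile(r"[*?\[]")


def has_glob_magic(path: str) -> bool:
    """Return True if the path contains glob wildcard characters."""
    return _GLOB_MAGIC.search(path) is not None


def resolve_data_files(
    paths: Iterable[str],
    glob_fn: Callable[[str], Iterable[str]],
) -> List[str]:
    """Expand glob patterns with glob_fn; keep explicit paths as given.

    Assumption: explicit paths are trusted to exist, so no listing
    request is made for them.
    """
    resolved: List[str] = []
    for path in paths:
        if has_glob_magic(path):
            # Only patterns pay the cost of listing the bucket.
            resolved.extend(sorted(glob_fn(path)))
        else:
            resolved.append(path)
    return resolved
```

The trade-off is that a mistyped explicit path is only detected later, when the file is actually opened, instead of during resolution.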

Environment info

datasets==2.14.5

connermanuel commented 1 month ago

I've also run into this issue recently and would like to ask whether the maintainers have seen it.

@albertvillanova for visibility - I'm not sure who the right person is to tag, but I noticed you were active recently so perhaps you can direct this to the right person.

Thanks!