Open jbcdnr opened 1 year ago
I've also run into this issue recently and wanted to ask whether the maintainers have seen it.
@albertvillanova for visibility - I'm not sure who the right person is to tag, but I noticed you were active recently so perhaps you can direct this to the right person.
Thanks!
Describe the bug
Since updating past 2.13 (we are now on 2.14.5), loading our parquet files from GCS has become very slow (more than 30 min, versus about 3 s before). Our GCS bucket holds many objects, and resolving globs against it is very slow. I tracked the problem down to this change: https://github.com/huggingface/datasets/blame/bade7af74437347a760830466eb74f7a8ce0d799/src/datasets/data_files.py#L348 The underlying implementation with gcsfs is really slow. Could you go back to the old behavior when we simply pass the parquet files and no glob pattern?
Thank you.
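To illustrate the request, here is a rough sketch of the kind of check I have in mind. This is not the actual datasets code; `is_glob_pattern` and `split_explicit_and_patterns` are hypothetical helpers, and treating `*`, `?` and `[` as glob markers is a simplifying assumption (a real file name could contain a literal `[`).

```python
import re

# Characters that usually indicate a glob pattern (assumption: heuristic only).
_GLOB_CHARS = re.compile(r"[*?\[]")


def is_glob_pattern(path: str) -> bool:
    """Return True if the path contains a typical glob wildcard character."""
    return _GLOB_CHARS.search(path) is not None


def split_explicit_and_patterns(data_files):
    """Separate plain file paths from glob patterns.

    Only the patterns would need to be resolved against the bucket;
    the explicit paths could be used as given.
    """
    explicit = [p for p in data_files if not is_glob_pattern(p)]
    patterns = [p for p in data_files if is_glob_pattern(p)]
    return explicit, patterns
```

With a check like this, a list made only of explicit parquet paths would never trigger a listing of the whole bucket.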
Steps to reproduce the bug
Load a parquet dataset from a GCS bucket that contains many objects, passing the parquet file paths explicitly (no glob pattern).
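A minimal sketch of how the load can be reproduced and timed. The bucket name, prefix and shard naming below are placeholders, not our real layout, and `build_data_files` and `time_load` are hypothetical helpers written for this report. It assumes `datasets` and `gcsfs` are installed and that GCS credentials are configured.

```python
import time


def build_data_files(bucket: str, prefix: str, n_shards: int) -> list:
    """Build explicit gs:// paths for the parquet shards (no glob pattern)."""
    return [
        f"gs://{bucket}/{prefix}/part-{i:05d}.parquet" for i in range(n_shards)
    ]


def time_load(data_files):
    """Load the parquet files with datasets and return (dataset, seconds)."""
    # Imported here so the path helper works without datasets installed.
    from datasets import load_dataset

    start = time.perf_counter()
    ds = load_dataset("parquet", data_files=data_files, split="train")
    return ds, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder bucket and prefix: replace with a bucket holding many objects.
    files = build_data_files("my-bucket", "my-dataset", n_shards=4)
    _, seconds = time_load(files)
    print(f"load_dataset took {seconds:.1f}s")
```

Running the same script under 2.13 and 2.14.5 against the same bucket should show the difference.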
Expected behavior
Loading should be fast, as it was in 2.13 (about 3 s).
Environment info
datasets==2.14.5