huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Performance of `datasets` at scale #3735

Open lvwerra opened 2 years ago

lvwerra commented 2 years ago

Performance of `datasets` at 1TB scale

What is this?

During the processing of a large dataset, I monitored the performance of the `datasets` library to see if there are any bottlenecks. The insights from this analysis could guide decision-making on improving the performance of the library.

Dataset

The dataset is a 1.1TB extract from GitHub with 120M code files, stored as 5000 `.json.gz` files. The goal of the preprocessing is to remove duplicates and filter files based on their stats. While calculating the hashes for deduplication and the stats for filtering can be parallelized, the filtering itself runs in a single process. After processing, the files are pushed to the Hub.
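The pipeline shape described above (parallel hashing, then a single-process duplicate check) can be sketched with the stdlib alone. The helper names here are hypothetical; the actual run used `datasets`' map/filter machinery rather than this code:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def content_hash(text: str) -> str:
    # Hash of the file contents; exact duplicates share the same digest.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def deduplicate(files: list[str], workers: int = 8) -> list[str]:
    # Hashing is embarrassingly parallel (a process pool in practice,
    # since it is CPU-bound); the duplicate check itself is a single
    # sequential pass, which is why it runs in one process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hashes = list(pool.map(content_hash, files))
    seen, kept = set(), []
    for text, digest in zip(files, hashes):
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept
```

For example, `deduplicate(["print(1)", "print(2)", "print(1)"])` keeps only the first copy of each file.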

Machine

The experiment was run on an m1 machine on GCP with 96 CPU cores and 1.3TB of RAM.

Performance breakdown

Conclusion

It appears that loading and saving the data are the main bottlenecks at this scale (8.5h), whereas processing (2h) and pushing the data to the Hub (0.5h) are relatively fast. To optimize performance at this scale, it would make sense to work from such an end-to-end example and target the main bottlenecks, which appear to be loading from and saving to disk. The processing itself runs comparatively fast.

Notes

cc @lhoestq @julien-c @LysandreJik @SBrandeis

julien-c commented 2 years ago

> using command line git-lfs - [...] 300MB/s!

which server location did you upload from?

lvwerra commented 2 years ago

From GCP region us-central1-a.

mariosasko commented 2 years ago

The most surprising part to me is the saving time. Wondering if it could be due to compression (`ParquetWriter` uses Snappy compression by default; it can be turned off with `to_parquet(..., compression=None)`).

bhavitvyamalik commented 2 years ago

+1 to what @mariosasko mentioned. Also, @lvwerra, did you parallelize `to_parquet` using an approach similar to the one in #2747? (We used multiprocessing at the shard level.) I'm working on a similar PR to add `multi_proc` to `to_parquet`, which might give you a further speed-up. Stas benchmarked his approach and mine in this gist for the lama dataset when we were working on adding `multi_proc` support for `to_json`.

lvwerra commented 2 years ago

@mariosasko I did not turn it off, but I can try that next time - I have to run the pipeline again anyway.

@bhavitvyamalik Yes, I also sharded the dataset and used multiprocessing to save each shard. I'll have a closer look at your approach, too.
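The shard-and-save pattern discussed here can be sketched with the stdlib as follows. The helpers are hypothetical stand-ins; the real pipeline used `datasets.Dataset.shard` and saved each shard from a worker process:

```python
import gzip
import json
import os
from concurrent.futures import ThreadPoolExecutor


def save_shard(args):
    records, path = args
    # Each worker writes one shard independently; no coordination needed.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path


def save_sharded(records, out_dir, num_shards=4, workers=4):
    # Round-robin sharding: shard i gets records i, i+num_shards, ...
    shards = [
        (records[i::num_shards], os.path.join(out_dir, f"shard-{i:05d}.json.gz"))
        for i in range(num_shards)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(save_shard, shards))
```

Since the shards never touch each other's output files, the write throughput scales with the number of workers until the disk itself saturates - consistent with I/O being the bottleneck observed above.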

JulesGM commented 2 weeks ago

Is there a way to read from the cache files directly as a dataset in its own right?