**Open** - lvwerra opened this issue 2 years ago
> using command line `git-lfs` - [...] 300MB/s!

Which server location did you upload from?

From GCP region `us-central1-a`.
The most surprising part to me is the saving time. Wondering if it could be due to compression (`ParquetWriter` uses SNAPPY compression by default; it can be turned off with `to_parquet(..., compression=None)`).
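SNAPPY itself is not in the Python standard library, but the size-vs-CPU tradeoff behind that suggestion can be illustrated with stdlib `gzip` on synthetic records (the records and sizes here are assumptions for illustration; the real pipeline writes Parquet):

```python
import gzip
import json
import os
import tempfile
import time

# Hypothetical records standing in for dataset rows (not the real data).
records = [{"id": i, "text": "some code file contents " * 50} for i in range(2000)]
payload = "\n".join(json.dumps(r) for r in records).encode()

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "shard.json")
    gz_path = os.path.join(tmp, "shard.json.gz")

    # Uncompressed write: pure disk I/O.
    t0 = time.perf_counter()
    with open(raw_path, "wb") as f:
        f.write(payload)
    raw_time = time.perf_counter() - t0

    # Compressed write: extra CPU work per byte, smaller file on disk.
    t0 = time.perf_counter()
    with gzip.open(gz_path, "wb") as f:
        f.write(payload)
    gz_time = time.perf_counter() - t0

    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)
    print(f"raw: {raw_size} B in {raw_time:.4f}s, gzip: {gz_size} B in {gz_time:.4f}s")
```

Whether disabling compression helps depends on whether the bottleneck is CPU (compression wins you nothing) or disk bandwidth (smaller files can actually save faster).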
+1 to what @mariosasko mentioned. Also, @lvwerra, did you parallelize `to_parquet` using a similar approach to #2747? (We used multiprocessing at the shard level.) I'm working on a similar PR to add `multi_proc` in `to_parquet`, which might give you a further speed-up.

Stas benchmarked his approach and mine in this gist for the lama dataset when we were working on adding `multi_proc` support for `to_json`.
@mariosasko I did not turn it off, but I can try that next time - I have to run the pipeline again anyway.
@bhavitvyamalik Yes, I also sharded the dataset and used multiprocessing to save each shard. I'll have a closer look at your approach, too.
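The shard-level multiprocessing pattern mentioned above can be sketched as follows. This is a minimal sketch: it writes JSON-lines shards instead of Parquet so it stays stdlib-only, and the names `save_shard`/`save_parallel` are mine, not from `datasets`:

```python
import json
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor

def save_shard(args):
    # Serialize one shard to its own file; a stand-in for a per-shard
    # Dataset.to_parquet call.
    shard, path = args
    with open(path, "w") as f:
        for row in shard:
            f.write(json.dumps(row) + "\n")
    return path

def save_parallel(rows, out_dir, num_shards=4):
    # Split rows round-robin into shards and save each in its own process.
    shards = [rows[i::num_shards] for i in range(num_shards)]
    jobs = [(shard, os.path.join(out_dir, f"shard-{i:05d}.jsonl"))
            for i, shard in enumerate(shards)]
    with ProcessPoolExecutor(max_workers=num_shards) as pool:
        return list(pool.map(save_shard, jobs))

if __name__ == "__main__":
    rows = [{"id": i} for i in range(100)]
    with tempfile.TemporaryDirectory() as tmp:
        paths = save_parallel(rows, tmp)
        print(f"wrote {len(paths)} shards")
```

As the issue notes below, past a point more workers stop helping because the disk, not the CPU, becomes the limit.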
Is there a way to read from the cache files directly as a dataset in its own right?
# Performance of `datasets` at 1TB scale

## What is this?

During the processing of a large dataset I monitored the performance of the `datasets` library to see if there are any bottlenecks. The insights of this analysis could guide the decision making to improve the performance of the library.

## Dataset
The dataset is a 1.1TB extract from GitHub with 120M code files and is stored as 5000 `.json.gz` files. The goal of the preprocessing is to remove duplicates and filter files based on their stats. While the calculation of the hashes for deduplication and the stats for filtering can be parallelized, the filtering itself is run with a single process. After processing, the files are pushed to the hub.
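The hash-based deduplication step can be sketched as follows. This is a minimal sketch on toy strings with exact-match MD5 hashing; the issue does not specify the actual hash function or matching strategy used:

```python
import hashlib

def dedup(files):
    # Keep the first occurrence of each unique content hash.
    seen = set()
    unique = []
    for name, content in files:
        digest = hashlib.md5(content.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, content))
    return unique

files = [
    ("a.py", "print('hello')"),
    ("b.py", "print('world')"),
    ("copy_of_a.py", "print('hello')"),  # exact duplicate of a.py
]
print([name for name, _ in dedup(files)])  # → ['a.py', 'b.py']
```

Hashing each file is embarrassingly parallel, which is why this step can be spread over many workers while the final filtering pass stays single-process.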
## Machine

The experiment was run on an `m1` machine on GCP with 96 CPU cores and 1.3TB of RAM.

## Performance breakdown
- Loading and saving data: ~8.5h
- Processing: ~2h
- Pushing to the hub (`Repository.git_push()`): ~0.5h

## Conclusion
It appears that loading and saving the data is the main bottleneck at that scale (8.5h), whereas processing (2h) and pushing the data to the hub (0.5h) are relatively fast. To optimize the performance at this scale it would make sense to consider such an end-to-end example and target the bottlenecks, which seem to be loading from and saving to disk. The processing itself seems to run relatively fast.
## Notes

- Sometimes a `*.lock` file is left in the data folder (or multiple, e.g. `*.lock.lock` when it happened several times). This causes the cache not to be triggered (which is significant at that scale) - I guess because there are new data files.
- Parallelizing `to_parquet` decreased the saving time from 17h to 5h; however, adding more workers at this point had almost no effect. Not sure if this is: a) a bug in my parallelization logic, b) an I/O limit to load data from disk to memory, or c) an I/O limit to write from memory to disk.
- `Repository.git_push()` was much slower than using command line `git-lfs` - 10-20MB/s vs. 300MB/s! The `Dataset.push_to_hub()` function is even slower as it only uploads one file at a time with only a few MB/s, whereas `Repository.git_push()` pushes files in parallel (each at a similar speed).

cc @lhoestq @julien-c @LysandreJik @SBrandeis
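The serial-vs-parallel upload difference described in the last note can be illustrated with a sketch. The `upload_file` stub here is hypothetical (a `sleep` standing in for network latency); only the concurrency pattern is from the issue:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def upload_file(name, delay=0.05):
    # Stub standing in for a real upload; the sleep models network latency.
    time.sleep(delay)
    return name

files = [f"shard-{i:05d}.parquet" for i in range(16)]

# Serial: one file at a time, as Dataset.push_to_hub() is described above.
t0 = time.perf_counter()
for f in files:
    upload_file(f)
serial = time.perf_counter() - t0

# Parallel: all files in flight at once, as Repository.git_push() is described.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:
    done = list(pool.map(upload_file, files))
parallel = time.perf_counter() - t0

print(f"serial {serial:.2f}s vs parallel {parallel:.2f}s")
```

With latency-bound transfers the serial wall-clock time grows with the file count, while the parallel version stays close to the slowest single transfer.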