huggingface / olm-datasets

Pipeline for pulling and processing online language model pretraining data from the web
Apache License 2.0

Moving large files using `mv`: /bin/mv: Argument list too long #7

Closed spate141 closed 1 year ago

spate141 commented 1 year ago

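I recently processed the entire CC and just wanted to point out that this line will cause your script to crash and, in fact, remove all of the downloaded CC data from the `tmp_download_dir_name` without successfully moving it to `download_dir`. It's a Linux thing: there is a limit on the total size of the arguments passed to a command, so when `mv` is handed thousands of file names at once (typically because a wildcard expanded to all of them), it fails with "Argument list too long" and nothing gets moved.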

A better solution is mentioned here.
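
For anyone who would rather stay in Python than shell out, here is a minimal sketch of one way around the limit. It is not the fix from the link above and not the repo's code: `move_all` is a hypothetical helper, and it assumes the files sit directly inside the temporary directory. It moves entries one at a time, so no long argument list is ever built, and it only removes the source directory once it is empty.

```python
import os
import shutil


def move_all(src_dir, dst_dir):
    """Move every entry in src_dir into dst_dir, one at a time.

    Hypothetical helper for illustration; not part of olm-datasets.
    """
    os.makedirs(dst_dir, exist_ok=True)

    # Collect the paths first so the directory is not modified while
    # it is being scanned.
    with os.scandir(src_dir) as entries:
        paths = [entry.path for entry in entries]

    moved = 0
    for path in paths:
        # shutil.move raises if the move fails, so the loop stops
        # before anything is deleted.
        shutil.move(path, dst_dir)
        moved += 1

    # os.rmdir only succeeds on an empty directory, so leftover data
    # is never removed by accident.
    os.rmdir(src_dir)
    return moved
```

Usage would look something like `move_all(tmp_download_dir_name, download_dir)`. The point of the design is that a failed move raises before any cleanup runs, so the downloaded data stays where it was.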

Hope this saves someone some time in future! ✌🏻