mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

train-student OSError: [Errno 28] No space left on device #774

Open eu9ene opened 1 month ago

eu9ene commented 1 month ago

https://firefox-ci-tc.services.mozilla.com/tasks/DQeRyr1_TjmXhC0Z-5KWWw/runs/1/logs/public/logs/live.log

It seems we should further bump disk space for student training. It's currently using b-linux-v100-gpu-4-1tb.

The student corpus can be enormous, and we decompress it to work with the TSV in OpusTrainer.

OpusTrainer supports gzip, so alternatively we can gzip the corpus before passing it to OpusTrainer or add support for zstd.
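A minimal sketch of the gzip option, using only the Python standard library. It stream-compresses an already decompressed TSV so the plain file can be removed afterwards. The function name `gzip_corpus` and the chunk size are illustrative assumptions, not part of the pipeline or of OpusTrainer.

```python
import gzip
import shutil
from pathlib import Path


def gzip_corpus(tsv_path: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream-compress a TSV corpus to '<name>.gz' and return the new path.

    Hypothetical helper: the data is copied in fixed-size chunks, so the
    corpus is never held in memory as a whole.
    """
    gz_path = tsv_path.with_name(tsv_path.name + ".gz")
    with open(tsv_path, "rb") as src, gzip.open(gz_path, "wb", compresslevel=6) as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return gz_path
```

The original TSV still has to be deleted by the caller once the `.gz` file is written, otherwise both copies occupy the disk at the same time.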

eu9ene commented 1 month ago

OK, the hypothesis now is that OpusTrainer shuffles data on disk and doesn't clean up the previous shuffled files when it shuffles again for the next epoch.

Screenshot 2024-07-31 at 1 24 51 PM

However, for example, this student training doesn't have this behavior.

Screenshot 2024-07-31 at 1 33 19 PM

eu9ene commented 1 month ago

The fetches directory takes only 174GB, so using a compressed corpus will not move the needle. Approximately the same amount of data is split into partitions in /tmp at the beginning of training, likely to be shuffled by OpusTrainer.
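To check whether /tmp keeps growing from epoch to epoch, the directory sizes can be sampled during training. The sketch below is one way to do that with the standard library; `dir_size_bytes` and `report` are hypothetical helper names, and the path of the fetches directory has to be supplied by the caller since it is not given here.

```python
import os
import shutil


def dir_size_bytes(root: str) -> int:
    """Sum the apparent sizes of all files under root (symlinks not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size
            except OSError:
                # Temporary files can disappear while we are walking the tree.
                pass
    return total


def report(paths):
    """Print used size per directory and free space on its filesystem, in GB."""
    for path in paths:
        used_gb = dir_size_bytes(path) / 1e9
        free_gb = shutil.disk_usage(path).free / 1e9
        print(f"{path}: {used_gb:.1f} GB used, {free_gb:.1f} GB free on device")
```

Calling `report(["/tmp", fetches_dir])` at intervals, where `fetches_dir` is wherever the fetches are mounted on the worker, would show whether the partitions in /tmp are cleaned up between epochs. The numbers are apparent file sizes, so they can differ somewhat from what `du` reports for sparse files.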