mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

train-student OSError: [Errno 28] No space left on device #774

Open eu9ene opened 3 months ago

eu9ene commented 3 months ago

https://firefox-ci-tc.services.mozilla.com/tasks/DQeRyr1_TjmXhC0Z-5KWWw/runs/1/logs/public/logs/live.log

It seems we should further bump the disk space for student training. The task is currently using b-linux-v100-gpu-4-1tb.

The student corpus can be enormous, and we decompress it so that OpusTrainer can work with the TSV.

OpusTrainer supports gzip, so alternatively we could either gzip the corpus before passing it to OpusTrainer or add zstd support to OpusTrainer.
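
A minimal sketch of the gzip option, assuming the corpus is already available as a plain TSV file. The function name `gzip_corpus` and the file paths are hypothetical and not part of the pipeline; it only shows that the file can be stream-compressed without loading it into memory.

```python
import gzip
import shutil
from pathlib import Path


def gzip_corpus(src: Path, dst: Path, level: int = 6) -> Path:
    """Stream-compress the TSV at `src` into a gzip file at `dst`.

    Hypothetical helper: data is copied in 1 MiB chunks, so memory use
    stays flat no matter how large the corpus is.
    """
    with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        shutil.copyfileobj(fin, fout, length=1024 * 1024)
    return dst
```

For example, `gzip_corpus(Path("corpus.tsv"), Path("corpus.tsv.gz"))` would write the compressed copy next to the original. Note that the uncompressed file still has to exist on disk while this runs, unless the input is read from a decompression stream instead.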

eu9ene commented 3 months ago

OK, the hypothesis now is that OpusTrainer shuffles the data on disk and doesn't clean up the previous shuffle when it reshuffles for the next epoch.

[Screenshot 2024-07-31 at 1:24:51 PM]

However, this other student training run, for example, doesn't show this behavior.

[Screenshot 2024-07-31 at 1:33:19 PM]

eu9ene commented 3 months ago

The fetches directory takes only 174GB, so using a compressed corpus will not move the needle. Approximately the same amount of data is split into partitions in /tmp at the beginning of training, likely so that OpusTrainer can shuffle it.
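
One way to check the shuffle hypothesis would be to log how much data sits under /tmp over the course of training. The sketch below is an assumption about how that could be measured, not something the pipeline does today; `dir_size_bytes` and `usage_report` are hypothetical names.

```python
import os
import shutil


def dir_size_bytes(root: str) -> int:
    """Sum the apparent sizes of all files under `root`.

    Symlinks are not followed, and files that disappear while we walk
    (temporary shuffle files may be deleted at any time) are skipped.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue
    return total


def usage_report(root: str = "/tmp") -> str:
    """Return a one-line summary of file data under `root` and free space."""
    usage = shutil.disk_usage(root)  # totals for the filesystem holding root
    files_gb = dir_size_bytes(root) / 1e9
    return (
        f"{root}: files={files_gb:.1f}GB, "
        f"free={usage.free / 1e9:.1f}GB of {usage.total / 1e9:.1f}GB"
    )
```

Logging `usage_report("/tmp")` at intervals, for example once per epoch, would show whether the size keeps growing. Steady growth across epochs would be consistent with leftover shuffle files, while a flat line would point elsewhere.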