eu9ene opened 3 months ago
OK, the hypothesis now is that OpusTrainer shuffles the data on disk and doesn't clean it up when it shuffles again for the next epoch.
However, this student training, for example, doesn't show this behavior. The `fetches` directory takes only 174 GB, so using a compressed corpus will not move the needle. There is approximately the same amount of data split into partitions in /tmp at the beginning of training, likely to be shuffled by OpusTrainer.
https://firefox-ci-tc.services.mozilla.com/tasks/DQeRyr1_TjmXhC0Z-5KWWw/runs/1/logs/public/logs/live.log
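To check whether leftover shuffle files from previous epochs are actually accumulating, something like the sketch below could be run on the worker. It assumes the shuffle partitions live under /tmp (as observed above); the exact file names OpusTrainer uses are not assumed, it just reports the largest files and their age:

```python
#!/usr/bin/env python3
"""Rough check for leftover shuffle files: sum /tmp usage and list the
largest files with their age. The /tmp location is taken from the observation
above; everything else is a guess, not a confirmed OpusTrainer layout."""
import os
import time

TMP_DIR = "/tmp"  # assumed location of the shuffle partitions

entries = []
for root, _dirs, files in os.walk(TMP_DIR):
    for name in files:
        path = os.path.join(root, name)
        try:
            st = os.stat(path)
        except OSError:
            continue  # files may disappear while we scan
        entries.append((st.st_size, time.time() - st.st_mtime, path))

total_gb = sum(size for size, _, _ in entries) / 1e9
print(f"{len(entries)} files, {total_gb:.1f} GB total under {TMP_DIR}")

# Largest files and their age in hours; chunks much older than the current
# epoch would support the "not cleaned up between shuffles" hypothesis.
for size, age, path in sorted(entries, reverse=True)[:20]:
    print(f"{size / 1e9:6.1f} GB  {age / 3600:6.1f} h  {path}")
```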
It seems we should further bump disk space for student training. It's currently using b-linux-v100-gpu-4-1tb.
The student corpus can be enormous, and we decompress it to work with the TSV in OpusTrainer.
OpusTrainer supports gzip, so alternatively we could gzip the corpus before passing it to OpusTrainer, or add support for zstd.
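A minimal sketch of the "gzip the corpus before passing it to OpusTrainer" option is below. The file names and how the trainer config would reference them are hypothetical; the only thing taken from the discussion above is that OpusTrainer can read gzip-compressed datasets:

```python
#!/usr/bin/env python3
"""Stream-compress the decompressed student corpus to gzip so the plain TSV
doesn't have to stay on disk. Paths are hypothetical placeholders."""
import gzip
import shutil

corpus_tsv = "corpus.tsv"        # hypothetical decompressed student corpus
corpus_gz = corpus_tsv + ".gz"   # what the OpusTrainer config would point at instead

# Copy in chunks so the full corpus is never held in memory.
with open(corpus_tsv, "rb") as src, gzip.open(corpus_gz, "wb", compresslevel=6) as dst:
    shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)

print(f"wrote {corpus_gz}; the uncompressed {corpus_tsv} can now be removed")
```

This trades some CPU time during training for disk space; adding zstd support to OpusTrainer would likely be cheaper on CPU but requires an upstream change.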