remove sacrebleu datasets with "/" in the name (I'm still using the "/dev" ones sometimes)
reduce the number of validation datasets to no more than 5-6
remove medical validation/test dataset mtdata_Lindat-khresmoi_summary
switch stage to "train-teacher"
switch teacher ensemble to 1
use pre-trained student models where we have them
bump early stopping for en-ru to 30, as we had issues with this language pair in the past and longer training might help
disabled NLLB for back-translations after inspecting the ru data. It looks way too noisy. Unlike HPLT, it didn't go through monocleaner, and I'm not sure our simple cleaning rules can handle that much noise
added uk-en (we need to retrain this one with our pipeline)
remove the smallest datasets if there are more than 100 of them (Taskcluster limitation)
removed mtdata datasets with long names (Taskcluster label length limitation)
I moved the autogenerated configs to the folder autogenerated and the manually edited ones to spring-2024.
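
A rough sketch of the two dataset pruning rules above (the 100-dataset cap and the label length limit). This is not the actual config generator code: the function name `prune_datasets`, the `max_name_len` value of 80 and the per-dataset sizes are all assumptions made up for illustration. I don't state the real Taskcluster label limit here.

```python
def prune_datasets(sizes, max_count=100, max_name_len=80):
    """Return the dataset names to keep, sorted alphabetically.

    sizes: dict mapping dataset name -> size (e.g. number of sentence pairs).
    max_count: cap on the number of datasets (100 in this PR).
    max_name_len: hypothetical stand-in for the label length limit.
    """
    # Rule 1: drop datasets whose names are too long to fit in a task label.
    kept = {name: size for name, size in sizes.items() if len(name) <= max_name_len}
    # Rule 2: if too many remain, keep only the largest max_count datasets.
    largest = sorted(kept, key=kept.get, reverse=True)[:max_count]
    return sorted(largest)


if __name__ == "__main__":
    # Made-up names and sizes, only to show the behaviour.
    example = {
        "opus_a": 10,
        "opus_b": 5,
        "opus_c": 1,
        "mtdata_" + "x" * 100: 1000,
    }
    print(prune_datasets(example, max_count=2))
```

Dropping by name length first means a large dataset with an overlong name never takes one of the 100 slots.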
Fixes:
[skip ci]