mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Student training continuation is regressed #893

Closed gregtatum closed 1 month ago

gregtatum commented 1 month ago

It looks like #881 broke training continuation for students. There is some indirection around the train Taskcluster script that I think needs updating. I don't think we have run_task tests for training continuation. My guess is that the cause is some of the argument manipulation in taskcluster/scripts/pipeline/train_taskcluster.py, which I don't understand.

train.py: error: argument --student_model: invalid StudentModel value: 'continue'

gregtatum commented 1 month ago

https://firefox-ci-tc.services.mozilla.com/tasks/Mjiu8_JzTOWdM3BzhZXVcQ/runs/7/logs/public/logs/live.log

eu9ene commented 1 month ago

I'll have a look. Do I understand correctly that we need this working to use preemptible instances? @bhearsum

bhearsum commented 1 month ago

Yeah, we should make sure continuation works if we're using preemptible instances. The other options are to stop using preemptible instances, or to change the pipeline so it doesn't try to continue training (which really isn't a good option...).

It looks like https://github.com/mozilla/firefox-translations-training/blob/b0b5f25d0289a90619a12e645683cfd671332a85/taskcluster/scripts/pipeline/train_taskcluster.py#L35-L38 just needs a bump. Sorry for not catching that in review.
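
For readers unfamiliar with this failure mode, here is a minimal sketch of how a positional index that is off by one can produce an error with the same shape as the one reported above. It is not the code from train_taskcluster.py or train.py: the enum values, the `WRAPPER_ARGS` layout, and the `parse_student_model` helper are all hypothetical and exist only for illustration. The only thing taken from the thread is the flag name `--student_model` and the value `continue`.

```python
import argparse
from enum import Enum


class StudentModel(Enum):
    # Hypothetical values; the real enum in the repository may differ.
    none = "None"
    tiny = "tiny"
    base = "base"


class RaisingParser(argparse.ArgumentParser):
    """Raise instead of exiting so the error message can be inspected."""

    def error(self, message):
        raise ValueError(message)


# Hypothetical positional arguments as a wrapper script might receive them.
# The last entry stands in for a "continue training" mode value.
WRAPPER_ARGS = ["student", "tiny", "continue"]


def parse_student_model(positional, student_idx):
    """Forward positional[student_idx] as --student_model and parse it."""
    parser = RaisingParser(prog="train.py")
    # argparse calls StudentModel(value); an unknown value raises ValueError,
    # which argparse reports as "invalid StudentModel value: ...".
    parser.add_argument("--student_model", type=StudentModel)
    forwarded = ["--student_model", positional[student_idx]]
    return parser.parse_args(forwarded).student_model


if __name__ == "__main__":
    # Correct index: the student model value is forwarded.
    print(parse_student_model(WRAPPER_ARGS, 1))

    # Index off by one: the mode value lands in the --student_model slot.
    try:
        parse_student_model(WRAPPER_ARGS, 2)
    except ValueError as err:
        print(f"train.py: error: {err}")
```

If the real script indexes into its argument list in a similar way, adjusting the index range after an argument is added would be the kind of "bump" described above. A run_task test that exercises the continuation path would catch this class of regression.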