mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Add support for automatically continuing training from earlier runs of a Task (fixes #270) #580

Closed bhearsum closed 5 months ago

bhearsum commented 5 months ago

Aside from the included tests, I tested this by hand. What I did was:

(I ended up canceling run 2 to avoid wasting resources.)

Additional interpretation of the logs and artifacts is welcome; I'm also happy to do more simulated test runs if that's useful.

Landing this depends on an update to the GPU images that includes the latest spot termination handling code. I expect that update to land within the next day.

bhearsum commented 5 months ago

#605 appears to be breaking finetune-student in the PR-triggered tasks :(

I've kicked off two test runs to validate the latest changes:

* https://firefox-ci-tc.services.mozilla.com/tasks/O9QxLh3FQ2uE29lb1y257w is the same as previous runs: I wait for a checkpoint, simulate a spot termination, and verify that training continues. I've done this once so far, and will do it again after the next checkpoint.

The run is now complete, and worked fine with the two simulated spot terminations.
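
For readers following along, here is a rough sketch of the kind of decision being tested: when a task is re-run after a spot termination, pick the most recent earlier run that actually published a checkpoint, and fall back to a fresh start otherwise. This is not the code from this PR. The `Run` class, `pick_resume_run`, and the artifact names are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Run:
    """One attempt at a task (hypothetical model, not the Taskcluster API)."""
    run_id: int
    artifacts: list[str] = field(default_factory=list)


def pick_resume_run(previous_runs: list[Run], required: set[str]) -> Optional[Run]:
    """Return the most recent earlier run that published every required artifact."""
    for run in sorted(previous_runs, key=lambda r: r.run_id, reverse=True):
        if required.issubset(run.artifacts):
            return run
    return None  # nothing usable: training starts from scratch


# Hypothetical checkpoint artifact names, chosen only for this example.
CHECKPOINT_FILES = {"model.npz", "model.npz.optimizer.npz"}

# Run 0 saved a checkpoint before termination; run 1 was terminated before saving anything.
runs = [Run(0, ["model.npz", "model.npz.optimizer.npz"]), Run(1, [])]
chosen = pick_resume_run(runs, CHECKPOINT_FILES)
print(chosen.run_id if chosen else "fresh start")
```

Walking backwards through earlier runs, rather than only checking the immediately preceding one, matters in the scenario above: a run that is terminated before its first checkpoint should not discard the progress of the run before it.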