Closed bhearsum closed 5 months ago
I've kicked off two test runs to validate the latest changes:
* https://firefox-ci-tc.services.mozilla.com/tasks/O9QxLh3FQ2uE29lb1y257w is the same as previous runs: I wait for a checkpoint, simulate a spot termination, and verify that training continues. I've done this once so far, and will do it again after the next checkpoint.
The run is now complete, and worked fine with the two simulated spot terminations.
Aside from the included tests, I tested this by hand. What I did was:
train-backwards job: https://firefox-ci-tc.services.mozilla.com/tasks/bNP6s4FaRwaU7Bz8gxgJ-Q/runs/0
artifacts directory when we continue training, so even though this run didn't checkpoint, it essentially just re-uploaded run #0's work). (I ended up canceling run 2 to avoid wasting resources.)
Additional interpretation of the logs and artifacts is welcome; I'm also happy to do more simulated test runs if that's useful.
Landing this depends on an update to the GPU images that includes the latest spot termination handling code. I expect this to land in the next day.