ModelArts supports automatic training resume, but the way the resume checkpoint is saved needs to change to work with it.
Before: train_resume.ckpt is saved in a different timestamped directory on each launch of resumed training. If training is resumed multiple times, the train_resume.ckpt path (i.e. args.resume) has to be modified manually each time to ensure that the LATEST train_resume.ckpt, not an older one, is loaded.
After: train_resume.ckpt is always saved in args.output_path. Set args.resume to True at the first launch of training; the latest train_resume.ckpt is then loaded correctly even if training is resumed multiple times, without setting args.resume again manually.
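The new behavior can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `resume_ckpt_path` and `find_resume_ckpt` are hypothetical, and only the file name train_resume.ckpt and the two arguments (output_path, resume) come from the description above.

```python
import os
from typing import Optional

# File name taken from the PR description.
RESUME_CKPT_NAME = "train_resume.ckpt"


def resume_ckpt_path(output_path: str) -> str:
    """Fixed location of the resume checkpoint (hypothetical helper).

    The path depends only on output_path, not on a per-launch timestamp,
    so every launch reads and writes the same file.
    """
    return os.path.join(output_path, RESUME_CKPT_NAME)


def find_resume_ckpt(output_path: str, resume: bool) -> Optional[str]:
    """Return the checkpoint to load, or None to start from scratch.

    Hypothetical helper: with resume=True, the first launch finds no file
    and trains from scratch; later automatic relaunches find the file
    written by the previous run.
    """
    if not resume:
        return None
    path = resume_ckpt_path(output_path)
    return path if os.path.isfile(path) else None
```

Because the path is stable across launches, resume only has to be set once; each automatic relaunch overwrites and then reloads the same file.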
[x] Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the documentation guidelines.
[x] Did you build and run the code without any errors?
[x] Did you report the running environment (NPU type/MS version) and performance in the doc? (better record it for data loading, model inference, or training tasks)
What does this PR do?
Adds # (feature)