prpercival opened 9 months ago
I am also quite curious about this; if someone could provide insight, I would appreciate it. I am training on a GPU cluster that runs my code for a limited amount of time; afterward, I have to resubmit the job, reconnect to a node, and resume training. It is inefficient if training restarts from scratch every time I am disconnected. Is there a better way to handle this?
In `finetune.py` there is a section that supports resuming from a checkpoint, but you may note that `resume_from_checkpoint` is set to `False` if `pytorch_model.bin` does not exist, even though it also seems to support a `checkpoint_name` of `adapter_model.bin`. This will cause `finetune.py` to start from scratch even if a seemingly valid `resume_from_checkpoint` argument is supplied.
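For reference, the logic being described looks roughly like the sketch below. This is a reconstruction from the description above, not a verbatim quote of the repository; it assumes `resume_from_checkpoint` holds the checkpoint directory path and `model` is the PEFT-wrapped model, and line numbers such as 200/204/208 refer to `finetune.py` itself.

```python
import os

import torch
from peft import set_peft_model_state_dict

if resume_from_checkpoint:
    # Prefer a full checkpoint saved by the Trainer.
    checkpoint_name = os.path.join(resume_from_checkpoint, "pytorch_model.bin")
    if not os.path.exists(checkpoint_name):
        # Fall back to a LoRA-only checkpoint.
        checkpoint_name = os.path.join(resume_from_checkpoint, "adapter_model.bin")
        # Cleared here even though adapter_model.bin may exist -- the line at issue.
        resume_from_checkpoint = False
    if os.path.exists(checkpoint_name):
        print(f"Restarting from {checkpoint_name}")
        adapters_weights = torch.load(checkpoint_name)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        print(f"Checkpoint {checkpoint_name} not found")
```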
If I move line 200 into the if/else on line 204 and line 208, it will resume my fine-tune from my `adapter_model.bin` as expected. Is there something I'm missing here? Should I not be resuming from certain checkpoints?
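Assuming line 200 is the `resume_from_checkpoint = False` assignment, one way to read the change described above is to clear the flag only when no checkpoint file is found at all, e.g.:

```python
    if os.path.exists(checkpoint_name):
        print(f"Restarting from {checkpoint_name}")
        adapters_weights = torch.load(checkpoint_name)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        print(f"Checkpoint {checkpoint_name} not found")
        # Moved from above: only give up on resuming when nothing exists.
        resume_from_checkpoint = False
```

With this arrangement an existing `adapter_model.bin` no longer causes `resume_from_checkpoint` to be cleared; whether it is actually safe to resume from an adapter-only checkpoint is exactly what the question above asks.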