Closed kxhit closed 1 month ago
For anyone who has a similar issue: I also found the loading of the internal step to be problematic. Specifically, after adding a try/catch, I found that it succeeds on the master rank but not on the other ranks. In turn, this causes the ranks to fall out of sync, in my case with different amounts of gradient accumulation in the first step. Ultimately, this can result in a hang later on.
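To illustrate why a step counter that is restored on only one rank can lead to a hang, here is a minimal pure-Python sketch. It uses a simplified model that is an assumption, not accelerate's actual code: each rank increments its counter once per micro-batch and synchronizes gradients whenever the counter is a multiple of the accumulation steps. The function name `first_sync_iteration` and the per-rank values are hypothetical.

```python
def first_sync_iteration(restored_step: int, accumulation_steps: int) -> int:
    """Count the micro-batches a rank processes before its first gradient sync.

    Simplified model (an assumption): the counter is incremented once per
    micro-batch, and a sync happens when it is a multiple of accumulation_steps.
    """
    step = restored_step
    iterations = 0
    while True:
        step += 1
        iterations += 1
        if step % accumulation_steps == 0:
            return iterations


# Hypothetical scenario: rank 0 restored step=6 from the checkpoint,
# while the other ranks hit the error and kept their initial step of 0.
restored_steps = {0: 6, 1: 0, 2: 0}
first_sync = {
    rank: first_sync_iteration(step, accumulation_steps=4)
    for rank, step in restored_steps.items()
}

for rank, iterations in first_sync.items():
    print(f"rank {rank}: first sync after {iterations} micro-batch(es)")
```

In this model rank 0 reaches its first synchronization point earlier than the others, so the ranks enter collective operations at different iterations, which is consistent with the hang described above.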
I'm having the same problem. I created a whole new Python environment and used pip3 install --force-reinstall -v "accelerate==0.31.0" to install the older version (followed by datasets, torchvision, diffusers, and tensorboard, in my case). I was able to resume from a checkpoint at that point.
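If you want a training script to warn before it tries to resume, a small check along these lines could help. This is only a sketch: the set of versions is limited to the ones reported as failing in this thread (0.32.0 and 0.33.0), other releases may or may not be affected, and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Versions reported in this thread as raising KeyError: 'step' on resume.
# This list is an assumption based on the reports here, not an official list.
REPORTED_AFFECTED = {"0.32.0", "0.33.0"}


def installed_accelerate_version() -> Optional[str]:
    """Return the installed accelerate version, or None if it is not installed."""
    try:
        return version("accelerate")
    except PackageNotFoundError:
        return None


def is_reported_affected(version_string: str) -> bool:
    """Check a version string against the versions reported in this thread."""
    return version_string.strip() in REPORTED_AFFECTED


installed = installed_accelerate_version()
if installed is not None and is_reported_affected(installed):
    print(f"accelerate {installed} was reported to fail when resuming; consider 0.31.0")
```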
Thanks for reporting. As correctly stated, downgrading accelerate is the correct workaround.
This was most likely caused by #2765. IMO it would be best if checkpoints were compatible between accelerate versions, so ideally there is a fix that makes the step key optional. Let's see what @muellerzr thinks about this when he's back in office.
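As a rough illustration of what "making the step key optional" could look like, here is a minimal standalone sketch. It is not the actual fix in accelerate; `restore_step` is a hypothetical helper, and only the `override_attributes["step"]` lookup is taken from the traceback in this issue.

```python
def restore_step(override_attributes: dict, current_step: int = 0) -> int:
    """Return the saved step if present, otherwise keep the current value.

    Using dict.get avoids the KeyError that a plain
    override_attributes["step"] lookup raises when the key is missing.
    """
    return override_attributes.get("step", current_step)


# Checkpoint state without a "step" key: fall back to the current counter.
print(restore_step({}, current_step=0))

# Checkpoint state that does contain "step": the saved value is used.
print(restore_step({"step": 12}, current_step=0))
```

Note that every rank would need to take the same branch here; otherwise the ranks can end up with different step values, as described earlier in the thread.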
I just built my environment, so I was running the newest accelerate version, 0.33.0. I saved a checkpoint with this version, and when trying to load it, it throws the KeyError for "step". I downgraded to 0.31 and it's totally fine now, but I just thought I'd mention that even within the same version of accelerate there may be a slight issue.
May I ask whether there is any plan to fix this issue?
This worked for me, same scenario.
PRs https://github.com/huggingface/accelerate/pull/2992 and https://github.com/huggingface/accelerate/pull/2765 seem to deal with this issue and they have already been merged. As far as I can see they haven't been released in a new version yet.
Does anyone know when the next release will be published?
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
My training script works fine with accelerate==0.23.0. When using 0.32.0 and resuming from a checkpoint (also saved with 0.32.0), I get this error:
  "accelerate/accelerator.py", line 3147, in load_state
    self.step = override_attributes["step"]
KeyError: 'step'
Expected behavior
I believe this line causes the error; in accelerate==0.23.0, there is no "step" key.
I hope to get some suggestions for avoiding this bug, or to get it fixed.
I downgraded my accelerate to 0.31.0 and it works.