huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

KeyError: 'step' when resume from checkpoint #2923

Closed kxhit closed 1 month ago

kxhit commented 2 months ago

System Info

accelerate 0.32.0

Reproduction

My training script works fine with accelerate==0.23.0. With 0.32.0, resuming from a checkpoint (also saved by 0.32.0) fails with this error:

"accelerate/accelerator.py", line 3147, in load_state
    self.step = override_attributes["step"]
KeyError: 'step'

Expected behavior

I believe this line causes the error. In accelerate==0.23.0, there is no "step".

I hope to get a suggestion for avoiding this bug, or to see it fixed.

I downgraded my accelerate to 0.31.0 and it works.

alexanderswerdlow commented 1 month ago

For anyone who has a similar issue: I also found the loading of the internal step to be problematic. Specifically, after wrapping it in a try/except, I found that it succeeds on the master rank but not on the other ranks. This causes the ranks to fall out of sync, in my case with different amounts of gradient accumulation in the first step. Ultimately, this can result in a hang later on.
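A minimal sketch of how the rank mismatch described above could be detected, assuming the restored step value from every rank has already been collected into one list (for example with a collective such as `torch.distributed.all_gather_object`). The function name `find_out_of_sync_ranks` and the use of `None` for "load failed" are hypothetical, not part of accelerate.

```python
from typing import List, Optional, Sequence


def find_out_of_sync_ranks(steps_per_rank: Sequence[Optional[int]]) -> List[int]:
    """Return the ranks whose restored step differs from rank 0.

    `steps_per_rank[i]` is the step restored on rank i, or None if the
    load failed on that rank (an assumption made for this sketch).
    """
    reference = steps_per_rank[0]
    return [
        rank
        for rank, step in enumerate(steps_per_rank)
        if step != reference
    ]


# Hypothetical values: rank 0 restored step 120, ranks 1 and 2 failed to load.
mismatched = find_out_of_sync_ranks([120, None, None, 120])
```

Failing fast when the returned list is non-empty may be preferable to continuing, since the mismatch reported here only showed up later as a hang.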

rahji commented 1 month ago

I'm having the same problem. I created a whole new python environment, used pip3 install --force-reinstall -v "accelerate==0.31.0" to install the older version (followed by datasets, torchvision, diffusers, and tensorboard, in my case). I was able to resume from a checkpoint at that point.

BenjaminBossan commented 1 month ago

Thanks for reporting. As correctly stated, downgrading accelerate is the correct workaround.

This was most likely caused by #2765. IMO it would be best if checkpoints were compatible between accelerate versions, so ideally there is a fix that makes the step key optional. Let's see what @muellerzr thinks about this when he's back in the office.
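A minimal sketch of the "optional step key" idea, assuming the saved attributes arrive as a plain dictionary like the `override_attributes` seen in the traceback. The function `restore_step` and its `current_step` parameter are hypothetical names for illustration; this is not the code from #2765 or from any accelerate release.

```python
from typing import Any, Dict


def restore_step(override_attributes: Dict[str, Any], current_step: int = 0) -> int:
    """Return the saved step if present, otherwise keep the current one.

    Using dict.get avoids the KeyError raised by override_attributes["step"]
    when a checkpoint does not contain a "step" entry.
    """
    return override_attributes.get("step", current_step)


# A checkpoint without "step" falls back to the current value.
step_without_key = restore_step({}, current_step=0)
# A checkpoint with "step" restores the saved value.
step_with_key = restore_step({"step": 7})
```

With a fallback like this, checkpoints that lack the key would still load, though the step counter would then start from whatever default the caller passes in.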

priyammaz commented 1 month ago

I just built my environment, so I was running the newest accelerate version, 0.33.0. I saved a checkpoint with this version, and when trying to load it, it throws the KeyError for "step". I downgraded to 0.31 and it's totally fine now, but I just thought I'd mention that even within the same version of accelerate there may be a slight issue.
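A small sketch of a guard a training script could run before resuming, assuming only the versions named in this thread (0.32.0 and 0.33.0) are treated as affected. Whether other releases are affected is not established here, and the helper names are hypothetical.

```python
from importlib import metadata
from typing import Optional

# Versions reported as failing in this thread; other releases are not covered.
REPORTED_AFFECTED = {"0.32.0", "0.33.0"}


def is_reported_affected(version: str) -> bool:
    """True if this exact version string was reported as failing in the thread."""
    return version in REPORTED_AFFECTED


def installed_accelerate_version() -> Optional[str]:
    """Return the installed accelerate version, or None if it is not installed."""
    try:
        return metadata.version("accelerate")
    except metadata.PackageNotFoundError:
        return None
```

A script might call `installed_accelerate_version()` at startup and print a warning when `is_reported_affected` returns true, before any attempt to resume from a checkpoint.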

rbli-john commented 1 month ago

May I ask whether there is any plan to fix this issue?

Cuberick-Orion commented 3 weeks ago

> I just built my environment, so I was running the newest accelerate version, 0.33.0. I saved a checkpoint with this version, and when trying to load it, it throws the KeyError for "step". I downgraded to 0.31 and it's totally fine now, but I just thought I'd mention that even within the same version of accelerate there may be a slight issue.

This worked for me, same scenario.

simonhessner commented 2 weeks ago

PRs https://github.com/huggingface/accelerate/pull/2992 and https://github.com/huggingface/accelerate/pull/2765 seem to deal with this issue and they have already been merged. As far as I can see they haven't been released in a new version yet.

Does anyone know when the next release will be published?