kohya-ss / sd-scripts

Failed to resume from state #1524

Open tristanwqy opened 1 month ago

tristanwqy commented 1 month ago
load train state from /home/ubuntu/MyFiles/xiaodi/training/output/flux/sd-scripts/flux_lora_200k/flux_lora_200k-step00002250-state/train_state.json: {'current_epoch': 1, 'current_step': 2250}
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/sd-scripts/flux_train_network.py", line 411, in <module>
[rank0]:     trainer.train(args)
[rank0]:   File "/home/ubuntu/sd-scripts/train_network.py", line 663, in train
[rank0]:     train_util.resume_from_local_or_hf_if_specified(accelerator, args)
[rank0]:   File "/home/ubuntu/sd-scripts/library/train_util.py", line 4260, in resume_from_local_or_hf_if_specified
[rank0]:     accelerator.load_state(args.resume)
[rank0]:   File "/home/ubuntu/miniconda3/envs/sd-scripts/lib/python3.11/site-packages/accelerate/accelerator.py", line 3145, in load_state
[rank0]:     override_attributes = load_accelerator_state(
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/miniconda3/envs/sd-scripts/lib/python3.11/site-packages/accelerate/checkpointing.py", line 208, in load_accelerator_state
[rank0]:     load_model(model, input_model_file, device=str(map_location), **load_model_func_kwargs)
[rank0]: TypeError: load_model() got an unexpected keyword argument 'device'

This is caused by the outdated pin safetensors==0.4.2 in requirements.txt. safetensors added the device argument to load_model() in this commit: https://github.com/huggingface/safetensors/commit/ff643a874414bf976ebe6857c59320f1e8f4e4b4. Upgrading to safetensors==0.4.4 solved the problem.
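For anyone who wants to confirm this before resuming, here is a minimal sketch that checks whether the installed safetensors.torch.load_model accepts a device keyword. The helper names accepts_keyword and check_safetensors_load_model are hypothetical (they are not part of sd-scripts, accelerate, or safetensors), and the check only inspects the function signature; it does not load any checkpoint.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    # An explicitly named parameter that may be passed by keyword.
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs catch-all also accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def check_safetensors_load_model():
    """Return True/False for the installed safetensors, or None if it cannot be imported."""
    try:
        from safetensors.torch import load_model
    except ImportError:
        return None
    return accepts_keyword(load_model, "device")


if __name__ == "__main__":
    result = check_safetensors_load_model()
    if result is None:
        print("safetensors.torch could not be imported")
    elif result:
        print("load_model() accepts 'device'")
    else:
        print("load_model() does not accept 'device'; the TypeError above is expected")
```

If the last message is printed, the installed safetensors predates the commit linked above.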

Another error that occurred is:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/ubuntu/sd-scripts/flux_train_network.py", line 411, in <module>
[rank2]:     trainer.train(args)
[rank2]:   File "/home/ubuntu/sd-scripts/train_network.py", line 663, in train
[rank2]:     train_util.resume_from_local_or_hf_if_specified(accelerator, args)
[rank2]:   File "/home/ubuntu/sd-scripts/library/train_util.py", line 4260, in resume_from_local_or_hf_if_specified
[rank2]:     accelerator.load_state(args.resume)
[rank2]:   File "/home/ubuntu/miniconda3/envs/sd-scripts/lib/python3.11/site-packages/accelerate/accelerator.py", line 3156, in load_state
[rank2]:     self.step = override_attributes["step"]

Downgrading accelerate to accelerate==0.31.0 solved this problem.
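A small sketch for comparing an environment against the two versions that worked for me. The REPORTED_WORKING table only records the versions mentioned in this report; it is an assumption for illustration, not an official compatibility matrix, and other versions may work as well. The function names are hypothetical.

```python
from importlib import metadata

# Versions reported as working in this issue (assumption: taken from this report only).
REPORTED_WORKING = {"safetensors": "0.4.4", "accelerate": "0.31.0"}


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def diff_from_reported(installed, reported=REPORTED_WORKING):
    """Return {name: (installed, reported)} for packages whose versions differ."""
    return {
        name: (installed.get(name), want)
        for name, want in reported.items()
        if installed.get(name) != want
    }


if __name__ == "__main__":
    current = installed_versions(REPORTED_WORKING)
    for name, (have, want) in diff_from_reported(current).items():
        print(f"{name}: installed {have}, reported working {want}")
```

A difference from these versions does not by itself mean resume will fail; it only shows where the environment departs from the one in which resuming worked.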

kohya-ss commented 1 month ago

Thank you! I've updated requirements.txt for the first issue. I will investigate the 2nd issue.

kohya-ss commented 1 month ago

The second issue doesn't seem to happen to me, any idea what could be causing it?

tristanwqy commented 1 month ago

> The second issue doesn't seem to happen to me, any idea what could be causing it?

The cause is the version of accelerate. I fixed it by downgrading to accelerate==0.31.0, following an issue in the accelerate library: https://github.com/huggingface/accelerate/issues/2923