Open BootsofLagrangian opened 6 months ago
Issue is updated!
After working around the problem above, I hit another error:
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
Same Environment, but
Related Issue : deepspeed strategy can't save checkpoint, TypeError: cannot pickle torch._C._distributed_c10d.ProcessGroup object
I've updated the code in the same way as in https://github.com/huggingface/diffusers/issues/2606. However, I have no idea how to fix the cannot pickle error. The related issue still seems to be open.
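For readers unfamiliar with this class of error, here is a minimal sketch of how a `cannot pickle` TypeError arises and one generic way to avoid it. It does not use DeepSpeed or torch: `EngineLike` is a hypothetical class, and a `threading.Lock` stands in for the `ProcessGroup` handle, which is an assumption made only so the example runs anywhere. Whether the same approach applies to the object that fails in this issue is not confirmed.

```python
import pickle
import threading


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the TypeError message."""
    try:
        pickle.dumps(obj)
        return None
    except TypeError as exc:
        return str(exc)


class EngineLike:
    """Hypothetical stand-in for an object that holds a process-group handle."""

    def __init__(self):
        self.step = 1
        # A lock is unpicklable, much like a ProcessGroup handle (assumption).
        self.group = threading.Lock()

    def __getstate__(self):
        # Drop the unpicklable handle before the object is serialized.
        state = self.__dict__.copy()
        state.pop("group", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # The handle has to be re-created by the caller after loading.
        self.group = None


# A bare handle cannot be pickled, but the wrapper that excludes it can.
message = try_pickle(threading.Lock())
restored = pickle.loads(pickle.dumps(EngineLike()))
```

The point of the sketch is only that the error comes from some object in the saved state holding a live communication handle; finding which object that is would be the first debugging step.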
Same as title. ~I'm investigating.~ With the following modification, the scripts can save the optimizer state, but only for a small dataset. I don't know why it does not work with a big dataset.
Error messages will be attached here.
Environment
71e2c91330a9d866ec05cdd10584bbb962896a99
Using train_network.py normally
INFO Saving DeepSpeed Model and Optimizer logging.py:61
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6004, OpType=ALLREDUCE, NumelIn=126834688, NumelOut=126834688, Timeout(ms)=600000) ran for 600410 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6004, OpType=ALLREDUCE, NumelIn=126834688, NumelOut=126834688, Timeout(ms)=600000) ran for 600771 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6004, OpType=ALLREDUCE, NumelIn=126834688, NumelOut=126834688, Timeout(ms)=600000) ran for 600333 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
if args.save_every_n_steps is not None and global_step % args.save_every_n_steps == 0:
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        ckpt_name = train_util.get_step_ckpt_name(args, "." + args.save_model_as, global_step)
        save_model(ckpt_name, accelerator.unwrap_model(network), global_step, epoch)
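One possible reading of the watchdog timeout in the log, offered as an assumption rather than a diagnosis: if rank 0 spends longer than the collective timeout (600000 ms in the log) on work the other ranks do not share, the ranks waiting in the next collective give up and the whole job is taken down. The sketch below simulates that with threads and a `threading.Barrier` standing in for the collective. `run_step` and its parameters are hypothetical names, and no NCCL or DeepSpeed code is involved.

```python
import threading
import time


def run_step(world_size, save_seconds, timeout_seconds):
    """Simulate one step where only rank 0 does a (possibly slow) save.

    Every rank then meets at a barrier that stands in for the next
    collective operation. Returns one outcome string per rank.
    """
    barrier = threading.Barrier(world_size)
    outcomes = [None] * world_size

    def worker(rank):
        if rank == 0:
            # Stand-in for saving a checkpoint on the main process only.
            time.sleep(save_seconds)
        try:
            barrier.wait(timeout=timeout_seconds)
            outcomes[rank] = "ok"
        except threading.BrokenBarrierError:
            # A timeout on any rank breaks the barrier for all ranks.
            outcomes[rank] = "timeout"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


fast_save = run_step(world_size=4, save_seconds=0.0, timeout_seconds=2.0)
slow_save = run_step(world_size=4, save_seconds=0.5, timeout_seconds=0.1)
```

With a fast save every rank passes the barrier. With a save slower than the timeout every rank ends up failing, which would be consistent with the save working on a small dataset and failing on a big one if the save time grows, though that link is not established here.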
[2024-04-09 15:39:37,942] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: ./training/model/temp/test-step00000001-state/pytorch_model/mp_rank_00_model_states.pt
[2024-04-09 15:39:37,942] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./training/model/temp/test-step00000001-state/pytorch_model/mp_rank_00_model_states.pt...
[2024-04-09 15:40:06,463] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./training/model/temp/test-step00000001-state/pytorch_model/mp_rank_00_model_states.pt.
[2024-04-09 15:40:07,083] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-04-09 15:40:07,246] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_1_mp_rank_00_optim_states.pt...
[2024-04-09 15:40:07,246] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_2_mp_rank_00_optim_states.pt...
[2024-04-09 15:40:07,249] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_3_mp_rank_00_optim_states.pt...
[2024-04-09 15:40:08,808] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_1_mp_rank_00_optim_states.pt.
[2024-04-09 15:40:08,808] [INFO] [engine.py:3477:_save_zero_checkpoint] zero checkpoint saved ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_1_mp_rank_00_optim_states.pt
[2024-04-09 15:40:08,808] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!
[2024-04-09 15:40:08,818] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_2_mp_rank_00_optim_states.pt.
[2024-04-09 15:40:08,818] [INFO] [engine.py:3477:_save_zero_checkpoint] zero checkpoint saved ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_2_mp_rank_00_optim_states.pt
[2024-04-09 15:40:08,819] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!
[2024-04-09 15:40:08,824] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-04-09 15:40:08,832] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_3_mp_rank_00_optim_states.pt.
[2024-04-09 15:40:08,832] [INFO] [engine.py:3477:_save_zero_checkpoint] zero checkpoint saved ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_3_mp_rank_00_optim_states.pt
[2024-04-09 15:40:08,832] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!
[2024-04-09 15:40:08,883] [INFO] [engine.py:3477:_save_zero_checkpoint] zero checkpoint saved ./training/model/temp/test-step00000001-state/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-04-09 15:40:08,884] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!
2024-04-09 15:40:08 INFO DeepSpeed Model and Optimizer saved logging.py:61 to output dir
./training/model/temp/test-step00000001-state/pytorch_model
Traceback (most recent call last):
  File "/home/hard2251/workspace/sd-scripts/./sdxl_train_network.py", line 185, in
    trainer.train(args)
  File "/home/hard2251/workspace/sd-scripts/train_network.py", line 949, in train
    train_util.save_and_remove_state_stepwise(args, accelerator, global_step)
  File "/home/hard2251/workspace/sd-scripts/library/train_util.py", line 4845, in save_and_remove_state_stepwise
    accelerator.save_state(state_dir)
  File "/home/hard2251/anaconda3/envs/sd-scripts/lib/python3.10/site-packages/accelerate/accelerator.py", line 2706, in save_state
    hook(self._models, weights, output_dir)
  File "/home/hard2251/workspace/sd-scripts/train_network.py", line 488, in save_model_hook
    weights.pop(i)
IndexError: pop from empty list
Before
def save_model_hook(models, weights, output_dir):
    # pop weights of other models than network to save only network weights
After
def save_model_hook(models, weights, output_dir):
    # pop weights of other models than network to save only network weights
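The traceback shows `weights.pop(i)` being called in `save_model_hook` while `weights` was empty. Below is a minimal sketch of a hook that tolerates that case. It is not the actual change made in the script: `make_save_model_hook`, `Network`, and `TextEncoder` are hypothetical names used only for illustration, and the `isinstance` check is an assumed way of telling the network apart from the other models.

```python
class Network:
    """Hypothetical stand-in for the trained network."""


class TextEncoder:
    """Hypothetical stand-in for any other model passed to the hook."""


def make_save_model_hook(network_type):
    """Build a hook that keeps only the weights belonging to network_type."""

    def save_model_hook(models, weights, output_dir):
        # Collect indices of models other than the network.
        remove_indices = [
            i for i, model in enumerate(models) if not isinstance(model, network_type)
        ]
        # Pop from the back so earlier indices stay valid, and skip indices
        # that do not exist because the weights list is shorter or empty.
        for i in reversed(remove_indices):
            if i < len(weights):
                weights.pop(i)

    return save_model_hook


hook = make_save_model_hook(Network)

populated = ["text_encoder_weights", "network_weights"]
hook([TextEncoder(), Network()], populated, "unused_dir")

empty = []
hook([TextEncoder(), Network()], empty, "unused_dir")
```

The length check only avoids the IndexError. It does not by itself explain why the list is empty when saving under DeepSpeed.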