kohya-ss / sd-scripts

Apache License 2.0
Flux Dreambooth crashes after trying to save checkpoint #1476

Open DarkViewAI opened 2 months ago

DarkViewAI commented 2 months ago

I am able to run DreamBooth training successfully and the samples look good, but after 600 steps, when it tries to save the checkpoint, it gives me this error:

2024-08-19 07:57:06 INFO train_util.py:5123
INFO saving checkpoint: train_util.py:5124
/home/Ubuntu/Downloads/model 4/Rick-step00000600.safetensors
Traceback (most recent call last):
  File "/home/Ubuntu/apps/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "/home/Ubuntu/apps/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/Ubuntu/apps/kohya_ss/venv/bin/python', '/home/Ubuntu/apps/kohya_ss/sd-scripts/flux_train.py', '--config_file', '/home/Ubuntu/Downloads/model 4/config_dreambooth-20240819-073001.toml', '--fp8_base', '--highvram', '--cpu_offload_checkpointing']' died with <Signals.SIGKILL: 9>.

kohya-ss commented 2 months ago

The latest version reduces the peak RAM usage. Please update the repo.
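
For readers landing here: a training process that dies with SIGKILL (signal 9) while saving a checkpoint has often, though not always, been killed by the Linux out-of-memory killer, which would fit the peak RAM usage mentioned above. The following is only a rough sketch for checking that theory on your own machine. It is not part of sd-scripts, the helper names `describe_returncode` and `parse_mem_available_kib` are made up for illustration, and it assumes a POSIX system with a Linux-style `/proc/meminfo`.

```python
import signal
from typing import Optional


def describe_returncode(returncode: int) -> str:
    """Turn a subprocess return code into a readable description.

    On POSIX, subprocess reports a child killed by signal N as return code -N.
    """
    if returncode < 0:
        # Map the signal number back to its symbolic name, e.g. 9 -> SIGKILL.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def parse_mem_available_kib(meminfo_text: str) -> Optional[int]:
    """Extract the MemAvailable value (in KiB) from /proc/meminfo contents.

    Returns None if the field is missing (for example on very old kernels).
    """
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # The line looks like: "MemAvailable:   12345678 kB"
            return int(line.split()[1])
    return None
```

To use it, read `/proc/meminfo` shortly before the step at which the checkpoint is saved and pass the text to `parse_mem_available_kib`. If the available RAM is well below the size of the checkpoint being written, an out-of-memory kill is a plausible explanation. Running `dmesg` after the crash and looking for an "Out of memory" entry is another way to confirm.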

ClipSkipper commented 2 months ago

Did this get resolved for you? I get hit with a similar error at the beginning of training.