kohya-ss / sd-scripts

Apache License 2.0

Unable to save SDXL checkpoint #818

Closed kosmicdream closed 1 year ago

kosmicdream commented 1 year ago

I am having issues saving my checkpoints after/during training when running sdxl_train.py on an NVIDIA A10G.

Most of the time I get this printout, but sometimes my instance just freezes:

model as StableDiffusion checkpoint to ./output/SDXL_v1.safetensors
Traceback (most recent call last):

/root/sd-scripts/venv_torch2/bin/accelerate:8 in <module>

    5 from accelerate.commands.accelerate_cli import main
    6 if __name__ == '__main__':
    7     sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
  ❱ 8     sys.exit(main())
    9

/root/sd-scripts/venv_torch2/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py:45 in main

    42         exit(1)
    43
    44     # Run
  ❱ 45     args.func(args)
    46
    47
    48 if __name__ == "__main__":

/root/sd-scripts/venv_torch2/lib/python3.10/site-packages/accelerate/commands/launch.py:918 in launch_command

    915     elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA
    916         sagemaker_launcher(defaults, args)
    917     else:
  ❱ 918         simple_launcher(args)
    919
    920
    921 def main():

/root/sd-scripts/venv_torch2/lib/python3.10/site-packages/accelerate/commands/launch.py:580 in simple_launcher

    577     process.wait()
    578     if process.returncode != 0:
    579         if not args.quiet:
  ❱ 580             raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
    581         else:
    582             sys.exit(1)
    583

CalledProcessError: Command '['/root/sd-scripts/venv_torch2/bin/python3.10', 'sdxl_train.py', '--config_file=configs/config.toml', '--dataset_config=configs/config_dataset.toml']' died with <Signals.SIGKILL: 9>.
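The last line of the trace reports that the training process died with SIGKILL (signal 9), so the only Python exception shown comes from the accelerate launcher, not from sdxl_train.py itself. As a minimal sketch of how such a return code can be read, assuming a POSIX system: `describe_returncode` below is a hypothetical helper written for illustration and is not part of accelerate or sd-scripts.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Translate a subprocess return code into a short description.

    Hypothetical helper for illustration only.
    """
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as the negated
        # signal number, so -9 means the child was killed by signal 9.
        sig = signal.Signals(-returncode)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with status {returncode}"


# A child killed with SIGKILL shows up as return code -9.
print(describe_returncode(-9))
```

A process cannot catch SIGKILL, so the child has no chance to print its own traceback. On Linux, one common source of an unexpected SIGKILL is the kernel's out-of-memory killer.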

This is the full printout for my accelerate environment:

2023-09-12 14:36:20.980965: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-12 14:36:22.538160: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Copy-and-paste the text below in your GitHub issue

The weird thing is that depending on which file format I try to save in, the line in sdxl_model_util.py that trips the script is different:

Any help would be very appreciated 🙂

kosmicdream commented 1 year ago

Problem solved: I simply didn't have enough good old RAM.
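For anyone who lands here with the same SIGKILL, one quick check is how much system RAM is actually free before the save step. The sketch below reads the `MemAvailable` field that Linux exposes in `/proc/meminfo`. The function names and the `required_gib` threshold are hypothetical; how much memory a checkpoint save needs depends on the model and the save format, so pick the threshold yourself.

```python
from pathlib import Path


def parse_mem_available_kib(meminfo_text: str) -> int:
    """Return the MemAvailable value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:    8388608 kB"
            return int(line.split()[1])
    raise ValueError("MemAvailable field not found")


def has_enough_ram(meminfo_text: str, required_gib: float) -> bool:
    """Compare available memory against a caller-chosen threshold in GiB."""
    available_gib = parse_mem_available_kib(meminfo_text) / (1024 ** 2)
    return available_gib >= required_gib


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")  # Linux only
    if meminfo.exists():
        text = meminfo.read_text()
        print(f"MemAvailable: {parse_mem_available_kib(text) / 1024 ** 2:.1f} GiB")
```

Running this right before training ends gives a rough idea of the headroom. If the number is small, adding RAM or swap to the instance is the fix that worked in this thread.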