I am having issues saving my checkpoints after/during training when running sdxl_train.py on an NVIDIA A10G.
Most of the time I get this printout, but sometimes my instance just freezes:
model as StableDiffusion checkpoint to ./output/SDXL_v1.safetensors
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/sd-scripts/venv_torch2/bin/accelerate:8 in │
│ │
│ 5 from accelerate.commands.accelerate_cli import main │
│ 6 if __name__ == '__main__': │
│ 7 │ sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0]) │
│ ❱ 8 │ sys.exit(main()) │
│ 9 │
│ │
│ /root/sd-scripts/venv_torch2/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py: │
│ 45 in main │
│ │
│ 42 │ │ exit(1) │
│ 43 │ │
│ 44 │ # Run │
│ ❱ 45 │ args.func(args) │
│ 46 │
│ 47 │
│ 48 if __name__ == "__main__": │
│ │
│ /root/sd-scripts/venv_torch2/lib/python3.10/site-packages/accelerate/commands/launch.py:918 in │
│ launch_command │
│ │
│ 915 │ elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA │
│ 916 │ │ sagemaker_launcher(defaults, args) │
│ 917 │ else: │
│ ❱ 918 │ │ simple_launcher(args) │
│ 919 │
│ 920 │
│ 921 def main(): │
│ │
│ /root/sd-scripts/venv_torch2/lib/python3.10/site-packages/accelerate/commands/launch.py:580 in │
│ simple_launcher │
│ │
│ 577 │ process.wait() │
│ 578 │ if process.returncode != 0: │
│ 579 │ │ if not args.quiet: │
│ ❱ 580 │ │ │ raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) │
│ 581 │ │ else: │
│ 582 │ │ │ sys.exit(1) │
│ 583 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['/root/sd-scripts/venv_torch2/bin/python3.10', 'sdxl_train.py', '--config_file=configs/config.toml', '--dataset_config=configs/config_dataset.toml']' died with
<Signals.SIGKILL: 9>.
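For anyone reading the traceback above: as far as I understand it, the error is raised by the accelerate launcher, not by sdxl_train.py itself. The launcher only reports that the child process ended with a non-zero return code. On POSIX systems, Python's subprocess module reports a child killed by signal N as return code -N, and SIGKILL cannot be caught by the child, which would explain why there is no Python traceback from the training script. Here is a minimal sketch of that mapping (the helper name describe_exit is made up for illustration and is not part of accelerate or sd-scripts):

```python
import signal


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description.

    Hypothetical helper for illustration only; assumes POSIX semantics,
    where a negative return code -N means "terminated by signal N".
    """
    if returncode < 0:
        try:
            # Map the signal number back to its symbolic name, e.g. 9 -> SIGKILL.
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # -9 is what subprocess reports when the child received SIGKILL.
    print(describe_exit(-9))
    print(describe_exit(1))
```

So the "died with <Signals.SIGKILL: 9>" line only says that something outside the script killed it; it does not say what sent the signal.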
This is the full printout for my accelerate environment:
2023-09-12 14:36:20.980965: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-12 14:36:22.538160: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Copy-and-paste the text below in your GitHub issue
Accelerate version: 0.19.0
Accelerate default config:
The weird thing is that, depending on which file format I try to save in, the line in sdxl_model_util.py that trips the script is different:
Any help would be greatly appreciated 🙂