kohya-ss / sd-scripts


Issue with training SDXL Textual Inversion in multi GPU environment #1019

Closed. DKnight54 closed this issue 7 months ago.

DKnight54 commented 8 months ago

I've been trying out textual inversion for SDXL, and while it runs fine in a single-GPU environment, training in a dual-GPU environment throws an AttributeError: 'DistributedDataParallel' object has no attribute 'text_model', even with the --ddp_gradient_as_bucket_view and --ddp_bucket_view flags.

Full stack trace below:

Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 444, in train
    text_encoder.text_model.encoder.requires_grad_(False)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 749 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 748) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
  File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-24_05:40:05
  host      : 10a2396fa887
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 748)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
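
For context, the lookup fails because the text encoder has been wrapped in DistributedDataParallel, which hides the wrapped module's attributes. A minimal sketch of the failure mode (the module below is a stand-in, not the actual sd-scripts code):

import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_model = nn.Linear(4, 4)  # stand-in for the CLIP text model

encoder = TextEncoder()
# Inside a distributed launch:
# ddp_encoder = nn.parallel.DistributedDataParallel(encoder)
# ddp_encoder.text_model                            # AttributeError, as in the traceback above
# ddp_encoder.module.text_model                     # works: go through the wrapped module
# accelerator.unwrap_model(ddp_encoder).text_model  # equivalent when using Accelerate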
kohya-ss commented 8 months ago

I updated the dev branch. I didn't test it with DDP, but I hope it fixes this issue.

DKnight54 commented 8 months ago

Hey @kohya-ss,

Tried out the dev branch and it's able to start training, but I ran into a problem while generating sample images. Strangely, it looks like it might be related to this issue with generating sample images while training LoRAs as well.

Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 533, in train
    self.sample_images(
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 89, in sample_images
    sdxl_train_util.sample_images(
  File "/kaggle/temp/sd-scripts/library/sdxl_train_util.py", line 367, in sample_images
    return train_util.sample_images_common(SdxlStableDiffusionLongPromptWeightingPipeline, *args, **kwargs)
  File "/kaggle/temp/sd-scripts/library/train_util.py", line 4758, in sample_images_common
    image = pipeline.latents_to_image(latents)[0]
  File "/kaggle/temp/sd-scripts/library/sdxl_lpw_stable_diffusion.py", line 1035, in latents_to_image
    image = self.decode_latents(latents.to(self.vae.dtype))
  File "/kaggle/temp/sd-scripts/library/sdxl_lpw_stable_diffusion.py", line 714, in decode_latents
    image = self.vae.decode(latents.to(self.vae.dtype)).sample
  File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 286, in decode
    decoded = self._decode(z).sample
  File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 273, in _decode
    dec = self.decoder(z)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/vae.py", line 272, in forward
    sample = up_block(sample, latent_embeds)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 2375, in forward
    hidden_states = upsampler(hidden_states)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/resnet.py", line 170, in forward
    hidden_states = self.conv(hidden_states, scale)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/lora.py", line 163, in forward
    return F.conv2d(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 14.76 GiB total capacity; 12.29 GiB already allocated; 905.75 MiB free; 12.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps:   0%|                                            | 0/120 [00:35<?, ?it/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 748 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 747) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
  File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-24_15:09:43
  host      : ba06d8ffffab
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 747)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
DKnight54 commented 8 months ago

Hrm... I've decided to give up on sample image generation during training for now, but I've been running into this issue instead at the start of training:

steps:   0%|                                           | 0/6840 [00:00<?, ?it/s]
epoch 1/6
Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 619, in train
    accelerator.unwrap_model(text_encoder).get_input_embeddings().weight[
RuntimeError: Index put requires the source and destination dtypes match, got Half for the destination and Float for the source.
Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 619, in train
    accelerator.unwrap_model(text_encoder).get_input_embeddings().weight[
RuntimeError: Index put requires the source and destination dtypes match, got Half for the destination and Float for the source.
steps:   0%|                                           | 0/6840 [00:13<?, ?it/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1544 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1545) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
  File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-30_16:43:49
  host      : 56716c7ca920
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1545)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
kohya-ss commented 8 months ago

Sorry for the delay. full_fp16 or full_bf16 seems to cause this error. I've updated the dev branch again; it should fix the error.
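
For anyone curious, the mismatch is easy to reproduce in isolation. A minimal sketch (hypothetical tensors, not the sd-scripts code): under --full_fp16 the embedding weight is fp16 while the preserved token embeddings stay fp32, so the index assignment needs the source cast to the destination dtype.

import torch

weight = torch.zeros(10, 4, dtype=torch.float16)    # embedding table under full_fp16
new_embs = torch.randn(2, 4, dtype=torch.float32)   # token embeddings kept in fp32
token_ids = [3, 7]

# weight[token_ids] = new_embs                   # RuntimeError: source/destination dtypes must match
weight[token_ids] = new_embs.to(weight.dtype)    # cast the source to the destination dtype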

DKnight54 commented 8 months ago

No worries, I hope it was because you were enjoying the holidays with friends and family. I'll test it as soon as I can.

DKnight54 commented 8 months ago

Got a new error this time

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 735) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
  File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-06_03:29:24
  host      : 3f8eb37d2946
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 736)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-06_03:29:24
  host      : 3f8eb37d2946
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 735)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
kohya-ss commented 8 months ago

Could you please share a few more lines of the log above this? The message shown doesn't include the cause.

DKnight54 commented 8 months ago

Oops, sorry, I thought I had already included the full traceback.

The params:

accelerate launch --config_file="/kaggle/temp/sd-scripts/accelerate_config/config.yaml" --num_cpu_threads_per_process=1 sdxl_train_textual_inversion.py --sample_prompts="/kaggle/temp/LoRA/train_data/json/sample_prompt_ti.toml" --no_half_vae --shuffle_caption --ddp_gradient_as_bucket_view --ddp_static_graph --pretrained_model_name_or_path="/kaggle/input/stable-diffusion-xl/pytorch/base-1-0/1" --vae="/kaggle/temp/vae/sdxl_vae.safetensors" --output_dir="/kaggle/temp/output/TCL_Waifu_TI" --output_name="TCL_Waifu_TI" --token_string="TCL_Waifu" --init_word="woman" --num_vectors_per_token=10 --save_precision="fp16" --save_every_n_epochs=1 --train_batch_size=4 --max_token_length=225 --mem_eff_attn --sdpa --max_train_epochs=6 --gradient_checkpointing --gradient_accumulation_steps=1 --mixed_precision="fp16" --cache_latents_to_disk --prior_loss_weight=1.0 --full_fp16 --sample_every_n_epochs=1 --sample_sampler="k_euler_a" --save_model_as="safetensors" --huggingface_token="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" --optimizer_type="AdaFactor" --learning_rate=1e-06 --max_grad_norm=0 --lr_scheduler="constant" --lr_warmup_steps=0 --dataset_config="/kaggle/temp/LoRA/train_data/json/dataset_config_TI.toml"

The traceback:

Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 605, in train
    accelerator.backward(loss)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7a102a23f410> returned NULL without setting an exception
Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 605, in train
    accelerator.backward(loss)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x79c4683c3410> returned NULL without setting an exception
steps:   0%|                                           | 0/1728 [00:16<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 735) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
  File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-06_03:29:24
  host      : 3f8eb37d2946
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 736)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-06_03:29:24
  host      : 3f8eb37d2946
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 735)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
kohya-ss commented 8 months ago

Thank you! This error seems to be related to https://github.com/pytorch/pytorch/issues/75750, so the correct error message is not being displayed. Since that PyTorch issue appears to have already been resolved, could you please update PyTorch to the latest version and try again to see what error message you get?

DKnight54 commented 8 months ago

Interesting. Even running the current Torch 2.1.2 gives an identical error.

One interesting clue: running with only the --ddp_gradient_as_bucket_view flag seems to allow it to complete one step before failing. Running without either flag currently works on the dev version.
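
For reference, as far as I can tell these flags just toggle the corresponding DistributedDataParallel constructor arguments through Accelerate's kwargs handler; a rough sketch of the mapping (assuming a recent accelerate version):

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

ddp_kwargs = DistributedDataParallelKwargs(
    gradient_as_bucket_view=True,  # --ddp_gradient_as_bucket_view
    static_graph=True,             # --ddp_static_graph
)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])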

With --ddp_gradient_as_bucket_view and --ddp_static_graph

Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 607, in train
    accelerator.backward(loss)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7cbc30f73850> returned NULL without setting an exception
steps:   0%|                                            | 0/462 [00:16<?, ?it/s]
Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 607, in train
    accelerator.backward(loss)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7bf81259b850> returned NULL without setting an exception
[2024-01-14 13:56:22,964] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 839) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
  File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-14_13:56:22
  host      : e9d4d6462f9d
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 840)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-14_13:56:22
  host      : e9d4d6462f9d
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 839)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /kaggle/temp/venv/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, open-clip-torch, pytorch-lightning, timm, torchmetrics, torchvision

With --ddp_static_graph only

Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 607, in train
    accelerator.backward(loss)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7aff5729f850> returned NULL without setting an exception
Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 607, in train
    accelerator.backward(loss)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7964d78ff850> returned NULL without setting an exception
steps:   0%|                                            | 0/462 [00:17<?, ?it/s]
[2024-01-14 14:14:03,240] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1126) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
  File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-14_14:14:03
  host      : e9d4d6462f9d
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1127)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-14_14:14:03
  host      : e9d4d6462f9d
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1126)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /kaggle/temp/venv/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, open-clip-torch, pytorch-lightning, timm, torchmetrics, torchvision

With --ddp_gradient_as_bucket_view only

steps:   0%|                     | 1/462 [00:18<2:19:27, 18.15s/it, loss=0.0952]Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 607, in train
    accelerator.backward(loss)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
    self.scaler.scale(loss).backward(**kwargs)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 300.00 MiB. GPU 1 has a total capacty of 14.75 GiB of which 173.06 MiB is free. Process 15213 has 14.58 GiB memory in use. Of the allocated memory 13.18 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-01-14 14:07:50,667] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1048 closing signal SIGTERM
[2024-01-14 14:07:50,831] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1049) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
  File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-14_14:07:50
  host      : e9d4d6462f9d
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1049)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /kaggle/temp/venv/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, open-clip-torch, pytorch-lightning, timm, torchmetrics, torchvision
kohya-ss commented 8 months ago

With --ddp_gradient_as_bucket_view only, the error seems to be caused by OOM.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 300.00 MiB. GPU 1 has a total capacty of 14.75 GiB of which 173.06 MiB is free. Process 15213 has 14.58 GiB memory in use. Of the allocated memory 13.18 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Could you please decrease the batch size?
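
If reducing the batch size alone isn't enough, the allocator setting the error message itself mentions may also help; an illustrative way to set it (the value is just an example, and it has to be in the environment before torch initializes CUDA, e.g. exported before accelerate launch):

import os

# Example only: reduce fragmentation by capping the split block size.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")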

DKnight54 commented 8 months ago

With --ddp_gradient_as_bucket_view only, the error seems to be caused by OOM.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 300.00 MiB. GPU 1 has a total capacty of 14.75 GiB of which 173.06 MiB is free. Process 15213 has 14.58 GiB memory in use. Of the allocated memory 13.18 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Could you please decrease the batch size?

When trying this, training mostly seems to succeed; however, at the end, when saving the final safetensors file, it throws another error and the last safetensors file is not saved.

epoch 3/3
steps: 100%|██████████████████████| 165/165 [47:47<00:00, 17.38s/it, loss=0.126]
saving last state.
Traceback (most recent call last):
  File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
    trainer.train(args)
  File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 739, in train
    updated_embs = text_encoder.get_input_embeddings().weight[token_ids].data.detach().clone()
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'get_input_embeddings'

saving checkpoint: /kaggle/temp/output/TCLohwx_TI/TCLohwx_TI.safetensors
model saved.
steps: 100%|██████████████████████| 165/165 [48:35<00:00, 17.67s/it, loss=0.126]
[2024-01-20 13:26:15,105] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 806) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
  File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-20_13:26:15
  host      : 73fc472f4265
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 806)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /kaggle/temp/venv/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, open-clip-torch, pytorch-lightning, timm, torchmetrics, torchvision
kohya-ss commented 7 months ago

I've updated. I hope it fixes the issue😀

DKnight54 commented 7 months ago

The saving issue has been fixed and it's able to save now, although I'm puzzled by the training results I'm getting. I'm doing more experiments with and without --ddp_gradient_as_bucket_view to see how it affects the training results.

Also, while I don't fully understand the behaviour, the random VRAM OOMs I've been experiencing during image sampling in multi-GPU environments happen because there's a spike in VRAM usage when the latents are converted into images by the VAE during sample image generation. I'm not quite sure what the root cause is, as it seems to be somewhere in the Torch libraries, but basically the VRAM stays reserved even though it should in theory be available, causing the OOM.

I've been able to work around this issue by inserting a call to torch.cuda.empty_cache() after the latents have been generated and before they are converted into images.
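
A minimal sketch of that workaround (the wrapper name is hypothetical; in practice the call goes into the sample-image path right before the latents are handed to the VAE decode):

import torch

def decode_latents_with_cache_clear(pipeline, latents):
    # Release cached blocks reserved during denoising so the VAE decode
    # can allocate its activations without tripping an OOM on a ~15 GiB card.
    torch.cuda.empty_cache()
    return pipeline.latents_to_image(latents)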