DKnight54 closed this issue 10 months ago.
I updated the dev branch. I didn't test it with DDP, but I hope it fixes this issue.
Hey @kohya-ss,
Tried out the dev branch and it's able to start training, but I ran into a problem while generating sample images. Strangely, it looks like it might be related to this issue with generating sample images while training LoRAs as well.
Traceback (most recent call last):
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
trainer.train(args)
File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 533, in train
self.sample_images(
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 89, in sample_images
sdxl_train_util.sample_images(
File "/kaggle/temp/sd-scripts/library/sdxl_train_util.py", line 367, in sample_images
return train_util.sample_images_common(SdxlStableDiffusionLongPromptWeightingPipeline, *args, **kwargs)
File "/kaggle/temp/sd-scripts/library/train_util.py", line 4758, in sample_images_common
image = pipeline.latents_to_image(latents)[0]
File "/kaggle/temp/sd-scripts/library/sdxl_lpw_stable_diffusion.py", line 1035, in latents_to_image
image = self.decode_latents(latents.to(self.vae.dtype))
File "/kaggle/temp/sd-scripts/library/sdxl_lpw_stable_diffusion.py", line 714, in decode_latents
image = self.vae.decode(latents.to(self.vae.dtype)).sample
File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 286, in decode
decoded = self._decode(z).sample
File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 273, in _decode
dec = self.decoder(z)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/vae.py", line 272, in forward
sample = up_block(sample, latent_embeds)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 2375, in forward
hidden_states = upsampler(hidden_states)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/resnet.py", line 170, in forward
hidden_states = self.conv(hidden_states, scale)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/diffusers/models/lora.py", line 163, in forward
return F.conv2d(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 14.76 GiB total capacity; 12.29 GiB already allocated; 905.75 MiB free; 12.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps: 0%| | 0/120 [00:35<?, ?it/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 748 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 747) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-12-24_15:09:43
host : ba06d8ffffab
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 747)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Hrm... I decided to give up on sample image generation during training for now, but I have been running into this issue instead at the start of training:
steps: 0%| | 0/6840 [00:00<?, ?it/s]
epoch 1/6
Traceback (most recent call last):
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
trainer.train(args)
File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 619, in train
accelerator.unwrap_model(text_encoder).get_input_embeddings().weight[
RuntimeError: Index put requires the source and destination dtypes match, got Half for the destination and Float for the source.
Traceback (most recent call last):
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
trainer.train(args)
File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 619, in train
accelerator.unwrap_model(text_encoder).get_input_embeddings().weight[
RuntimeError: Index put requires the source and destination dtypes match, got Half for the destination and Float for the source.
steps: 0%| | 0/6840 [00:13<?, ?it/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1544 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1545) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-12-30_16:43:49
host : 56716c7ca920
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1545)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Sorry for the delay. full_fp16 or full_bf16 seems to cause this error. I've updated the dev branch again. It will fix the error.
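For context, a minimal self-contained illustration of that dtype mismatch (an assumed sketch of the reported behaviour, not the actual sd-scripts change): with full_fp16 the embedding weight being written to is fp16 while the values written back are fp32, so the source has to be cast to the destination dtype first.

# Assumed illustration, not the actual fix in train_textual_inversion.py:
# with --full_fp16 the text encoder weight is fp16 while the backed-up token
# embeddings stay fp32, so writing them back needs an explicit cast.
import torch

embeddings = torch.nn.Embedding(1000, 8).half()   # destination weight in fp16
backup_embs = torch.randn(3, 8)                   # source values kept in fp32
token_ids = torch.tensor([3, 7, 42])

with torch.no_grad():
    # embeddings.weight[token_ids] = backup_embs  # raises the RuntimeError above
    embeddings.weight[token_ids] = backup_embs.to(embeddings.weight.dtype)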
No worries, I hope the delay was because you were enjoying the holidays with friends and family. I'll test it as soon as I can.
Got a new error this time:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 735) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-01-06_03:29:24
host : 3f8eb37d2946
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 736)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-06_03:29:24
host : 3f8eb37d2946
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 735)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Can you please share a few more lines of the log above, as this message does not include the cause?
Oops, sorry, I thought I had already included the full traceback.
The params:
accelerate launch --config_file="/kaggle/temp/sd-scripts/accelerate_config/config.yaml" --num_cpu_threads_per_process=1 sdxl_train_textual_inversion.py --sample_prompts="/kaggle/temp/LoRA/train_data/json/sample_prompt_ti.toml" --no_half_vae --shuffle_caption --ddp_gradient_as_bucket_view --ddp_static_graph --pretrained_model_name_or_path="/kaggle/input/stable-diffusion-xl/pytorch/base-1-0/1" --vae="/kaggle/temp/vae/sdxl_vae.safetensors" --output_dir="/kaggle/temp/output/TCL_Waifu_TI" --output_name="TCL_Waifu_TI" --token_string="TCL_Waifu" --init_word="woman" --num_vectors_per_token=10 --save_precision="fp16" --save_every_n_epochs=1 --train_batch_size=4 --max_token_length=225 --mem_eff_attn --sdpa --max_train_epochs=6 --gradient_checkpointing --gradient_accumulation_steps=1 --mixed_precision="fp16" --cache_latents_to_disk --prior_loss_weight=1.0 --full_fp16 --sample_every_n_epochs=1 --sample_sampler="k_euler_a" --save_model_as="safetensors" --huggingface_token="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" --optimizer_type="AdaFactor" --learning_rate=1e-06 --max_grad_norm=0 --lr_scheduler="constant" --lr_warmup_steps=0 --dataset_config="/kaggle/temp/LoRA/train_data/json/dataset_config_TI.toml"
The traceback:
Traceback (most recent call last):
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
trainer.train(args)
File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 605, in train
accelerator.backward(loss)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7a102a23f410> returned NULL without setting an exception
Traceback (most recent call last):
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
trainer.train(args)
File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 605, in train
accelerator.backward(loss)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x79c4683c3410> returned NULL without setting an exception
steps: 0%| | 0/1728 [00:16<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 735) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-01-06_03:29:24
host : 3f8eb37d2946
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 736)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-06_03:29:24
host : 3f8eb37d2946
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 735)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Thank you! This error seems to be related to https://github.com/pytorch/pytorch/issues/75750, so the correct error message is not being displayed. Since this PyTorch issue seems to have already been resolved, could you please update PyTorch to the latest version and try again to see what error message you get?
Interesting. Even running the current Torch 2.1.2 gives the identical error.
One interesting clue is that running with only the --ddp_gradient_as_bucket_view flag seems to allow it to complete one step before failing. Running without both flags currently works on the dev version.
With --ddp_gradient_as_bucket_view and --ddp_static_graph:
Traceback (most recent call last):
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
trainer.train(args)
File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 607, in train
accelerator.backward(loss)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7cbc30f73850> returned NULL without setting an exception
steps: 0%| | 0/462 [00:16<?, ?it/s]
Traceback (most recent call last):
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
trainer.train(args)
File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 607, in train
accelerator.backward(loss)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7bf81259b850> returned NULL without setting an exception
[2024-01-14 13:56:22,964] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 839) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-01-14_13:56:22
host : e9d4d6462f9d
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 840)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-14_13:56:22
host : e9d4d6462f9d
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 839)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /kaggle/temp/venv/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, open-clip-torch, pytorch-lightning, timm, torchmetrics, torchvision
With --ddp_static_graph only:
Traceback (most recent call last):
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
trainer.train(args)
File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 607, in train
accelerator.backward(loss)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7aff5729f850> returned NULL without setting an exception
Traceback (most recent call last):
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
trainer.train(args)
File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 607, in train
accelerator.backward(loss)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7964d78ff850> returned NULL without setting an exception
steps: 0%| | 0/462 [00:17<?, ?it/s]
[2024-01-14 14:14:03,240] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1126) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-01-14_14:14:03
host : e9d4d6462f9d
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1127)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-14_14:14:03
host : e9d4d6462f9d
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1126)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /kaggle/temp/venv/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, open-clip-torch, pytorch-lightning, timm, torchmetrics, torchvision
With --ddp_gradient_as_bucket_view only:
steps: 0%| | 1/462 [00:18<2:19:27, 18.15s/it, loss=0.0952]Traceback (most recent call last):
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
trainer.train(args)
File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 607, in train
accelerator.backward(loss)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1903, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 300.00 MiB. GPU 1 has a total capacty of 14.75 GiB of which 173.06 MiB is free. Process 15213 has 14.58 GiB memory in use. Of the allocated memory 13.18 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-01-14 14:07:50,667] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1048 closing signal SIGTERM
[2024-01-14 14:07:50,831] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1049) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-14_14:07:50
host : e9d4d6462f9d
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1049)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
ERROR: ld.so: object '/kaggle/temp/libtcmalloc_minimal.so.4' from LD_PRELOAD cannot be preloaded (file too short): ignored.
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /kaggle/temp/venv/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, open-clip-torch, pytorch-lightning, timm, torchmetrics, torchvision
With --ddp_gradient_as_bucket_view only, the error seems to be caused by OOM:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 300.00 MiB. GPU 1 has a total capacty of 14.75 GiB of which 173.06 MiB is free. Process 15213 has 14.58 GiB memory in use. Of the allocated memory 13.18 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Could you please decrease the batch size?
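As an aside, the OOM message above also suggests max_split_size_mb; a hedged sketch of how that allocator hint could be supplied (untested here, and the 256 MiB value is only an example) is to set the environment variable before torch makes its first CUDA allocation, for example exported in the shell that runs accelerate launch or at the very top of the entry script:

# Assumed sketch based only on the hint in the OOM message above; the value is
# an example. The variable must be set before the first CUDA allocation, which
# setting it before importing torch guarantees.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:256"

import torch  # imported after the variable is set so the allocator picks it up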
After decreasing the batch size, training mostly completes successfully; however, at the end, when saving the final safetensors file, it throws another error, and the last safetensors file is not saved.
epoch 3/3
steps: 100%|██████████████████████| 165/165 [47:47<00:00, 17.38s/it, loss=0.126]
saving last state.
Traceback (most recent call last):
File "/kaggle/temp/sd-scripts/sdxl_train_textual_inversion.py", line 141, in <module>
trainer.train(args)
File "/kaggle/temp/sd-scripts/train_textual_inversion.py", line 739, in train
updated_embs = text_encoder.get_input_embeddings().weight[token_ids].data.detach().clone()
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'get_input_embeddings'
saving checkpoint: /kaggle/temp/output/TCLohwx_TI/TCLohwx_TI.safetensors
model saved.
steps: 100%|██████████████████████| 165/165 [48:35<00:00, 17.67s/it, loss=0.126]
[2024-01-20 13:26:15,105] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 806) of binary: /kaggle/temp/venv/bin/python
Traceback (most recent call last):
File "/kaggle/temp/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/kaggle/temp/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sdxl_train_textual_inversion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-20_13:26:15
host : 73fc472f4265
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 806)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /kaggle/temp/venv/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, open-clip-torch, pytorch-lightning, timm, torchmetrics, torchvision
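For context on the AttributeError in the log above, here is a minimal single-process illustration (an assumed sketch, not sd-scripts code) of why a DistributedDataParallel wrapper hides the inner model's attributes, and why unwrapping it first, as accelerator.unwrap_model() does elsewhere in the script, avoids the error:

# Assumed single-process illustration (gloo backend, world_size 1); not the
# actual training code. DDP only exposes the wrapped model via .module.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

text_encoder = torch.nn.Embedding(1000, 8)  # stand-in for the real text encoder
wrapped = DDP(text_encoder)

try:
    wrapped.weight                           # inner attributes are not forwarded
except AttributeError as err:
    print(err)                               # 'DistributedDataParallel' object has no attribute 'weight'

print(wrapped.module.weight.shape)           # unwrapping first works: torch.Size([1000, 8])
dist.destroy_process_group()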
I've updated the dev branch. I hope it fixes the issue 😀
The saving issue has been fixed and it's able to save now, although I am puzzled by the training results I'm getting. I'm doing more experiments with and without --ddp_gradient_as_bucket_view to see how it affects the training results.
Also, while I do not fully understand the behaviour, the random VRAM OOMs I've been experiencing during image sampling in multi-GPU environments turn out to come from a spike in VRAM usage when the latents are converted into images by the VAE during sample image generation. I'm not quite sure what the root cause is, as it seems to be somewhere in the Torch libraries, but essentially the VRAM stays reserved even though it should in theory be available, causing the OOM.
I've been able to work around this by inserting a call to torch.cuda.empty_cache() after the latents have been generated and before they are converted into images.
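A minimal sketch of that workaround as described (latents_to_image is the pipeline method from the traceback earlier in the thread; the exact placement of the call is my reading of the description, not the exact patch):

# Hedged sketch of the workaround described above: drop cached-but-unused VRAM
# between the sampling loop and the VAE decode of the sample latents.
import torch

def decode_sample_latents(pipeline, latents):
    # The sampling loop has already produced `latents` at this point.
    torch.cuda.empty_cache()
    with torch.no_grad():
        return pipeline.latents_to_image(latents)  # VAE decode, as in the traceback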
I've been trying out textual inversion for SDXL, and while it seems to run fine in a single-GPU environment, trying to train in a dual-GPU environment throws an
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
error, even with the --ddp_gradient_as_bucket_view and --ddp_bucket_view flags.
Full stack trace below: