[CUDA out of memory] training in 1024 × 576 resolution in the A100 80G

@pixeli99 1GPU utilizing all 40GB VRAM, eval-off, 512x256 small-res unet training test was successful however same error here, High resolution appears to be an issue. For

Default frame resolution for SVD-xt-1.1 - 1280x720, 6*H100/480GB VRAM


accelerate launch train_svd.py --pretrained_model_name_or_path=stabilityai/stable-video-diffusion-img2vid-xt --per_gpu_batch_size=1 --gradient_accumulation_steps=1 --max_train_steps=50000  --width=1280 --height=720 --checkpointing_steps=1000 --checkpoints_total_limit=1 --learning_rate=1e-5 --lr_warmup_steps=0  --seed=123 --mixed_precision="fp16" --validation_steps=200

self.base_folder = 'base_folder'

SVD_Xtend/base_folder/video_folder7
                                 | - frame0.png
                                 ...
                                 | - frame31775.png


+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   41C    P0            101W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:3B:00.0 Off |                    0 |
| N/A   37C    P0            102W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   36C    P0            102W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   41C    P0            104W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9B:00.0 Off |                    0 |
| N/A   41C    P0            101W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:BB:00.0 Off |                    0 |
| N/A   36C    P0             99W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@6d107f50ed63:/SVD_XtTrain# nvcc -Vnvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
####
pip install -q -U diffusers transformers accelerate pip install opencv-python einops
pip install bitsandbytes
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
####

root@6d107f50ed63:/SVD_XtTrain# accelerate config------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------In which compute environment are you running?
This machine
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: NO
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
Do you want to use Megatron-LM ? [yes/NO]: NO
How many GPU(s) should be used for distributed training? [1]:6
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml

#### xformers enable
root@6d107f50ed63:/SVD_XtTrain# accelerate launch train_svd.py --pretrained_model_name_or_path=stable-video-diffusion-img2vid-xt-1-1 --per_gpu_batch_size=1 --gradient_accumulation_steps=1 --max_train_steps=50000  --width=1280  --height=720 --checkpointing_steps=1000 --checkpoints_total_limit=1 --learning_rate=1e-5 --lr_warmup_steps=0  --seed=123 --mixed_precision="fp16" --validation_steps=200 --use_8bit_adam --use_ema --enable_xformers_memory_efficient_attention
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:50:47 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 5
Local process index: 5
Device: cuda:5

Mixed precision type: fp16

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:50:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

{'rescale_betas_zero_snr'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:50:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 3
Local process index: 3
Device: cuda:3

Mixed precision type: fp16

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:50:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: fp16

04/27/2024 17:50:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 4
Local process index: 4
Device: cuda:4

Mixed precision type: fp16

04/27/2024 17:50:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 2
Local process index: 2
Device: cuda:2

Mixed precision type: fp16

04/27/2024 17:51:01 - INFO - __main__ - ***** Running training *****
04/27/2024 17:51:01 - INFO - __main__ -   Num examples = 100000
04/27/2024 17:51:01 - INFO - __main__ -   Num Epochs = 3
04/27/2024 17:51:01 - INFO - __main__ -   Instantaneous batch size per device = 1
04/27/2024 17:51:01 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 6
04/27/2024 17:51:01 - INFO - __main__ -   Gradient Accumulation steps = 1
04/27/2024 17:51:01 - INFO - __main__ -   Total optimization steps = 50000
Steps:   0%|                                                                                                                                                                                                                                                          | 0/50000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Steps:   0%|                                                                                                                                                                                                                                                          | 0/50000 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[2024-04-27 17:51:07,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5913 closing signal SIGTERM
[2024-04-27 17:51:07,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5914 closing signal SIGTERM
[2024-04-27 17:51:07,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5917 closing signal SIGTERM
[2024-04-27 17:51:07,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5918 closing signal SIGTERM
[2024-04-27 17:51:08,247] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 5915) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1066, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_svd.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-27_17:51:07
  host      : 7d107e50eb65
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 5916)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-27_17:51:07
  host      : 7d107e50eb65
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 5915)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
#### xformers disable
root@6d107f50ed63:/SVD_XtTrain# accelerate launch train_svd.py --pretrained_model_name_or_path=stable-video-diffusion-img2vid-xt-1-1 --per_gpu_batch_size=1 --gradient_accumulation_steps=1 --max_train_steps=50000  --width=1280  --height=720 --checkpointing_steps=1000 --checkpoints_total_limit=1 --learning_rate=1e-5 --lr_warmup_steps=0  --seed=123 --mixed_precision="fp16" --validation_steps=200 --use_8bit_adam
...
Mixed precision type: fp16

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:50:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

{'rescale_betas_zero_snr'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:50:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 3
Local process index: 3
Device: cuda:3

Mixed precision type: fp16

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:50:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: fp16

04/27/2024 17:50:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 4
Local process index: 4
Device: cuda:4

Mixed precision type: fp16

04/27/2024 17:50:48 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 2
Local process index: 2
Device: cuda:2

Mixed precision type: fp16

04/27/2024 17:51:01 - INFO - __main__ - ***** Running training *****
04/27/2024 17:51:01 - INFO - __main__ -   Num examples = 100000
04/27/2024 17:51:01 - INFO - __main__ -   Num Epochs = 3
04/27/2024 17:51:01 - INFO - __main__ -   Instantaneous batch size per device = 1
04/27/2024 17:51:01 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 6
04/27/2024 17:51:01 - INFO - __main__ -   Gradient Accumulation steps = 1
04/27/2024 17:51:01 - INFO - __main__ -   Total optimization steps = 50000
Steps:   0%|                                                                                                                                                                                                                                                          | 0/50000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Steps:   0%|                                                                                                                                                                                                                                                          | 0/50000 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 434, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2183, in forward
    hidden_states = attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_temporal.py", line 359, in forward
    hidden_states_mix = temporal_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py", line 507, in forward
    attn_output = self.attn1(norm_hidden_states, encoder_hidden_states=None)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 522, in forward
    return self.processor(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1191, in __call__
    hidden_states = xformers.ops.memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 247, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 370, in _memory_efficient_attention
    return _fMHA.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 61, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/__init__.py", line 398, in _memory_efficient_attention_forward_requires_grad
    out = op.apply(inp, needs_gradient=True)
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 523, in apply
    out, softmax_lse, rng_state = cls.OPERATOR(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 755, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py", line 114, in _flash_fwd
    ) = _C_flashattention.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[2024-04-27 17:51:07,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5913 closing signal SIGTERM
[2024-04-27 17:51:07,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5914 closing signal SIGTERM
[2024-04-27 17:51:07,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5917 closing signal SIGTERM
[2024-04-27 17:51:07,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5918 closing signal SIGTERM
[2024-04-27 17:51:08,247] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 5915) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1066, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_svd.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-27_17:51:07
  host      : 7d107e50eb65
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 5916)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-27_17:51:07
  host      : 7d107e50eb65
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 5915)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@6d107f50ed63:/SVD_XtTrain# accelerate launch train_svd.py --pretrained_model_name_or_path=stable-video-diffusion-img2vid-xt-1-1 --per_gpu_batch_size=1 --gradient_accumulation_steps=1 --max_train_steps=50000  --width=1280  --height=720 --checkpointing_steps=1000 --checkpoints_total_limit=1 --learning_rate=1e-5 --lr_warmup_steps=0  --seed=123 --mixed_precision="fp16" --validation_steps=200 --use_8bit_adam
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:52:22 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

{'rescale_betas_zero_snr'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:52:23 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: fp16

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:52:23 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 4
Local process index: 4
Device: cuda:4

Mixed precision type: fp16

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:52:23 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 5
Local process index: 5
Device: cuda:5

Mixed precision type: fp16

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:52:23 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 2
Local process index: 2
Device: cuda:2

Mixed precision type: fp16

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:391: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
04/27/2024 17:52:23 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 6
Process index: 3
Local process index: 3
Device: cuda:3

Mixed precision type: fp16

04/27/2024 17:52:36 - INFO - __main__ - ***** Running training *****
04/27/2024 17:52:36 - INFO - __main__ -   Num examples = 100000
04/27/2024 17:52:36 - INFO - __main__ -   Num Epochs = 3
04/27/2024 17:52:36 - INFO - __main__ -   Instantaneous batch size per device = 1
04/27/2024 17:52:36 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 6
04/27/2024 17:52:36 - INFO - __main__ -   Gradient Accumulation steps = 1
04/27/2024 17:52:36 - INFO - __main__ -   Total optimization steps = 50000
Steps:   0%|                                                                                                                                                                                                                                                          | 0/50000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 441, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2062, in forward
    hidden_states = resnet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 713, in forward
    hidden_states = self.temporal_res_block(hidden_states, temb)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 627, in forward
    hidden_states = self.nonlinearity(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 393, in forward
    return F.silu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2075, in silu
    return torch._C._nn.silu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB. GPU 4 has a total capacity of 79.10 GiB of which 28.00 MiB is free. Process 2038003 has 79.06 GiB memory in use. Of the allocated memory 75.21 GiB is allocated by PyTorch, and 384.38 MiB is reserved by PyTorchbut unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 450, in forward
    sample = self.mid_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 1941, in forward
    hidden_states = self.resnets[0](
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 698, in forward
    hidden_states = self.spatial_res_block(hidden_states, temb)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 332, in forward
    hidden_states = self.norm1(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/normalization.py", line 287, in forward
    return F.group_norm(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2561, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB. GPU 1 has a total capacity of 79.10 GiB of which 8.00 MiB is free. Process 2038000 has 79.08 GiB memory in use. Of the allocated memory 75.26 GiB is allocated by PyTorch, and 354.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 441, in forward
    sample, res_samples = downsample_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 2062, in forward
    hidden_states = resnet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 713, in forward
    hidden_states = self.temporal_res_block(hidden_states, temb)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 627, in forward
    hidden_states = self.nonlinearity(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 393, in forward
    return F.silu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2075, in silu
    return torch._C._nn.silu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB. GPU 3 has a total capacity of 79.10 GiB of which 2.00 MiB is free. Process 2038002 has 79.09 GiB memory in use. Of the allocated memory 75.21 GiB is allocated by PyTorch, and 412.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 450, in forward
    sample = self.mid_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 1941, in forward
    hidden_states = self.resnets[0](
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 698, in forward
    hidden_states = self.spatial_res_block(hidden_states, temb)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 332, in forward
    hidden_states = self.norm1(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/normalization.py", line 287, in forward
    return F.group_norm(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2561, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB. GPU 2 has a total capacity of 79.10 GiB of which 12.00 MiB is free. Process 2038001 has 79.08 GiB memory in use. Of the allocated memory 75.26 GiB is allocated by PyTorch, and 349.80 MiB is reserved by PyTorchbut unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 450, in forward
    sample = self.mid_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 1941, in forward
    hidden_states = self.resnets[0](
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 698, in forward
    hidden_states = self.spatial_res_block(hidden_states, temb)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 332, in forward
    hidden_states = self.norm1(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/normalization.py", line 287, in forward
    return F.group_norm(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2561, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 26.00 MiB is free. Process 2037999 has 79.06 GiB memory in use. Of the allocated memory 75.26 GiB is allocated by PyTorch, and 383.27 MiB is reserved by PyTorchbut unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Steps:   0%|                                                                                                                                                                                                                                                          | 0/50000 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/SVD_XtTrain/train_svd.py", line 1255, in <module>
    main()
  File "/SVD_XtTrain/train_svd.py", line 1083, in main
    model_pred = unet(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 825, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 813, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_spatio_temporal_condition.py", line 450, in forward
    sample = self.mid_block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_3d_blocks.py", line 1941, in forward
    hidden_states = self.resnets[0](
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 713, in forward
    hidden_states = self.temporal_res_block(hidden_states, temb)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/resnet.py", line 616, in forward
    hidden_states = self.norm1(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/normalization.py", line 287, in forward
    return F.group_norm(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2561, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB. GPU 5 has a total capacity of 79.10 GiB of which 22.00 MiB is free. Process 2038004 has 79.07 GiB memory in use. Of the allocated memory 75.48 GiB is allocated by PyTorch, and 354.33 MiB is reserved by PyTorchbut unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-04-27 17:52:42,688] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 9202) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1066, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_svd.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-27_17:52:42
  host      : 7d107e50eb65
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 9203)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-04-27_17:52:42
  host      : 7d107e50eb65
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 9204)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-04-27_17:52:42
  host      : 7d107e50eb65
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 9205)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-04-27_17:52:42
  host      : 7d107e50eb65
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 9206)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-04-27_17:52:42
  host      : 7d107e50eb65
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 9207)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-27_17:52:42
  host      : 7d107e50eb65
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 9202)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@6d107f50ed63:/SVD_XtTrain#

Edit: Additionally, a resolution of 640x360 (dataset + training) encounters tensor errors, suggesting the ratio is not scalable?

pixeli99 / SVD_Xtend

[CUDA out of memory] training in 1024 × 576 resolution in the A100 80G #39