[Open] asutermo opened this issue 7 months ago
Sure, here's the result
2024-04-23 21:06:22 - train.train - INFO - Running ['accelerate', 'launch', '--multi_gpu', '--num_processes=4', '/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py', '--pretrained_model_name_or_path', 'stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir', '/tmp/fai_cache/data/data', '--pretrained_vae_model_name_or_path', 'madebyollin/sdxl-vae-fp16-fix', '--output_dir', '/tmp/demo-20240423210622', '--resolution', '512', '--train_batch_size', '1', '--gradient_accumulation_steps', '4', '--learning_rate', '1e-4', '--lr_scheduler', 'constant', '--lr_warmup_steps', '0', '--max_train_steps', '1000', '--checkpointing_steps', '1000', '--seed', '0', '--gradient_checkpointing', '--checkpoints_total_limit', '3', '--use_8bit_adam', '--enable_xformers_memory_efficient_attention', '--set_grads_to_none']
2024-04-23 21:08:51 - root - INFO - Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2
Mixed precision type: fp16
04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'variance_type', 'dynamic_thresholding_ratio', 'rescale_betas_zero_snr', 'clip_sample_range', 'thresholding'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'dropout', 'reverse_transformer_layers_per_block', 'attention_type'} was not found in config. Values will be initialized to default values.
04/23/2024 21:07:06 - INFO - __main__ - Initializing controlnet weights from unet
Map:   0%|          | 0/29 [00:00<?, ? examples/s]
Map: 100%|██████████| 29/29 [00:02<00:00, 13.42 examples/s]
Map: 100%|██████████| 29/29 [00:02<00:00, 13.19 examples/s]
All four ranks raised the same traceback:
Traceback (most recent call last):
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1438, in <module>
    main(args)
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1192, in main
    controlnet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 0; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.58 GiB free; 11.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 1; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.61 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 2; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 3; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12126) of binary: /home/ubuntu/anaconda3/envs/mvp/bin/python
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/mvp/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
multi_gpu_launcher(args)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-04-23_21:08:51
host : ip-172-31-28-3.us-west-2.compute.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 12127)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-04-23_21:08:51
host : ip-172-31-28-3.us-west-2.compute.internal
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 12128)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-04-23_21:08:51
host : ip-172-31-28-3.us-west-2.compute.internal
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 12129)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-23_21:08:51
host : ip-172-31-28-3.us-west-2.compute.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 12126)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
2024-04-23 21:08:51 - root - INFO - Command exited with code 1
This could be because of how we perform text encoding.
Could you maybe refer to this script and make adjustments accordingly?
I tried running just that script to see. The dataset is quite tiny, but no luck: still CUDA out of memory across multiple GPUs.
File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1227, in main
File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1227, in main
controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
result = tuple(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
result = tuple(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
result = tuple(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
return self.prepare_model(obj, device_placement=device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
return self.prepare_model(obj, device_placement=device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
model = torch.nn.parallel.DistributedDataParallel(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
model = torch.nn.parallel.DistributedDataParallel(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
self._ddp_init_helper(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
self._ddp_init_helper(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
self._ddp_init_helper(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda. OutOfMemoryErrorself.reducer = dist.Reducer(:
CUDA out of memory. Tried to allocate 4.66 GiB (GPU 2; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 3; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Did you only change the dataloader to WebDataset? I'm afraid that alone probably won't work. In https://github.com/huggingface/diffusers/blob/main/examples/research_projects/controlnet/train_controlnet_webdataset.py, we also compute the caption embeddings during the training epochs, NOT ahead of time.
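To illustrate that distinction, here is a minimal, hedged sketch of per-batch text encoding inside the training loop. The small CLIP checkpoint and the `captions_stream` stand-in are illustrative assumptions only; the actual SDXL scripts load two text encoders and also use pooled embeddings.

```python
import torch
from transformers import AutoTokenizer, CLIPTextModel

# Illustrative small text encoder; the real scripts load the SDXL text encoders.
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

# Stand-in for batches streamed from the WebDataset loader.
captions_stream = [["a photo of a cat"], ["a photo of a dog"]]

for captions in captions_stream:
    with torch.no_grad():
        tokens = tokenizer(captions, padding="max_length", truncation=True, return_tensors="pt")
        # Embeddings are computed for this batch only and freed after the step,
        # instead of being precomputed (and kept around) for the whole dataset.
        prompt_embeds = text_encoder(tokens.input_ids).last_hidden_state
    # ...the ControlNet forward/backward pass would consume prompt_embeds here...
```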
Ahh I understand now. I will give that a try. FWIW, I also looked at using deepspeed and other routes, all with the same result. But I'll pursue the embeddings change here.
So, we're going to have large datasets anyway, and I opted to swap to WebDataset. I tried running the WebDataset ControlNet script, and it still fails. The test dataset itself is 12 images.
It seems the embeddings change had no tangible effect here.
All four ranks raised the same traceback:
Traceback (most recent call last):
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1462, in <module>
    main(args)
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1221, in main
    controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 0; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.61 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 1; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.61 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 2; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 3; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I switched to SageMaker (p3.8xlarge) here to see if there's any availability. Still failing. Guidance requested.
!accelerate launch --mixed_precision="fp16" --num_processes=4 --multi_gpu scripts/train_controlnet_webdataset.py \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-xl-base-1.0" \
--train_shards_path_or_url data \
--eval_shards_path_or_url data \
--output_dir output \
--pretrained_vae_model_name_or_path "madebyollin/sdxl-vae-fp16-fix" \
--controlnet_model_name_or_path "diffusers/controlnet-canny-sdxl-1.0" \
--resolution 512 \
--train_batch_size 1 \
--lr_scheduler constant \
--lr_warmup_steps 0 \
--max_train_samples 1000 \
--seed 0 \
--mixed_precision="fp16" \
--gradient_accumulation_steps=4 \
--use_8bit_adam \
--enable_xformers_memory_efficient_attention
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2
Mixed precision type: fp16
05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3
Mixed precision type: fp16
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'timestep_type', 'sigma_min', 'sigma_max', 'rescale_betas_zero_snr'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'attention_type', 'reverse_transformer_layers_per_block', 'dropout'} was not found in config. Values will be initialized to default values.
05/06/2024 17:09:14 - INFO - __main__ - Loading existing controlnet weights
{'mid_block_type'} was not found in config. Values will be initialized to default values.
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank3]: main(args)
[rank3]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank3]: controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank3]: result = tuple(
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank3]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank3]: return self.prepare_model(obj, device_placement=device_placement)
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank3]: model = torch.nn.parallel.DistributedDataParallel(
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank3]: self._ddp_init_helper(
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank3]: self.reducer = dist.Reducer(
[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU has a total capacity of 15.77 GiB of which 3.51 GiB is free. Including non-PyTorch memory, this process has 12.26 GiB memory in use. Of the allocated memory 11.28 GiB is allocated by PyTorch, and 299.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank1]: main(args)
[rank1]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank1]: controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank1]: result = tuple(
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank1]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank1]: return self.prepare_model(obj, device_placement=device_placement)
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank1]: model = torch.nn.parallel.DistributedDataParallel(
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank1]: self._ddp_init_helper(
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank1]: self.reducer = dist.Reducer(
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU has a total capacity of 15.77 GiB of which 3.53 GiB is free. Including non-PyTorch memory, this process has 12.24 GiB memory in use. Of the allocated memory 11.28 GiB is allocated by PyTorch, and 299.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank2]: main(args)
[rank2]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank2]: controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank2]: result = tuple(
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank2]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank2]: return self.prepare_model(obj, device_placement=device_placement)
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank2]: model = torch.nn.parallel.DistributedDataParallel(
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank2]: self._ddp_init_helper(
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank2]: self.reducer = dist.Reducer(
[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU has a total capacity of 15.77 GiB of which 3.49 GiB is free. Including non-PyTorch memory, this process has 12.28 GiB memory in use. Of the allocated memory 11.28 GiB is allocated by PyTorch, and 299.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank0]: main(args)
[rank0]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank0]: controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank0]: result = tuple(
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank0]: return self.prepare_model(obj, device_placement=device_placement)
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank0]: model = torch.nn.parallel.DistributedDataParallel(
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank0]: self._ddp_init_helper(
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank0]: self.reducer = dist.Reducer(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU
E0506 17:09:25.526000 140715603318592 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 65009) of binary: /home/ec2-user/anaconda3/envs/pytorch_p310/bin/python3.10
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p310/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
multi_gpu_launcher(args)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train_controlnet_webdataset.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-05-06_17:09:25
host : ip-172-16-92-209.us-west-2.compute.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 65010)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-05-06_17:09:25
host : ip-172-16-92-209.us-west-2.compute.internal
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 65011)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-05-06_17:09:25
host : ip-172-16-92-209.us-west-2.compute.internal
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 65012)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-06_17:09:25
host : ip-172-16-92-209.us-west-2.compute.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 65009)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Where is the text being encoded? Also, did you make any major modifications to the training script you're using?
In this case I just used the WebDataset ControlNet script with no changes.
I created a tarball of 12 frames with accompanying txt files (using webdataset's API) to serve as my dataset.
I think the 16 GB of memory on the V100s is simply not sufficient. You were previously on an A100.
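A rough back-of-the-envelope check of that point, sketched below with assumed numbers: the trainable parameter count is inferred from the 4.66 GiB allocation in the logs, not measured from the script.

```python
# The DDP reducer tried to allocate 4.66 GiB of fp32 gradient buckets,
# which implies roughly this many trainable ControlNet parameters:
GIB = 1024 ** 3
params = 4.66 * GIB / 4            # 4 bytes per fp32 gradient
print(f"implied trainable params: {params / 1e9:.2f}B")  # ~1.25B

fp32_weights = params * 4 / GIB    # ControlNet weights kept in fp32 for training
ddp_grads    = params * 4 / GIB    # fp32 gradient buckets allocated by the reducer
adam_8bit    = params * 2 / GIB    # two 1-byte states per param with 8-bit Adam
print(f"ControlNet training state alone: ~{fp32_weights + ddp_grads + adam_8bit:.1f} GiB")
# On top of that come the frozen fp16 UNet/VAE/text encoders and activations,
# which leaves very little headroom on a 16 GiB V100 but fits on a 40 GiB A100.
```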
Describe the bug
Hi there. I've reliably used train_controlnet_sdxl.py on a single GPU on GCP (A100, 40 GB). I have had to switch to AWS and am presently using a p3.8xlarge, which has 4 V100 GPUs with 64 GB of GPU memory total.
Whenever I run my workflow on AWS I get a CUDA out-of-memory error just loading the dataset. I built bitsandbytes following HuggingFace's documentation. Again, this works on a single GPU just fine.
Reproduction
Possible issues:
- Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so'). It does appear to use CUDA, though, as evidenced below (a quick sanity check is sketched right after this list).
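One way to check the bitsandbytes concern in isolation is sketched below; this is a hypothetical sanity check, not part of the training script. If the CUDA binary really were unusable, constructing and stepping an 8-bit optimizer should fail outright.

```python
import torch
import bitsandbytes as bnb

# Tiny stand-in parameter on the GPU.
p = torch.nn.Parameter(torch.zeros(16, 16, device="cuda"))
opt = bnb.optim.AdamW8bit([p], lr=1e-4)

p.grad = torch.zeros_like(p)
opt.step()  # raises if the bitsandbytes CUDA kernels cannot be loaded
print("bitsandbytes", bnb.__version__, "- 8-bit AdamW step OK")
```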
accelerate command 1:
accelerate launch --multi_gpu /home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 --instance_data_dir /tmp/fai_cache/data/data --pretrained_vae_model_name_or_path madebyollin/sdxl-vae-fp16-fix --output_dir /tmp/demo-20240423032204 --resolution 512 --train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 1e-4 --lr_scheduler constant --lr_warmup_steps 0 --max_train_steps 1000 --checkpointing_steps 1000 --seed 0 --gradient_checkpointing --checkpoints_total_limit 3 --use_8bit_adam --enable_xformers_memory_efficient_attention --set_grads_to_none --dataloader_num_workers 4
accelerate command 2:
accelerate launch --multi_gpu /home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 --instance_data_dir /tmp/fai_cache/data/data --pretrained_vae_model_name_or_path madebyollin/sdxl-vae-fp16-fix --output_dir /tmp/demo-20240423032204 --resolution 512 --train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 1e-4 --lr_scheduler constant --lr_warmup_steps 0 --max_train_steps 1000 --checkpointing_steps 1000 --seed 0 --gradient_checkpointing --checkpoints_total_limit 3 --use_8bit_adam --enable_xformers_memory_efficient_attention --set_grads_to_none
Logs
2024-04-23 03:22:04 - train.train - INFO - Running ['accelerate', 'launch', '--multi_gpu', '/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py', '--pretrained_model_name_or_path', 'stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir', '/tmp/fai_cache/data/data', '--pretrained_vae_model_name_or_path', 'madebyollin/sdxl-vae-fp16-fix', '--output_dir', '/tmp/demo-20240423032204', '--resolution', '512', '--train_batch_size', '1', '--gradient_accumulation_steps', '4', '--learning_rate', '1e-4', '--lr_scheduler', 'constant', '--lr_warmup_steps', '0', '--max_train_steps', '1000', '--checkpointing_steps', '1000', '--seed', '0', '--gradient_checkpointing', '--checkpoints_total_limit', '3', '--use_8bit_adam', '--enable_xformers_memory_efficient_attention', '--set_grads_to_none', '--dataloader_num_workers', '4']
2024-04-23 03:22:48 - root - INFO - Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
04/23/2024 03:22:12 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio', 'rescale_betas_zero_snr', 'thresholding'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'dropout', 'attention_type', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
04/23/2024 03:22:23 - INFO - __main__ - Initializing controlnet weights from unet
Map:   0%|          | 0/29 [00:00<?, ? examples/s]
Map: 100%|██████████| 29/29 [00:00<00:00, 50.84 examples/s]
Map: 100%|██████████| 29/29 [00:00<00:00, 49.14 examples/s]
Traceback (most recent call last):
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1438, in <module>
    main(args)
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1192, in main
    controlnet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 0; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.52 GiB free; 11.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 109689) of binary: /home/ubuntu/anaconda3/envs/mvp/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mvp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-04-23_03:22:47
  host :
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 109689)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
System Info
- diffusers version: 0.28.0.dev0
- Platform: Linux-6.5.0-1018-aws-x86_64-with-glibc2.35
- Python version: 3.10.6
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Huggingface_hub version: 0.22.2
- Transformers version: 4.31.0
- Accelerate version: 0.21.0
- xFormers version: 0.0.22
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
NVIDIA-SMI 535.171.04, Driver Version: 535.171.04, CUDA Version: 12.2
GPU 0: Tesla V100-SXM2-16GB | 0MiB / 16384MiB | 0% util
GPU 1: Tesla V100-SXM2-16GB | 0MiB / 16384MiB | 0% util
GPU 2: Tesla V100-SXM2-16GB | 0MiB / 16384MiB | 0% util
GPU 3: Tesla V100-SXM2-16GB | 0MiB / 16384MiB | 0% util
Processes: No running processes found
Who can help?
@sayakpaul @yiyixuxu
Hi @asutermo,
I'm not sure whether "35GB when setting --train_batch_size=1 and --resolution=1024." is accurate or not, but please try the following:
1. Make sure you are passing --use_8bit_adam --enable_xformers_memory_efficient_attention --set_grads_to_none.
2. In train_controlnet_sdxl.py (or whatever script you are using), find "optimizer_class = bnb.optim.AdamW8bit" and change it to "bnb.optim.PagedAdamW8bit".
Good luck!
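For reference, a minimal sketch of the optimizer swap suggested above, assuming a bitsandbytes release recent enough to ship the paged optimizers; the tiny linear layer just stands in for the ControlNet built by the script.

```python
import torch
import bitsandbytes as bnb

# Stand-in for the ControlNet; the real script builds it from the UNet.
model = torch.nn.Linear(4096, 4096).cuda()

# Original script: optimizer_class = bnb.optim.AdamW8bit
# Suggested change: the paged variant can page optimizer state out of GPU
# memory (via unified memory) when GPU memory runs short.
optimizer_class = bnb.optim.PagedAdamW8bit
optimizer = optimizer_class(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-2)
```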
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.