huggingface / diffusers

πŸ€— Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

train_controlnet_sdxl.py Multi-GPU OutOfMemory #7757

Open asutermo opened 7 months ago

asutermo commented 7 months ago

Describe the bug

Hi there. I've reliably used train_controlnet_sdxl.py on a single GPU on GCP (A100, 40 GB). I have had to switch to AWS and am presently using a p3.8xlarge, which has 4 V100 GPUs with 64 GB of GPU memory total.

Whenever I run my workflow on AWS, I get a CUDA out-of-memory error just loading the dataset. I built bitsandbytes following Hugging Face's documentation. Again, this works fine on a single GPU.

Reproduction

Possible issues:

  • Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so'). It does seem to use CUDA, though, as evidenced below.

accelerate command 1:

accelerate launch --multi_gpu /home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 --instance_data_dir /tmp/fai_cache/data/data --pretrained_vae_model_name_or_path madebyollin/sdxl-vae-fp16-fix --output_dir /tmp/demo-20240423032204 --resolution 512 --train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 1e-4 --lr_scheduler constant --lr_warmup_steps 0 --max_train_steps 1000 --checkpointing_steps 1000 --seed 0 --gradient_checkpointing --checkpoints_total_limit 3 --use_8bit_adam --enable_xformers_memory_efficient_attention --set_grads_to_none --dataloader_num_workers 4

accelerate command 2:

accelerate launch --multi_gpu /home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 --instance_data_dir /tmp/fai_cache/data/data --pretrained_vae_model_name_or_path madebyollin/sdxl-vae-fp16-fix --output_dir /tmp/demo-20240423032204 --resolution 512 --train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 1e-4 --lr_scheduler constant --lr_warmup_steps 0 --max_train_steps 1000 --checkpointing_steps 1000 --seed 0 --gradient_checkpointing --checkpoints_total_limit 3 --use_8bit_adam --enable_xformers_memory_efficient_attention --set_grads_to_none

Logs

2024-04-23 03:22:04 - train.train - INFO - Running ['accelerate', 'launch', '--multi_gpu', '/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py', '--pretrained_model_name_or_path', 'stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir', '/tmp/fai_cache/data/data', '--pretrained_vae_model_name_or_path', 'madebyollin/sdxl-vae-fp16-fix', '--output_dir', '/tmp/demo-20240423032204', '--resolution', '512', '--train_batch_size', '1', '--gradient_accumulation_steps', '4', '--learning_rate', '1e-4', '--lr_scheduler', 'constant', '--lr_warmup_steps', '0', '--max_train_steps', '1000', '--checkpointing_steps', '1000', '--seed', '0', '--gradient_checkpointing', '--checkpoints_total_limit', '3', '--use_8bit_adam', '--enable_xformers_memory_efficient_attention', '--set_grads_to_none', '--dataloader_num_workers', '4']
2024-04-23 03:22:48 - root - INFO - Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
04/23/2024 03:22:12 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio', 'rescale_betas_zero_snr', 'thresholding'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'dropout', 'attention_type', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
04/23/2024 03:22:23 - INFO - __main__ - Initializing controlnet weights from unet

Map:   0%|          | 0/29 [00:00<?, ? examples/s]
Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 29/29 [00:00<00:00, 50.84 examples/s]
Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 29/29 [00:00<00:00, 49.14 examples/s]
Traceback (most recent call last):
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1438, in <module>
    main(args)
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1192, in main
    controlnet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 0; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.52 GiB free; 11.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 109689) of binary: /home/ubuntu/anaconda3/envs/mvp/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mvp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-23_03:22:47
  host      :
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 109689)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

System Info

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:1B.0 Off |                    0 |
| N/A   35C    P0              36W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           Off | 00000000:00:1C.0 Off |                    0 |
| N/A   38C    P0              37W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           Off | 00000000:00:1D.0 Off |                    0 |
| N/A   34C    P0              38W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-16GB           Off | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P0              37W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Who can help?

@sayakpaul @yiyixuxu

asutermo commented 7 months ago

Sure, here's the result

2024-04-23 21:06:22 - train.train - INFO - Running ['accelerate', 'launch', '--multi_gpu', '--num_processes=4', '/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py', '--pretrained_model_name_or_path', 'stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir', '/tmp/fai_cache/data/data', '--pretrained_vae_model_name_or_path', 'madebyollin/sdxl-vae-fp16-fix', '--output_dir', '/tmp/demo-20240423210622', '--resolution', '512', '--train_batch_size', '1', '--gradient_accumulation_steps', '4', '--learning_rate', '1e-4', '--lr_scheduler', 'constant', '--lr_warmup_steps', '0', '--max_train_steps', '1000', '--checkpointing_steps', '1000', '--seed', '0', '--gradient_checkpointing', '--checkpoints_total_limit', '3', '--use_8bit_adam', '--enable_xformers_memory_efficient_attention', '--set_grads_to_none']
2024-04-23 21:08:51 - root - INFO - Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2

Mixed precision type: fp16

04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: fp16

04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'variance_type', 'dynamic_thresholding_ratio', 'rescale_betas_zero_snr', 'clip_sample_range', 'thresholding'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'dropout', 'reverse_transformer_layers_per_block', 'attention_type'} was not found in config. Values will be initialized to default values.
04/23/2024 21:07:06 - INFO - __main__ - Initializing controlnet weights from unet

Map:   0%|          | 0/29 [00:00<?, ? examples/s]
Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 29/29 [00:02<00:00, 13.42 examples/s]
Map: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 29/29 [00:02<00:00, 13.19 examples/s]
(the same traceback is raised on each of the four ranks)
Traceback (most recent call last):
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1438, in <module>
    main(args)
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1192, in main
    controlnet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 0; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.58 GiB free; 11.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 1; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.61 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 2; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 3; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12126) of binary: /home/ubuntu/anaconda3/envs/mvp/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mvp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-23_21:08:51
  host      : ip-172-31-28-3.us-west-2.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 12127)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-04-23_21:08:51
  host      : ip-172-31-28-3.us-west-2.compute.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 12128)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-04-23_21:08:51
  host      : ip-172-31-28-3.us-west-2.compute.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 12129)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-23_21:08:51
  host      : ip-172-31-28-3.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 12126)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

2024-04-23 21:08:51 - root - INFO - Command exited with code 1
sayakpaul commented 7 months ago

This could be because of how we perform text encoding.

Could you maybe refer to this script and make adjustments accordingly?
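One thing worth checking on a 16 GB card: if the frozen text encoders (and VAE) are still resident on every rank by the time accelerator.prepare() wraps the ControlNet in DDP (which is where the traceback shows the OOM), releasing them first frees a few GiB per GPU. A minimal sketch with stand-in modules, not the script's exact code:

import gc
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# stand-ins for the frozen SDXL text encoders; any nn.Module behaves the same way
text_encoder_one = nn.Linear(768, 768).to(device)
text_encoder_two = nn.Linear(1280, 1280).to(device)

# ... precompute and cache the prompt embeddings here ...

# once the embeddings are cached, the encoders are dead weight on every rank
for enc in (text_encoder_one, text_encoder_two):
    enc.to("cpu")
del text_encoder_one, text_encoder_two
gc.collect()
torch.cuda.empty_cache()  # hand the freed blocks back to CUDA before accelerator.prepare()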

asutermo commented 6 months ago

I tried running just that script to see. The dataset is quite tiny. No luck; still CUDA out of memory across multiple GPUs:

 File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1227, in main
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1227, in main
    controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
      File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.    OutOfMemoryErrorself.reducer = dist.Reducer(: 
CUDA out of memory. Tried to allocate 4.66 GiB (GPU 2; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 3; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
sayakpaul commented 6 months ago

Did you only change the dataloader to WebDataset? I'm afraid that alone is probably not going to work. In https://github.com/huggingface/diffusers/blob/main/examples/research_projects/controlnet/train_controlnet_webdataset.py, we also compute the caption embeddings during the epochs, NOT ahead of time.
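For reference, per-batch encoding amounts to something like the standalone sketch below. It mirrors the SDXL dual-encoder scheme (penultimate hidden states of both encoders concatenated, pooled projection taken from the second encoder) but is not the script's exact helper, and the example captions are made up:

import torch
from transformers import AutoTokenizer, CLIPTextModel, CLIPTextModelWithProjection

device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "stabilityai/stable-diffusion-xl-base-1.0"

tok_one = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer")
tok_two = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer_2")
enc_one = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder").to(device)
enc_two = CLIPTextModelWithProjection.from_pretrained(repo, subfolder="text_encoder_2").to(device)

captions = ["a photo of a cat", "a canny edge map of a house"]  # would come from the batch

with torch.no_grad():  # embeddings are conditioning inputs, no gradients needed
    ids_one = tok_one(captions, padding="max_length", max_length=tok_one.model_max_length,
                      truncation=True, return_tensors="pt").input_ids.to(device)
    ids_two = tok_two(captions, padding="max_length", max_length=tok_two.model_max_length,
                      truncation=True, return_tensors="pt").input_ids.to(device)
    out_one = enc_one(ids_one, output_hidden_states=True)
    out_two = enc_two(ids_two, output_hidden_states=True)
    # SDXL concatenates the penultimate hidden states of both encoders ...
    prompt_embeds = torch.cat([out_one.hidden_states[-2], out_two.hidden_states[-2]], dim=-1)
    # ... and takes the pooled projection from the second encoder
    pooled_embeds = out_two.text_embeds

print(prompt_embeds.shape, pooled_embeds.shape)  # (2, 77, 2048) and (2, 1280)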

asutermo commented 6 months ago

Ahh, I understand now. I will give that a try. FWIW, I also looked at using DeepSpeed and other routes, all with the same result. But I'll pursue the embeddings change here.

asutermo commented 6 months ago

So, we're going to have large datasets anyway, so I opted to switch to WebDataset. I tried running the webdataset ControlNet script, and it still fails. The test dataset itself is 12 images.

It seems the embeddings change had no tangible effect here.

(the same traceback is raised on each of the four ranks)
Traceback (most recent call last):
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1462, in <module>
    main(args)
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1221, in main
    controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 0; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.61 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 1; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.61 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 2; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 3; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
asutermo commented 6 months ago

I switched to SageMaker (p3.8xlarge) here to see if there's any availability. Still failing. Guidance requested.

!accelerate launch --mixed_precision="fp16" --num_processes=4 --multi_gpu scripts/train_controlnet_webdataset.py \
    --pretrained_model_name_or_path "stabilityai/stable-diffusion-xl-base-1.0" \
    --train_shards_path_or_url data \
    --eval_shards_path_or_url data \
    --output_dir output \
    --pretrained_vae_model_name_or_path "madebyollin/sdxl-vae-fp16-fix" \
    --controlnet_model_name_or_path "diffusers/controlnet-canny-sdxl-1.0" \
    --resolution 512 \
    --train_batch_size 1 \
    --lr_scheduler constant \
    --lr_warmup_steps 0 \
    --max_train_samples 1000 \
    --seed 0 \
    --mixed_precision="fp16" \
    --gradient_accumulation_steps=4 \
    --use_8bit_adam \
    --enable_xformers_memory_efficient_attention
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2

Mixed precision type: fp16

05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: fp16

/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3

Mixed precision type: fp16

/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'timestep_type', 'sigma_min', 'sigma_max', 'rescale_betas_zero_snr'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'attention_type', 'reverse_transformer_layers_per_block', 'dropout'} was not found in config. Values will be initialized to default values.
05/06/2024 17:09:14 - INFO - __main__ - Loading existing controlnet weights
{'mid_block_type'} was not found in config. Values will be initialized to default values.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank3]:     main(args)
[rank3]:   File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank3]:     controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank3]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank3]:     result = tuple(
[rank3]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank3]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank3]:     return self.prepare_model(obj, device_placement=device_placement)
[rank3]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank3]:     model = torch.nn.parallel.DistributedDataParallel(
[rank3]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank3]:     self._ddp_init_helper(
[rank3]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank3]:     self.reducer = dist.Reducer(
[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU  has a total capacity of 15.77 GiB of which 3.51 GiB is free. Including non-PyTorch memory, this process has 12.26 GiB memory in use. Of the allocated memory 11.28 GiB is allocated by PyTorch, and 299.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank1]:     main(args)
[rank1]:   File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank1]:     controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank1]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank1]:     result = tuple(
[rank1]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank1]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank1]:     return self.prepare_model(obj, device_placement=device_placement)
[rank1]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank1]:     model = torch.nn.parallel.DistributedDataParallel(
[rank1]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank1]:     self._ddp_init_helper(
[rank1]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank1]:     self.reducer = dist.Reducer(
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU  has a total capacity of 15.77 GiB of which 3.53 GiB is free. Including non-PyTorch memory, this process has 12.24 GiB memory in use. Of the allocated memory 11.28 GiB is allocated by PyTorch, and 299.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank2]:     main(args)
[rank2]:   File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank2]:     controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank2]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank2]:     result = tuple(
[rank2]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank2]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank2]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank2]:     return self.prepare_model(obj, device_placement=device_placement)
[rank2]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank2]:     model = torch.nn.parallel.DistributedDataParallel(
[rank2]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank2]:     self._ddp_init_helper(
[rank2]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank2]:     self.reducer = dist.Reducer(
[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU  has a total capacity of 15.77 GiB of which 3.49 GiB is free. Including non-PyTorch memory, this process has 12.28 GiB memory in use. Of the allocated memory 11.28 GiB is allocated by PyTorch, and 299.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank0]:     controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank0]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank0]:     result = tuple(
[rank0]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank0]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank0]:     return self.prepare_model(obj, device_placement=device_placement)
[rank0]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank0]:     model = torch.nn.parallel.DistributedDataParallel(
[rank0]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank0]:     self._ddp_init_helper(
[rank0]:   File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank0]:     self.reducer = dist.Reducer(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU 
E0506 17:09:25.526000 140715603318592 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 65009) of binary: /home/ec2-user/anaconda3/envs/pytorch_p310/bin/python3.10
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts/train_controlnet_webdataset.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-06_17:09:25
  host      : ip-172-16-92-209.us-west-2.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 65010)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-06_17:09:25
  host      : ip-172-16-92-209.us-west-2.compute.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 65011)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-05-06_17:09:25
  host      : ip-172-16-92-209.us-west-2.compute.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 65012)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-06_17:09:25
  host      : ip-172-16-92-209.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 65009)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
sayakpaul commented 6 months ago

Where is text being encoded? Also the training script you're using -- did you perform any major modifications to it?

asutermo commented 6 months ago

Where is text being encoded? Also the training script you're using -- did you perform any major modifications to it?

In this case I just used the webdataset ControlNet script with no changes.

I created a tarball of 12 frames with accompanying .txt caption files (using webdataset's API) to serve as my dataset.
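Roughly, such a tarball can be written with webdataset's TarWriter. A sketch, not the exact data: the file names, keys, and captions below are placeholders, and the keys the ControlNet webdataset script actually expects may differ:

import io

import webdataset as wds
from PIL import Image

sink = wds.TarWriter("train-000000.tar")
for i in range(12):
    img = Image.new("RGB", (512, 512), color=(i * 20, 0, 0))  # placeholder frame
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    sink.write({
        "__key__": f"sample{i:05d}",        # one key per sample
        "jpg": buf.getvalue(),              # image bytes
        "txt": f"placeholder caption {i}",  # caption paired with the image
    })
sink.close()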

xiankgx commented 6 months ago

I think the 16 GB of VRAM on the V100s is simply not sufficient. You were previously on an A100 (40 GB).
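For a rough sense of why 16 GB is tight: counting parameters with diffusers already accounts for most of the ~11 GiB the logs show as allocated before the DDP reducer asks for its extra 4.66 GiB. This is only a sketch, assuming the usual split of a frozen fp16 UNet plus a trainable fp32 ControlNet created via ControlNetModel.from_unet, and bitsandbytes 8-bit Adam at roughly 2 bytes of optimizer state per parameter:

import torch
from diffusers import ControlNetModel, UNet2DConditionModel

repo = "stabilityai/stable-diffusion-xl-base-1.0"
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet", torch_dtype=torch.float16)
controlnet = ControlNetModel.from_unet(unet)  # trainable copy of the down/mid blocks

def gib(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1024**3

n_unet = sum(p.numel() for p in unet.parameters())
n_ctrl = sum(p.numel() for p in controlnet.parameters())

print(f"frozen UNet, fp16 weights:              {gib(n_unet, 2):.2f} GiB")
print(f"ControlNet weights, fp32 in training:   {gib(n_ctrl, 4):.2f} GiB")
print(f"ControlNet gradients, fp32:             {gib(n_ctrl, 4):.2f} GiB")
print(f"8-bit Adam states (~2 bytes per param): {gib(n_ctrl, 2):.2f} GiB")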

DuShunpeng commented 5 months ago

Hi, @asutermo:

I'm not sure whether the claim of "35GB when setting --train_batch_size=1 and --resolution=1024" holds or not, but please try the following:

1. Make sure you are passing --use_8bit_adam, --enable_xformers_memory_efficient_attention, and --set_grads_to_none.
2. In train_controlnet_sdxl.py (or whichever script you are using), find optimizer_class = bnb.optim.AdamW8bit and change it to bnb.optim.PagedAdamW8bit (a minimal sketch follows below).
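A minimal sketch of the change suggested in step 2, assuming the optimizer selection follows the usual diffusers pattern (the surrounding code is paraphrased and a stand-in model replaces the actual ControlNet), and assuming a bitsandbytes build that ships the paged optimizers:

import torch
import torch.nn as nn
import bitsandbytes as bnb  # needs a build with working CUDA support

use_8bit_adam = True     # stand-in for args.use_8bit_adam
model = nn.Linear(8, 8)  # stand-in for the ControlNet being trained

if use_8bit_adam:
    # Original: optimizer_class = bnb.optim.AdamW8bit
    # The paged variant keeps optimizer state in paged memory so it can spill
    # to host RAM under GPU memory pressure, which may help on 16 GB V100s.
    optimizer_class = bnb.optim.PagedAdamW8bit
else:
    optimizer_class = torch.optim.AdamW

optimizer = optimizer_class(model.parameters(), lr=1e-4)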

Good luck!

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.