[Open] asutermo opened this issue 7 months ago
Sure, here's the result
2024-04-23 21:06:22 - train.train - INFO - Running ['accelerate', 'launch', '--multi_gpu', '--num_processes=4', '/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py', '--pretrained_model_name_or_path', 'stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir', '/tmp/fai_cache/data/data', '--pretrained_vae_model_name_or_path', 'madebyollin/sdxl-vae-fp16-fix', '--output_dir', '/tmp/demo-20240423210622', '--resolution', '512', '--train_batch_size', '1', '--gradient_accumulation_steps', '4', '--learning_rate', '1e-4', '--lr_scheduler', 'constant', '--lr_warmup_steps', '0', '--max_train_steps', '1000', '--checkpointing_steps', '1000', '--seed', '0', '--gradient_checkpointing', '--checkpoints_total_limit', '3', '--use_8bit_adam', '--enable_xformers_memory_efficient_attention', '--set_grads_to_none']
2024-04-23 21:08:51 - root - INFO - Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2
Mixed precision type: fp16
04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
04/23/2024 21:06:30 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'variance_type', 'dynamic_thresholding_ratio', 'rescale_betas_zero_snr', 'clip_sample_range', 'thresholding'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'dropout', 'reverse_transformer_layers_per_block', 'attention_type'} was not found in config. Values will be initialized to default values.
04/23/2024 21:07:06 - INFO - __main__ - Initializing controlnet weights from unet
Map:   0%|          | 0/29 [00:00<?, ? examples/s]
Map: 100%|██████████| 29/29 [00:02<00:00, 13.42 examples/s]
Map: 100%|██████████| 29/29 [00:02<00:00, 13.19 examples/s]
All four ranks raised the same traceback:
Traceback (most recent call last):
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1438, in <module>
    main(args)
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1192, in main
    controlnet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 0; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.58 GiB free; 11.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 1; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.61 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 2; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 3; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12126) of binary: /home/ubuntu/anaconda3/envs/mvp/bin/python
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/mvp/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
multi_gpu_launcher(args)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-04-23_21:08:51
host : ip-172-31-28-3.us-west-2.compute.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 12127)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-04-23_21:08:51
host : ip-172-31-28-3.us-west-2.compute.internal
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 12128)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-04-23_21:08:51
host : ip-172-31-28-3.us-west-2.compute.internal
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 12129)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-23_21:08:51
host : ip-172-31-28-3.us-west-2.compute.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 12126)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
2024-04-23 21:08:51 - root - INFO - Command exited with code 1
This could be because of how we perform text encoding.
Could you maybe refer to this script and make adjustments accordingly?
I tried running just that script to see. The dataset is quite tiny, but no luck: still CUDA out of memory across multiple GPUs.
File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1227, in main
File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1227, in main
controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
result = tuple(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
result = tuple(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
result = tuple(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
return self.prepare_model(obj, device_placement=device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
return self.prepare_model(obj, device_placement=device_placement)
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
model = torch.nn.parallel.DistributedDataParallel(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
model = torch.nn.parallel.DistributedDataParallel(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
self._ddp_init_helper(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
self._ddp_init_helper(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
self._ddp_init_helper(
File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
self.reducer = dist.Reducer(
torch.cuda. OutOfMemoryErrorself.reducer = dist.Reducer(:
CUDA out of memory. Tried to allocate 4.66 GiB (GPU 2; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 3; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Did you only change the dataloader to WebDataset? I'm afraid that alone probably won't work. In https://github.com/huggingface/diffusers/blob/main/examples/research_projects/controlnet/train_controlnet_webdataset.py, we also compute the caption embeddings during the training epochs, NOT ahead of time.
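To illustrate that distinction, here is a minimal, hedged sketch of per-batch text encoding inside the training loop. The small CLIP checkpoint and the `captions_stream` stand-in are illustrative assumptions only; the actual SDXL scripts load two text encoders and also use pooled embeddings.

```python
import torch
from transformers import AutoTokenizer, CLIPTextModel

# Illustrative small text encoder; the real scripts load the SDXL text encoders.
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

# Stand-in for batches streamed from the WebDataset loader.
captions_stream = [["a photo of a cat"], ["a photo of a dog"]]

for captions in captions_stream:
    with torch.no_grad():
        tokens = tokenizer(captions, padding="max_length", truncation=True, return_tensors="pt")
        # Embeddings are computed for this batch only and freed after the step,
        # instead of being precomputed (and kept around) for the whole dataset.
        prompt_embeds = text_encoder(tokens.input_ids).last_hidden_state
    # ...the ControlNet forward/backward pass would consume prompt_embeds here...
```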
Ahh I understand now. I will give that a try. FWIW, I also looked at using deepspeed and other routes, all with the same result. But I'll pursue the embeddings change here.
So, we're going to have large datasets anyway, and I opted to swap to WebDataset. I tried running the WebDataset ControlNet script, and it still fails. The test dataset itself is 12 images.
It seems the embeddings change had no tangible effect here.
All four ranks raised the same traceback:
Traceback (most recent call last):
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1462, in <module>
    main(args)
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_webdataset_sdxl.py", line 1221, in main
    controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 0; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.61 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 1; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.61 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 2; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 3; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.53 GiB free; 11.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I switched to SageMaker (p3.8xlarge) here to see if there's any availability. Still failing. Guidance requested.
!accelerate launch --mixed_precision="fp16" --num_processes=4 --multi_gpu scripts/train_controlnet_webdataset.py \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-xl-base-1.0" \
--train_shards_path_or_url data \
--eval_shards_path_or_url data \
--output_dir output \
--pretrained_vae_model_name_or_path "madebyollin/sdxl-vae-fp16-fix" \
--controlnet_model_name_or_path "diffusers/controlnet-canny-sdxl-1.0" \
--resolution 512 \
--train_batch_size 1 \
--lr_scheduler constant \
--lr_warmup_steps 0 \
--max_train_samples 1000 \
--seed 0 \
--mixed_precision="fp16" \
--gradient_accumulation_steps=4 \
--use_8bit_adam \
--enable_xformers_memory_efficient_attention
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2
Mixed precision type: fp16
05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: fp16
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
05/06/2024 17:09:09 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3
Mixed precision type: fp16
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'timestep_type', 'sigma_min', 'sigma_max', 'rescale_betas_zero_snr'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'attention_type', 'reverse_transformer_layers_per_block', 'dropout'} was not found in config. Values will be initialized to default values.
05/06/2024 17:09:14 - INFO - __main__ - Loading existing controlnet weights
{'mid_block_type'} was not found in config. Values will be initialized to default values.
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank3]: main(args)
[rank3]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank3]: controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank3]: result = tuple(
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank3]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank3]: return self.prepare_model(obj, device_placement=device_placement)
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank3]: model = torch.nn.parallel.DistributedDataParallel(
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank3]: self._ddp_init_helper(
[rank3]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank3]: self.reducer = dist.Reducer(
[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU has a total capacity of 15.77 GiB of which 3.51 GiB is free. Including non-PyTorch memory, this process has 12.26 GiB memory in use. Of the allocated memory 11.28 GiB is allocated by PyTorch, and 299.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank1]: main(args)
[rank1]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank1]: controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank1]: result = tuple(
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank1]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank1]: return self.prepare_model(obj, device_placement=device_placement)
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank1]: model = torch.nn.parallel.DistributedDataParallel(
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank1]: self._ddp_init_helper(
[rank1]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank1]: self.reducer = dist.Reducer(
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU has a total capacity of 15.77 GiB of which 3.53 GiB is free. Including non-PyTorch memory, this process has 12.24 GiB memory in use. Of the allocated memory 11.28 GiB is allocated by PyTorch, and 299.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank2]: main(args)
[rank2]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank2]: controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank2]: result = tuple(
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank2]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank2]: return self.prepare_model(obj, device_placement=device_placement)
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank2]: model = torch.nn.parallel.DistributedDataParallel(
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank2]: self._ddp_init_helper(
[rank2]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank2]: self.reducer = dist.Reducer(
[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU has a total capacity of 15.77 GiB of which 3.49 GiB is free. Including non-PyTorch memory, this process has 12.28 GiB memory in use. Of the allocated memory 11.28 GiB is allocated by PyTorch, and 299.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1472, in <module>
[rank0]: main(args)
[rank0]: File "/home/ec2-user/SageMaker/scripts/train_controlnet_webdataset.py", line 1231, in main
[rank0]: controlnet, optimizer, lr_scheduler = accelerator.prepare(controlnet, optimizer, lr_scheduler)
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank0]: result = tuple(
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank0]: return self.prepare_model(obj, device_placement=device_placement)
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 1428, in prepare_model
[rank0]: model = torch.nn.parallel.DistributedDataParallel(
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 812, in __init__
[rank0]: self._ddp_init_helper(
[rank0]: File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1152, in _ddp_init_helper
[rank0]: self.reducer = dist.Reducer(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB. GPU
E0506 17:09:25.526000 140715603318592 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 65009) of binary: /home/ec2-user/anaconda3/envs/pytorch_p310/bin/python3.10
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p310/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
multi_gpu_launcher(args)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train_controlnet_webdataset.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-05-06_17:09:25
host : ip-172-16-92-209.us-west-2.compute.internal
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 65010)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-05-06_17:09:25
host : ip-172-16-92-209.us-west-2.compute.internal
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 65011)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-05-06_17:09:25
host : ip-172-16-92-209.us-west-2.compute.internal
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 65012)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-06_17:09:25
host : ip-172-16-92-209.us-west-2.compute.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 65009)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Where is the text being encoded? Also, did you make any major modifications to the training script you're using?
In this case I just used the WebDataset ControlNet script with no changes.
I created a tarball of 12 frames with accompanying txt files (using webdataset's API) to serve as my dataset.
I think the 16 GB of memory on the V100s is simply not sufficient. You were previously on an A100.
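A rough back-of-the-envelope check of that point, sketched below with assumed numbers: the trainable parameter count is inferred from the 4.66 GiB allocation in the logs, not measured from the script.

```python
# The DDP reducer tried to allocate 4.66 GiB of fp32 gradient buckets,
# which implies roughly this many trainable ControlNet parameters:
GIB = 1024 ** 3
params = 4.66 * GIB / 4            # 4 bytes per fp32 gradient
print(f"implied trainable params: {params / 1e9:.2f}B")  # ~1.25B

fp32_weights = params * 4 / GIB    # ControlNet weights kept in fp32 for training
ddp_grads    = params * 4 / GIB    # fp32 gradient buckets allocated by the reducer
adam_8bit    = params * 2 / GIB    # two 1-byte states per param with 8-bit Adam
print(f"ControlNet training state alone: ~{fp32_weights + ddp_grads + adam_8bit:.1f} GiB")
# On top of that come the frozen fp16 UNet/VAE/text encoders and activations,
# which leaves very little headroom on a 16 GiB V100 but fits on a 40 GiB A100.
```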
Describe the bug
Hi there. I've reliably used train_controlnet_sdxl.py on a single GPU on GCP (A100, 40 GB). I have had to switch to AWS and am presently using a p3.8xlarge, which has 4 V100 GPUs with 64 GB of GPU memory total.
Whenever I run my workflow on AWS I get a CUDA out-of-memory error just loading the dataset. I built bitsandbytes following HuggingFace's documentation. Again, this works on a single GPU just fine.
Reproduction
Possible issues:
- Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so'). It does appear to use CUDA, though, as evidenced below (a quick sanity check is sketched right after this list).
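One way to check the bitsandbytes concern in isolation is sketched below; this is a hypothetical sanity check, not part of the training script. If the CUDA binary really were unusable, constructing and stepping an 8-bit optimizer should fail outright.

```python
import torch
import bitsandbytes as bnb

# Tiny stand-in parameter on the GPU.
p = torch.nn.Parameter(torch.zeros(16, 16, device="cuda"))
opt = bnb.optim.AdamW8bit([p], lr=1e-4)

p.grad = torch.zeros_like(p)
opt.step()  # raises if the bitsandbytes CUDA kernels cannot be loaded
print("bitsandbytes", bnb.__version__, "- 8-bit AdamW step OK")
```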
accelerate command 1:
accelerate launch --multi_gpu /home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 --instance_data_dir /tmp/fai_cache/data/data --pretrained_vae_model_name_or_path madebyollin/sdxl-vae-fp16-fix --output_dir /tmp/demo-20240423032204 --resolution 512 --train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 1e-4 --lr_scheduler constant --lr_warmup_steps 0 --max_train_steps 1000 --checkpointing_steps 1000 --seed 0 --gradient_checkpointing --checkpoints_total_limit 3 --use_8bit_adam --enable_xformers_memory_efficient_attention --set_grads_to_none --dataloader_num_workers 4
accelerate command 2:
accelerate launch --multi_gpu /home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 --instance_data_dir /tmp/fai_cache/data/data --pretrained_vae_model_name_or_path madebyollin/sdxl-vae-fp16-fix --output_dir /tmp/demo-20240423032204 --resolution 512 --train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 1e-4 --lr_scheduler constant --lr_warmup_steps 0 --max_train_steps 1000 --checkpointing_steps 1000 --seed 0 --gradient_checkpointing --checkpoints_total_limit 3 --use_8bit_adam --enable_xformers_memory_efficient_attention --set_grads_to_none
Logs
2024-04-23 03:22:04 - train.train - INFO - Running ['accelerate', 'launch', '--multi_gpu', '/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py', '--pretrained_model_name_or_path', 'stabilityai/stable-diffusion-xl-base-1.0', '--instance_data_dir', '/tmp/fai_cache/data/data', '--pretrained_vae_model_name_or_path', 'madebyollin/sdxl-vae-fp16-fix', '--output_dir', '/tmp/demo-20240423032204', '--resolution', '512', '--train_batch_size', '1', '--gradient_accumulation_steps', '4', '--learning_rate', '1e-4', '--lr_scheduler', 'constant', '--lr_warmup_steps', '0', '--max_train_steps', '1000', '--checkpointing_steps', '1000', '--seed', '0', '--gradient_checkpointing', '--checkpoints_total_limit', '3', '--use_8bit_adam', '--enable_xformers_memory_efficient_attention', '--set_grads_to_none', '--dataloader_num_workers', '4']
2024-04-23 03:22:48 - root - INFO - Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
Could not find the bitsandbytes CUDA binary at PosixPath('/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so')
04/23/2024 03:22:12 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio', 'rescale_betas_zero_snr', 'thresholding'} was not found in config. Values will be initialized to default values.
{'latents_mean', 'latents_std'} was not found in config. Values will be initialized to default values.
{'dropout', 'attention_type', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
04/23/2024 03:22:23 - INFO - __main__ - Initializing controlnet weights from unet
Map:   0%|          | 0/29 [00:00<?, ? examples/s]
Map: 100%|██████████| 29/29 [00:00<00:00, 50.84 examples/s]
Map: 100%|██████████| 29/29 [00:00<00:00, 49.14 examples/s]
Traceback (most recent call last):
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1438, in <module>
    main(args)
  File "/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py", line 1192, in main
    controlnet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/accelerator.py", line 1340, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 688, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.66 GiB (GPU 0; 15.77 GiB total capacity; 11.28 GiB already allocated; 3.52 GiB free; 11.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 109689) of binary: /home/ubuntu/anaconda3/envs/mvp/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mvp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/mvp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/ubuntu/src/mvp/backend/train/huggingface_train_scripts/train_controlnet_sdxl.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-04-23_03:22:47
  host :
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 109689)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
System Info
- diffusers version: 0.28.0.dev0
- Platform: Linux-6.5.0-1018-aws-x86_64-with-glibc2.35
- Python version: 3.10.6
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Huggingface_hub version: 0.22.2
- Transformers version: 4.31.0
- Accelerate version: 0.21.0
- xFormers version: 0.0.22
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
NVIDIA-SMI 535.171.04, Driver Version: 535.171.04, CUDA Version: 12.2
GPU 0: Tesla V100-SXM2-16GB | 0MiB / 16384MiB | 0% util
GPU 1: Tesla V100-SXM2-16GB | 0MiB / 16384MiB | 0% util
GPU 2: Tesla V100-SXM2-16GB | 0MiB / 16384MiB | 0% util
GPU 3: Tesla V100-SXM2-16GB | 0MiB / 16384MiB | 0% util
Processes: No running processes found
Who can help?
@sayakpaul @yiyixuxu
Hi @asutermo,
I'm not sure whether "35GB when setting --train_batch_size=1 and --resolution=1024." is accurate or not, but please try the following:
1. Make sure you are passing --use_8bit_adam --enable_xformers_memory_efficient_attention --set_grads_to_none.
2. In train_controlnet_sdxl.py (or whatever script you are using), find "optimizer_class = bnb.optim.AdamW8bit" and change it to "bnb.optim.PagedAdamW8bit".
Good luck!
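For reference, a minimal sketch of the optimizer swap suggested above, assuming a bitsandbytes release recent enough to ship the paged optimizers; the tiny linear layer just stands in for the ControlNet built by the script.

```python
import torch
import bitsandbytes as bnb

# Stand-in for the ControlNet; the real script builds it from the UNet.
model = torch.nn.Linear(4096, 4096).cuda()

# Original script: optimizer_class = bnb.optim.AdamW8bit
# Suggested change: the paged variant can page optimizer state out of GPU
# memory (via unified memory) when GPU memory runs short.
optimizer_class = bnb.optim.PagedAdamW8bit
optimizer = optimizer_class(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-2)
```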
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.