huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

train_text_to_image.py multi_gpu training cuda out of memory error but sufficient memory when using single GPU #3382

Closed TrueWheelProgramming closed 1 year ago

TrueWheelProgramming commented 1 year ago

Describe the bug

When using the train_text_to_image.py example script on a single NVIDIA A10G GPU, the script works great. However, on 4x NVIDIA A10G with the same input arguments plus the --multi_gpu accelerate flag, all 4 GPUs run out of memory before the first step completes.

In the single-GPU case I can train with a batch size of 1 at a resolution of 512.

In the multi-GPU case, even a batch size of 1 at a resolution of 64 results in a CUDA out-of-memory error.

Is there any reason why multi-GPU training would use significantly more memory?

Is there an issue with my (i) accelerate config or (ii) script arguments?

Thanks in advance.

Reproduction

Accelerate Config

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Command

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export dataset_name="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16" --multi_gpu train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=64 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"

Logs

RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 2; 22.20 GiB total capacity; 20.13 GiB already allocated; 26.06 MiB free; 20.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   0%|                                                                                                  | 1/15000 [00:06<26:26:53,  6.35s/it, lr=1e-5, step_loss=0.44]
[10:27:31] ERROR    failed (exitcode: 1) local_rank: 0 (pid: 75374) of binary: /opt/conda/bin/python3.9                                                           api.py:671
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
ChildFailedError: 
============================================================
train_text_to_image.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-05-10_10:27:31
  host      : ip-10-0-6-163.eu-west-1.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 75375)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-05-10_10:27:31
  host      : ip-10-0-6-163.eu-west-1.compute.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 75376)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-05-10_10:27:31
  host      : ip-10-0-6-163.eu-west-1.compute.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 75377)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-10_10:27:31
  host      : ip-10-0-6-163.eu-west-1.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 75374)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

System Info

patrickvonplaten commented 1 year ago

It looks like you're using "multi-gpu" in a non-distributed environment:

num_processes: 1

Please make sure to use distributed multi-GPU training (see https://pytorch.org/tutorials/beginner/ddp_series_multigpu.html). This should also be set correctly when running accelerate config before starting the training.
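
For reference, a corrected config would set num_processes to the number of GPUs being used. A minimal sketch, assuming the 4x A10G single-machine setup from the report, with all other fields kept from the config posted above:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4  # one process per GPU; this was 1 in the failing config
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false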

TrueWheelProgramming commented 1 year ago

You're right. Example working with the correct config 🤦

matyasbohacek commented 1 year ago

@TrueWheelProgramming how exactly did you resolve this, please?
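
Based on the exchange above, the fix appears to have been regenerating the accelerate config so that num_processes matches the GPU count (4 here). The same can be done directly on the command line via the --num_processes flag of accelerate launch. A minimal sketch, assuming 4 GPUs and the same script arguments as in the reproduction, with the resolution restored to 512 as in the working single-GPU run:

accelerate launch --multi_gpu --num_processes=4 --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"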