guoqincode / Open-AnimateAnyone

Unofficial Implementation of Animate Anyone
2.9k stars · 233 forks

Is 40 GB VRAM enough for training? #2

Closed Exusial closed 9 months ago

guoqincode commented 9 months ago

I was able to train on an 80G machine. If you want to train on a 40G machine, I would recommend lowering the batch size and increasing gradient accumulation; if it's still OOM, you can use DeepSpeed (I'll be integrating DeepSpeed training in the near future if there's enough training data).
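
For illustration, here is a minimal sketch of the batch-size / gradient-accumulation trade-off in a plain PyTorch loop. The model, data, and numbers are placeholders, not the actual settings from configs/training/train_stage_1.yaml:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-ins only -- the real model, data, and values come from
# configs/training/train_stage_1.yaml, not from this sketch.
model = nn.Linear(16, 1)
dataset = TensorDataset(torch.randn(64, 16), torch.randn(64, 1))
loader = DataLoader(dataset, batch_size=1)   # smaller per-step batch to fit in less VRAM
accumulation_steps = 4                       # effective batch = 1 * 4
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    # Scale the loss so the accumulated gradient matches one large-batch update.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```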

SmileTAT commented 9 months ago

torchrun --nnodes=1 --nproc_per_node=2 train.py --config configs/training/train_stage_1.yaml with A100-80G is OOM

guoqincode commented 9 months ago

torchrun --nnodes=1 --nproc_per_node=2 train.py --config configs/training/train_stage_1.yaml with A100-80G is OOM

Can you provide your environment and training logs?

SmileTAT commented 9 months ago

1. NVIDIA-SMI

```
NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C9:00.0 Off |                    0 |
| N/A   32C    P0    65W / 400W |      3MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CF:00.0 Off |                    0 |
| N/A   30C    P0    68W / 400W |      3MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
```

2. pip list

accelerate 0.25.0 aiohttp 3.9.1 aiosignal 1.3.1 altair 5.2.0 antlr4-python3-runtime 4.9.3 appdirs 1.4.4 asttokens 2.4.1 async-timeout 4.0.3 attrs 23.1.0 black 23.7.0 blinker 1.7.0 braceexpand 0.1.7 cachetools 5.3.2 certifi 2023.11.17 chardet 5.1.0 charset-normalizer 3.3.2 click 8.1.7 clip 1.0 cmake 3.27.9 contourpy 1.2.0 cycler 0.12.1 decorator 5.1.1 decord 0.6.0 diffusers 0.24.0 docker-pycreds 0.4.0 einops 0.7.0 exceptiongroup 1.2.0 executing 2.0.1 fairscale 0.4.13 filelock 3.13.1 fire 0.5.0 fonttools 4.46.0 frozenlist 1.4.0 fsspec 2023.12.1 ftfy 6.1.3 gitdb 4.0.11 GitPython 3.1.40 huggingface-hub 0.19.4 idna 3.6 imageio 2.33.1 importlib-metadata 6.11.0 invisible-watermark 0.2.0 ipython 8.18.1 jedi 0.19.1 Jinja2 3.1.2 jsonschema 4.20.0 jsonschema-specifications 2023.11.2 kiwisolver 1.4.5 kornia 0.6.9 lightning-utilities 0.10.0 lit 17.0.6 loralib 0.1.2 markdown-it-py 3.0.0 MarkupSafe 2.1.3 matplotlib 3.8.2 matplotlib-inline 0.1.6 mdurl 0.1.2 mpmath 1.3.0 multidict 6.0.4 mypy-extensions 1.0.0 natsort 8.4.0 networkx 3.2.1 ninja 1.11.1.1 numpy 1.26.2 nvidia-cublas-cu11 11.10.3.66 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cudnn-cu11 8.5.0.96 nvidia-cufft-cu11 10.9.0.58 nvidia-curand-cu11 10.2.10.91 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusparse-cu11 11.7.4.91 nvidia-nccl-cu11 2.14.3 nvidia-nvtx-cu11 11.7.91 omegaconf 2.3.0 open-clip-torch 2.23.0 opencv-python 4.6.0.66 packaging 23.2 pandas 2.1.3 parso 0.8.3 pathspec 0.11.2 pexpect 4.9.0 Pillow 10.1.0 pip 23.3.1 platformdirs 4.1.0 prompt-toolkit 3.0.41 protobuf 3.20.3 psutil 5.9.6 ptyprocess 0.7.0 pudb 2023.1 pure-eval 0.2.2 pyarrow 14.0.1 pydeck 0.8.1b0 Pygments 2.17.2 pyparsing 3.1.1 python-dateutil 2.8.2 pytorch-lightning 2.0.1 pytz 2023.3.post1 PyWavelets 1.5.0 PyYAML 6.0.1 referencing 0.31.1 regex 2023.10.3 requests 2.31.0 rich 13.7.0 rpds-py 0.13.2 safetensors 0.4.1 scipy 1.11.4 sentencepiece 0.1.99 sentry-sdk 1.38.0 setproctitle 1.3.3 setuptools 68.0.0 six 1.16.0 smmap 5.0.1 stack-data 0.6.3 streamlit 1.29.0 sympy 1.12 tenacity 8.2.3 tensorboardX 2.6 termcolor 2.4.0 timm 0.9.12 tokenizers 0.12.1 toml 0.10.2 tomli 2.0.1 toolz 0.12.0 torch 2.0.1 torchaudio 2.0.2 torchdata 0.6.1 torchmetrics 1.2.1 torchvision 0.15.2 tornado 6.4 tqdm 4.66.1 traitlets 5.14.0 transformers 4.32.0 triton 2.0.0 typing_extensions 4.8.0 tzdata 2023.3 tzlocal 5.2 urllib3 1.26.18 urwid 2.3.4 urwid-readline 0.13 validators 0.22.0 wandb 0.16.1 watchdog 3.0.0 wcwidth 0.2.12 webdataset 0.2.83 wheel 0.41.2 xformers 0.0.22 yarl 1.9.3 zipp 3.17.0

3. cmd

CUDA_VISIBLE_DEVICES=6,7 torchrun --nnodes=1 --nproc_per_node=2 train.py --config configs/training/train_stage_1.yaml

4. logs

```
Steps:   0%| | 27/30000 [00:24<7:22:00, 1.13it/s, lr=0.0001, step_loss=0.0587]
Traceback (most recent call last):
  File "AnimateAnyone-unofficial/train.py", line 574, in <module>
    main(name=name, launcher=args.launcher, use_wandb=args.wandb, **config)
  File "AnimateAnyone-unofficial/train.py", line 468, in main
    scaler.step(optimizer)
  File "anaconda/envs/generative-models/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 374, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "anaconda/envs/generative-models/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 290, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "anaconda/envs/generative-models/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "anaconda/envs/generative-models/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "anaconda/envs/generative-models/lib/python3.10/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "anaconda/envs/generative-models/lib/python3.10/site-packages/torch/optim/adamw.py", line 171, in step
    adamw(
  File "anaconda/envs/generative-models/lib/python3.10/site-packages/torch/optim/adamw.py", line 321, in adamw
    func(
  File "anaconda/envs/generative-models/lib/python3.10/site-packages/torch/optim/adamw.py", line 566, in _multi_tensor_adamw
    denom = torch._foreach_add(exp_avg_sq_sqrt, eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 79.35 GiB total capacity; 77.04 GiB already allocated; 5.19 MiB free; 77.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
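
For completeness, the max_split_size_mb hint at the end of the message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before torch initializes CUDA, e.g. at the very top of train.py. The 128 MiB value below is only an example; this setting mitigates fragmentation and would not help if memory genuinely grows every step:

```python
import os

# Must be set before the first CUDA allocation, so put it before importing torch
# (or at least before any .cuda()/.to("cuda") call). 128 MiB is an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402
```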

guoqincode commented 9 months ago

(Quoted SmileTAT's nvidia-smi output, pip list, command, and OOM traceback from the comment above.)

You can contact me at guoqin@stu.pku.edu.cn and I will check this issue carefully when I have time

SmileTAT commented 9 months ago

(Quoted the environment details, command, OOM traceback, and the reply above.)

I will first reuse the environment of magic-animate, and then check the code.

SmileTAT commented 9 months ago

torchrun --nnodes=1 --nproc_per_node=2 train.py --config configs/training/train_stage_1.yaml with A100-80G is OOM

The bug is in the ReferenceNetAttention class: some CUDA tensors are not released, which causes memory usage to grow at each step.
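
Roughly, the failure mode looks like the hypothetical sketch below: reference hidden states get appended to a bank that is never emptied, so each step keeps the previous step's tensors (and their autograd graphs) alive. The class and method names here are illustrative, not the actual ReferenceNetAttention code:

```python
import torch

class ReferenceBank:
    """Hypothetical stand-in for the kind of per-step cache ReferenceNetAttention keeps."""
    def __init__(self):
        self.bank = []

    def write(self, hidden_states: torch.Tensor):
        # Storing the tensor as-is keeps the whole autograd graph of this step alive.
        self.bank.append(hidden_states)

    def clear(self):
        # Without dropping these references after every step,
        # GPU memory grows monotonically until OOM.
        self.bank.clear()

bank = ReferenceBank()
for step in range(3):
    h = torch.randn(2, 64, requires_grad=True)
    bank.write(h * 2)   # simulated reference features from the forward pass
    # ... loss.backward(); optimizer.step() ...
    bank.clear()        # the missing release that stops the per-step growth
```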

guoqincode commented 9 months ago

torchrun --nnodes=1 --nproc_per_node=2 train.py --config configs/training/train_stage_1.yaml with A100-80G is OOM

The bug is in the ReferenceNetAttention class: some CUDA tensors are not released, which causes memory usage to grow at each step.

Thanks for identifying the issue in the ReferenceNetAttention class. Could you open a pull request with your fix?

guoqincode commented 9 months ago

(Quoted the exchange above about the OOM, the ReferenceNetAttention leak, and the pull-request request.)

I fixed the bug, thank you!