guoqincode / Open-AnimateAnyone

Unofficial Implementation of Animate Anyone

stage2 training error #46

Closed linxiang4200 closed 8 months ago

linxiang4200 commented 8 months ago

Thank you for your work.

During the second stage of training I keep hitting CUDA out-of-memory errors. My GPU has 80 GB of memory, and the same error occurs whether I train on a single card or on multiple cards, even with --train_batch_size set to 1. What went wrong?

error message:

```
Traceback (most recent call last):
  File "/home/work/animate-anyone/train_2nd_stage.py", line 919, in <module>
    main(args)
  File "/home/work/animate-anyone/train_2nd_stage.py", line 823, in main
    model_pred = unet(
  File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 632, in forward
    return model_forward(*args, **kwargs)
  File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 620, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/work/animate-anyone/animate_anyone/models/unet_3d_condition.py", line 1011, in forward
    sample = upsample_block(
  File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/work/animate-anyone/animate_anyone/models/unet_3d_blocks.py", line 901, in forward
    hidden_states = resnet(hidden_states, temb, scale=lora_scale)
  File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/work/animate-anyone/animate_anyone/models/resnet.py", line 340, in forward
    hidden_states = self.norm1(hidden_states)
  File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 273, in forward
    return F.group_norm(
  File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/functional.py", line 2530, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 810.00 MiB (GPU 0; 79.35 GiB total capacity; 76.87 GiB already allocated; 64.19 MiB free; 77.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
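For anyone hitting the same error: the end of the traceback already hints at the usual mitigations. Below is a minimal sketch of how they are typically applied in a PyTorch/diffusers setup, assuming a diffusers-style UNet and an accelerate training loop; the method names on this repo's 3D UNet are an assumption and may differ from what train_2nd_stage.py actually exposes.

```python
import os

# Reduce allocator fragmentation, as suggested at the end of the OOM message.
# Must be set before the first CUDA allocation (exporting it in the shell also works).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"


def reduce_unet_memory(unet):
    """Apply common memory-saving switches to a diffusers-style UNet.

    Hypothetical helper: the 3D UNet in this repo may not expose these methods,
    hence the hasattr guards.
    """
    # Recompute activations in the backward pass instead of keeping them all in memory.
    if hasattr(unet, "enable_gradient_checkpointing"):
        unet.enable_gradient_checkpointing()
    # Memory-efficient attention (requires xformers to be installed).
    if hasattr(unet, "enable_xformers_memory_efficient_attention"):
        unet.enable_xformers_memory_efficient_attention()
```

Lowering the number of frames per training clip or the training resolution also shrinks the activations of the 3D UNet blocks where the traceback runs out of memory.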

guoqincode commented 8 months ago

It looks like you wrote the training code yourself; you are welcome to open a pull request. If you find my project helpful, please give it a star, thanks.