megvii-research / megactor


GPU memory usage for training #24

Open Vincent-luo opened 3 months ago

Vincent-luo commented 3 months ago

I have a question about GPU memory usage during model training. I'm using a V100 32GB GPU, but I'm encountering "CUDA out of memory" errors when training the first stage with the default settings. This happens even when I set gradient_accumulation_steps to 1. I would like to know how much VRAM is actually needed for training. I'm not sure whether something is wrong with my setup, because your paper mentions that you also used V100 GPUs for training.

lhd777 commented 3 months ago

MegActor requires just 32 GB of VRAM for training. In fact, our training setup consisted of 8 V100 GPUs. If you encounter a GPU out-of-memory error, there could be several reasons, such as other processes occupying the GPU. The simplest way to reduce memory pressure is to reduce the video length, for example:
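A minimal sketch of the kind of change meant here, assuming the stage-1 training config exposes a `video_length` (frames per clip) field; the exact key names in the repo's config files may differ:

```yaml
# Hypothetical excerpt from a stage-1 training config.
# Fewer frames per clip -> smaller activations -> lower VRAM usage.
train_data:
  video_length: 8          # reduce from e.g. 16 if you hit OOM on a 32 GB V100
  resolution: 512
train_batch_size: 1        # keep the per-GPU batch size at 1
gradient_accumulation_steps: 1
```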

Or you can turn off the motion layer in the 2D training stage and then turn it on in the 3D training stage, as sketched below. (The open-source version differs slightly from our paper, because we found it also works to train the 2D and 3D parts at the same time. You can train MegActor in whichever way you prefer.)
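A rough sketch of that toggle, assuming the UNet config exposes a `use_motion_module` flag (the flag name is an assumption; check the repo's actual config keys):

```yaml
# 2D stage (hypothetical excerpt): train appearance layers only,
# with the temporal/motion layer disabled to save memory.
unet_additional_kwargs:
  use_motion_module: false

# 3D stage (hypothetical excerpt): re-enable the motion layer
# once the 2D weights have converged.
unet_additional_kwargs:
  use_motion_module: true
```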

Vincent-luo commented 3 months ago

Thanks for your quick reply! I'm not very familiar with the DeepSpeed settings. Should I uncomment these lines? It seems the training doesn't use `mixed_precision: fp16`: https://github.com/megvii-research/megactor/blob/16e7cdf059c93475cd8edbb1d597fb1954620333/configs/accelerate_deepspeed.yaml#L23-L35
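For reference, an Accelerate config that enables DeepSpeed with fp16 mixed precision looks roughly like this; the field names follow Accelerate's standard config schema, and the values are illustrative rather than the repo's exact settings:

```yaml
# Hypothetical minimal accelerate config with DeepSpeed + fp16.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2                    # shard optimizer states and gradients across GPUs
  offload_optimizer_device: cpu    # move optimizer states to CPU to save VRAM
  gradient_accumulation_steps: 1
mixed_precision: fp16              # run forward/backward in half precision
num_machines: 1
num_processes: 8                   # one process per GPU
```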

YZX-codesky commented 3 weeks ago

Hello, have you succeeded in replicating the results? When I was processing the dataset, there was a problem in the fourth part: the generated swapped.mp4 videos all have a size of 0. Can you share the videos you generated?