Some trouble with running the training code

JunbongJang commented 9 months ago

Thank you for your great work! Since your training code takes pose as input, I had the impression that it is closer to another concurrent work, Animate Anyone.

arXiv paper: https://arxiv.org/pdf/2311.17117.pdf

Anyway, I am trying to train on an RTX 3090 GPU, but I get an out-of-memory error at the line model, optimizer = accelerator.prepare(model, optimizer) in train.py.

So I removed that line and trained the model for about 70,000 steps, but I get the same validation grid video at every validation step. I wonder whether the weights are being updated at all.

[Attached video: sample-70000]

Did you also experience a similar problem?

jinxixiang commented 9 months ago

Hi, thank you for your interest.

It's crucial not to remove accelerator.prepare(), as it works together with accelerator.backward() to apply the gradient updates. Without it, your model won't be optimized at all. For more details, see the Hugging Face Accelerate documentation at https://huggingface.co/docs/accelerate/index.

If you're using a single 3090, you may encounter out-of-memory errors. DeepSpeed ZeRO stage 2 reduces per-GPU memory by partitioning the optimizer states and gradients across GPUs; on a single GPU there is nothing to partition them across, so the full optimizer state has to fit on one card. We therefore recommend using multiple GPUs.
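To make the effect of ZeRO-2 concrete, here is a rough sketch of per-GPU memory for model states, using the accounting from the ZeRO paper for mixed-precision Adam: fp16 weights (2 bytes/param) stay replicated, while fp16 gradients (2 bytes/param) and optimizer states (K = 12 bytes/param) are split across GPUs. Activations and temporary buffers are not counted, and the 1-billion-parameter model below is a hypothetical size, not the actual size of this model.

```python
def zero2_model_state_gb(num_params: float, num_gpus: int, k: int = 12) -> float:
    """Approximate per-GPU model-state memory (GB) under ZeRO stage 2.

    Excludes activations, so real usage is higher.
    """
    replicated = 2 * num_params                    # fp16 weights on every GPU
    partitioned = (2 + k) * num_params / num_gpus  # fp16 grads + optimizer states
    return (replicated + partitioned) / 1e9


# Hypothetical 1B-parameter model:
print(zero2_model_state_gb(1e9, num_gpus=1))   # 16.0
print(zero2_model_state_gb(1e9, num_gpus=16))  # 2.875
```

With one GPU the partitioned term is not divided at all, which is why ZeRO-2 gives little relief on a single 24 GB card once activations are added on top.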

For your reference, our setup includes 16 V100-32G GPUs, with a batch size of 2 per GPU.
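For comparison, the global batch size of that setup works out as below. This assumes no gradient accumulation, since the thread does not state the accumulation setting used in train.py; the helper name is hypothetical.

```python
def global_batch_size(per_gpu: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    """Number of samples that contribute to each optimizer step."""
    return per_gpu * num_gpus * grad_accum_steps


print(global_batch_size(2, 16))                       # 32
print(global_batch_size(2, 1, grad_accum_steps=16))   # 32
```

Gradient accumulation on fewer GPUs can match the same global batch, but it does not lower per-GPU memory, because each GPU still processes 2 samples per forward pass and no longer shares the ZeRO-2 partitions.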

We've also included some intermediate results at step 10,000. There are some artifacts, but the model is still training, and we plan to release the weights at a later date.

[Attached videos: video_1224_clip_2, video_361_clip_8, video_1228_clip_1]

JunbongJang commented 9 months ago

Thank you for the quick reply! Your intermediate results are encouraging. I will try a cloud GPU with more memory.

jinxixiang/magic_animate_unofficial, issue #1