Memory required for training

linziqu commented 9 months ago

Thanks for your great work! I am trying to reproduce the MagicAnimate results on the UBC Fashion dataset, but the white background looks strange when I use the MagicAnimate checkpoints, as shown below (reference image).

Second, when I set the resolution to (512, 512), the video length to 8 and the batch size to 1, and train only the appearance net, the V100's memory is almost full. What training parameters did you use, and how can you fit a batch size of 2 on a V100? Is it because I only have 2 V100s?
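A rough way to see why a batch size of 1 already fills a V100: assuming MagicAnimate follows the usual Stable Diffusion 1.x setup (a VAE with 8x spatial downsampling and 4 latent channels) and an AnimateDiff-style (b, c, f, h, w) latent layout, the 2D spatial layers fold the frames into the batch. One 8-frame clip at 512x512 then costs roughly as much activation memory as an image batch of 8. The sketch below only computes shapes; the helper names are hypothetical.

```python
def video_latent_shape(batch: int, frames: int, height: int, width: int,
                       latent_channels: int = 4, vae_factor: int = 8) -> tuple:
    """(b, c, f, h, w) latent shape, assuming an SD-1.x VAE (8x down, 4 channels)."""
    if height % vae_factor or width % vae_factor:
        raise ValueError("height and width must be multiples of vae_factor")
    return (batch, latent_channels, frames, height // vae_factor, width // vae_factor)


def spatial_batch(latent_shape: tuple) -> int:
    """Images seen by the 2D (spatial) layers once frames are folded into the batch,
    i.e. the b*f of an AnimateDiff-style 'b c f h w -> (b f) c h w' rearrange."""
    b, _, f, _, _ = latent_shape
    return b * f


shape = video_latent_shape(batch=1, frames=8, height=512, width=512)
print(shape)                 # (1, 4, 8, 64, 64)
print(spatial_batch(shape))  # 8
```

Under these assumptions, doubling the batch size to 2 doubles the effective image batch to 16, which is why batch 2 needs noticeably more headroom than batch 1.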

jinxixiang commented 9 months ago

Hi,

From my understanding, the original MagicAnimate weights do not perform robustly on the fashion dataset.

As for memory, more GPUs might help.
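One reason more GPUs can help is ZeRO partitioning in DeepSpeed. Below is a back-of-the-envelope sketch using the mixed-precision Adam accounting from the ZeRO paper (2 bytes of fp16 weights, 2 bytes of fp16 gradients and 12 bytes of fp32 optimizer state per trainable parameter), covering only ZeRO stages 0 to 2. It ignores activations, so treat the numbers as a lower bound. The 860M parameter count is a hypothetical round figure (roughly the size of an SD 1.5 UNet), not a measured value for this repo, and frozen modules are assumed to be held in fp16.

```python
GIB = 1024 ** 3


def model_state_gib(trainable_params: int, frozen_params: int = 0,
                    num_gpus: int = 1, zero_stage: int = 0,
                    offload_optimizer: bool = False) -> float:
    """Rough per-GPU model-state memory in GiB (activations NOT included)."""
    if zero_stage not in (0, 1, 2):
        raise ValueError("this sketch only models ZeRO stages 0, 1 and 2")
    if offload_optimizer and zero_stage == 0:
        raise ValueError("optimizer offload requires ZeRO (stage >= 1)")
    weights = 2 * trainable_params   # fp16 weights, replicated on every GPU
    grads = 2 * trainable_params     # fp16 gradients
    optim = 12 * trainable_params    # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:
        optim /= num_gpus            # stage 1 shards optimizer states
    if zero_stage >= 2:
        grads /= num_gpus            # stage 2 also shards gradients
    if offload_optimizer:
        optim = 0                    # optimizer states live in CPU RAM instead
    frozen = 2 * frozen_params       # frozen fp16 modules: weights only
    return (weights + grads + optim + frozen) / GIB


# Hypothetical sizes: 860M trainable + 860M frozen parameters.
p = 860_000_000
print(model_state_gib(p, p))                                # ≈ 14.4
print(model_state_gib(p, p, num_gpus=2, zero_stage=2))      # ≈ 8.8
print(model_state_gib(p, p, num_gpus=2, zero_stage=2,
                      offload_optimizer=True))              # ≈ 4.0
```

Under these assumptions, sharding across a second GPU and offloading the optimizer free most of the model-state memory, leaving more room for activations.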

CacacaLalala commented 8 months ago

Hi! I also ran into the GPU memory problem. The code trains in FP16, but the parameter gradients become NaN, while FP32 works fine. However, a V100 does not have enough memory to train the appearance encoder, ControlNet and UNet together. Did you run into the same problem?
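FP16 gradients that turn into inf/NaN are often caused by values overflowing the FP16 maximum of 65504, and the usual mitigation is dynamic loss scaling (as done by `torch.cuda.amp.GradScaler` or DeepSpeed's `fp16` loss scaler). The thread does not say whether the training script uses a scaler, and NaNs can also have other causes, so this is only a pure-Python sketch of the update rule using GradScaler's default hyperparameters. `to_fp16` and `DynamicLossScaler` are hypothetical names.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision; values too large become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


class DynamicLossScaler:
    """Minimal dynamic loss scaling, modeled on GradScaler's update rule."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, true_grads):
        """Simulate one FP16 backward pass; return unscaled grads,
        or None if the optimizer step must be skipped (inf/NaN found)."""
        scaled = [to_fp16(g * self.scale) for g in true_grads]  # fp16 backward
        grads = [g / self.scale for g in scaled]                 # unscale
        if not all(math.isfinite(g) for g in grads):
            # Overflow: skip this update and shrink the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return None
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            # A long run of finite steps: try a larger scale again.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return grads


scaler = DynamicLossScaler()
for _ in range(3):
    print(scaler.scale, scaler.step([2.0, -0.5]))
```

Here the first two steps overflow and are skipped, the scale backs off to 16384, and the third step passes the gradients through unchanged. Skipping a step costs one update, whereas applying an inf/NaN gradient would corrupt the weights.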

linziqu commented 8 months ago

Hi! I get the same error (NaN gradients) when I use only one V100.

When I use two V100s with DeepSpeed, the issue does not occur. To save memory, I added "offload_optimizer": { "device": "cpu", "pin_memory": true } to zero_config.json.
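In a DeepSpeed config, the `offload_optimizer` setting sits under the `zero_optimization` section. A minimal sketch that builds such a config in Python: only the `offload_optimizer` values come from this comment, while the ZeRO stage, batch size and `fp16` block are assumptions to adapt to the repo's own zero_config.json. `build_zero_config` and `write_zero_config` are hypothetical helpers.

```python
import json


def build_zero_config(micro_batch_per_gpu: int = 1, zero_stage: int = 2) -> dict:
    """Sketch of a DeepSpeed config with CPU optimizer offload (values assumed)."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": 1,
        "fp16": {
            "enabled": True,
            "loss_scale": 0,            # 0 selects dynamic loss scaling in DeepSpeed
            "initial_scale_power": 16,  # initial scale of 2**16
        },
        "zero_optimization": {
            "stage": zero_stage,
            # The setting from this comment: optimizer states in pinned CPU RAM.
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
    }


def write_zero_config(path: str = "zero_config.json") -> None:
    """Write the sketch config as JSON (Python True becomes JSON true)."""
    with open(path, "w") as f:
        json.dump(build_zero_config(), f, indent=2)
```

Offloading trades GPU memory for CPU RAM and slower optimizer steps; `pin_memory` speeds up host-device copies at the cost of page-locked host memory.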

From: Heyis @.> Sent: 20 December 2023 22:21 To: jinxixiang/magic_animate_unofficial @.> Cc: QU Linzi @.>; Author @.> Subject: [Ext] Re: [jinxixiang/magic_animate_unofficial] Memory required for training (Issue #3)

Thanks for your great work! I am trying to reproduce the result of magicanimate with ubcfashion dataset, but I find the white background is strange using the checkpoints from magicanimate as shown below, (reference image) [image] https://private-user-images.githubusercontent.com/90315942/291025462-332657f4-dc42-4c4a-9fc5-ec0995428a46.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDMwODIwOTksIm5iZiI6MTcwMzA4MTc5OSwicGF0aCI6Ii85MDMxNTk0Mi8yOTEwMjU0NjItMzMyNjU3ZjQtZGM0Mi00YzRhLTlmYzUtZWMwOTk1NDI4YTQ2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjIwVDE0MTYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWZmZTVlYTRmYTdkMDMxYzI0YmJmMzczNzRjODc2ZDdmNTliOTQ2YjgyMTg2YTAyN2M1NTg0ZjVjZWI2ZGQzYWYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.jxyrHeKYqRjBIVfx09arrL7iBZt306vocc-UT123YE0

[image]https://private-user-images.githubusercontent.com/90315942/291025395-4425d9e2-2942-479e-94c7-b6e34c6450c3.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTEiLCJleHAiOjE3MDMwODIwOTksIm5iZiI6MTcwMzA4MTc5OSwicGF0aCI6Ii85MDMxNTk0Mi8yOTEwMjUzOTUtNDQyNWQ5ZTItMjk0Mi00NzllLTk0YzctYjZlMzRjNjQ1MGMzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFJV05KWUFYNENTVkVINTNBJTJGMjAyMzEyMjAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjMxMjIwVDE0MTYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPThjMmRjZTViMTNiMDhlNjEzNWJhZWNjMDE0N2M1NTRiZmExMzAyMWY3NzBmMGJmNGJiZGJhOWY4ODJjYjI0MjQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.OJ3H-LCTnHdxEavY_As6TI7I6dpUHRTaLuqkZ-cbPFU
