guoqincode / Open-AnimateAnyone

Unofficial Implementation of Animate Anyone
2.89k stars 233 forks source link

train stage 1 oom #77

Closed maobj closed 7 months ago

maobj commented 8 months ago

Great work! We tried to train the stage 1 model but hit an OOM error. We are using four 80 GB A100 GPUs, and the config YAML is:

train_batch_size: 4
sample_size: 512  # for 40G, 256
sample_stride: 4
sample_n_frames: 16
mixed_precision_training: False
enable_xformers_memory_efficient_attention: False

We run the code with:

torchrun --nnodes=1 --nproc_per_node=4 train_hack.py --config configs/training/train_stage_1.yaml

Any idea how to solve this problem?

guoqincode commented 8 months ago

That seems strange; I train at 512 resolution with a batch size of 8 on a single card.

maobj commented 7 months ago

> That seems strange; I train at 512 resolution with a batch size of 8 on a single card.

After I set mixed_precision_training and enable_xformers_memory_efficient_attention to True, training runs normally. But there is another issue: in train_hack.py the PoseGuider output is set to 320 channels (poseguider = PoseGuider(noise_latent_channels=320)), yet at inference time (python3 -m pipelines.animation_stage_1 --config configs/prompts/animation_stage_1.yaml) the PoseGuider is loaded with 4 channels (model = PoseGuider(noise_latent_channels=4)), and the way it is used later in the pipeline also corresponds to the 4-channel case (latent_model_input = self.scheduler.scale_model_input(latent_model_input, t) + latents_pose). Why is that? It looks like your hack UNet structure is not used at inference time?
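For reference, the two settings that resolved the OOM would look like this in configs/training/train_stage_1.yaml. This is a sketch: the field names come from this thread, but the exact layout of the repo's config file is assumed.

```yaml
train_batch_size: 4
sample_size: 512
sample_stride: 4
sample_n_frames: 16
mixed_precision_training: true                    # fp16 training roughly halves activation memory
enable_xformers_memory_efficient_attention: true  # memory-efficient attention kernels from xformers
```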

guoqincode commented 7 months ago

You can use the Gradio-based inference in the latest demo folder; it is more convenient.

hkunzhe commented 7 months ago

@guoqincode, the stage 1 batch size is 64 in the original paper. Even with enable_xformers_memory_efficient_attention=True on 8 A100 80G GPUs, I can only reach a train batch size of 32. Does this affect the effectiveness of stage 1 training?
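One common workaround (not something this repo documents; a generic sketch) is gradient accumulation: step the optimizer once every two micro-batches of 32 to recover the paper's effective batch size of 64. The numpy toy model below shows why this is equivalent: with equally sized micro-batches, the average of the per-micro-batch gradients equals the full-batch gradient. In the actual train_hack.py loop this would correspond to calling loss.backward() per micro-batch and stepping the optimizer every accumulation_steps iterations (accumulation_steps is a hypothetical parameter name).

```python
# Toy check: gradient accumulated over two micro-batches of 32 equals the
# gradient of one full batch of 64, for a scalar linear model w*x with MSE loss.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)                       # one "full" batch of 64 samples
y = 3.0 * x + rng.normal(scale=0.1, size=64)  # noisy targets
w = 0.5                                       # scalar model parameter

def grad(w, xb, yb):
    # d/dw of mean((w*x - y)^2) over a micro-batch
    return np.mean(2.0 * (w * xb - yb) * xb)

g_full = grad(w, x, y)                                      # batch size 64
g_accum = (grad(w, x[:32], y[:32]) + grad(w, x[32:], y[32:])) / 2.0

assert np.isclose(g_full, g_accum)  # same parameter update either way
```

Note that this only matches exactly when the loss is a mean over the batch and the micro-batches are equal in size; batch-size-dependent layers (e.g. batch norm statistics) still see 32 samples at a time.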

LeonJoe13 commented 7 months ago

Hi, I ran into the same problem. Have you solved it?

LeonJoe13 commented 7 months ago


Solved: when initializing the PoseGuider, you need to initialize it with 320 channels (noise_latent_channels=320).
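In other words, the PoseGuider constructed for inference must use the same channel count as at training time, or the checkpoint weights cannot load. The stand-in class below is hypothetical (the real PoseGuider lives in this repo); it only illustrates the shape check that makes noise_latent_channels=4 fail against a 320-channel checkpoint:

```python
# Minimal stand-in for the repo's PoseGuider, just to show the channel mismatch.
import numpy as np

class PoseGuider:
    def __init__(self, noise_latent_channels):
        # final projection: 3 input (pose image) channels -> noise_latent_channels
        self.proj = np.zeros((noise_latent_channels, 3))

    def load_state_dict(self, state):
        if state["proj"].shape != self.proj.shape:
            raise ValueError(f"shape mismatch: checkpoint {state['proj'].shape} "
                             f"vs model {self.proj.shape}")
        self.proj = state["proj"]

# checkpoint produced by training with noise_latent_channels=320 (as in train_hack.py)
ckpt = {"proj": np.ones((320, 3))}

PoseGuider(noise_latent_channels=320).load_state_dict(ckpt)    # loads fine

try:
    PoseGuider(noise_latent_channels=4).load_state_dict(ckpt)  # fails
except ValueError as e:
    print("expected failure:", e)
```

With 320 channels the pose features are consumed inside the hack UNet rather than added directly to the 4-channel latents, which is why the old `latent_model_input + latents_pose` path no longer applies.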