open-mmlab / PIA

[CVPR 2024] PIA, your Personalized Image Animator. Animate your images by text prompt, combining with DreamBooth, achieving stunning videos.
https://pi-animator.github.io/
Apache License 2.0

About the weight and bias of Conv_in #32

Closed Tianhao-Qi closed 6 months ago

Tianhao-Qi commented 6 months ago

@zengyh1900 @hellock @eltociear As your paper mentions, you keep the first 4 channels of the conv_in layer (along the input-channel dimension) frozen. However, if you compare the weights and biases of the SD v1.5 and PIA checkpoints you provide, they are actually not the same!
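
A minimal sketch of how such a comparison could be run (the checkpoint path, the Hugging Face model id, and the state-dict key names are assumptions about the released file layout, not taken from the repo):

```python
import torch
from diffusers import UNet2DConditionModel

# Original SD 1.5 UNet: conv_in.weight has shape [320, 4, 3, 3]
sd_unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
sd_w = sd_unet.conv_in.weight.data
sd_b = sd_unet.conv_in.bias.data

# PIA checkpoint, assumed to be a plain torch state dict (adjust path/keys as needed)
pia_state = torch.load("pia.ckpt", map_location="cpu")
pia_w = pia_state["conv_in.weight"]  # assumed key; conv_in is expanded to more input channels
pia_b = pia_state["conv_in.bias"]    # assumed key

# Compare only the first 4 input channels, which the paper describes as frozen
print(torch.allclose(sd_w, pia_w[:, :4]))  # False, per this issue
print(torch.allclose(sd_b, pia_b))         # False, per this issue
```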

LeoXing1996 commented 6 months ago

Hey @Tianhao-Qi, before video training, we fine-tune the image UNet on the WebVid dataset. The first 4 channels come from the fine-tuned UNet, which makes them different from the original SD 1.5 ones.
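
Illustrative sketch only of what "the first 4 channels come from the fine-tuned UNet" means for conv_in: the fine-tuned 4-channel weights are copied into the first input channels of the wider layer. The total channel count (9 here) and the zero-init of the extra channels are assumptions for illustration, not the repo's exact code.

```python
import torch
import torch.nn as nn

old_conv = nn.Conv2d(4, 320, kernel_size=3, padding=1)  # conv_in of the fine-tuned image UNet
new_conv = nn.Conv2d(9, 320, kernel_size=3, padding=1)  # conv_in with condition channels appended

with torch.no_grad():
    new_conv.weight.zero_()                   # extra condition channels start at zero
    new_conv.weight[:, :4] = old_conv.weight  # first 4 input channels: fine-tuned weights, kept frozen
    new_conv.bias.copy_(old_conv.bias)
```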

Tianhao-Qi commented 6 months ago

Thanks for your reply. What's the benefit of fine-tuning the image UNet on the WebVid dataset? I haven't seen any mention of this in your paper.

LeoXing1996 commented 6 months ago

@Tianhao-Qi, we introduced our training method in Section 3.3. Following the training strategy of AnimateDiff, we first train a domain adapter on WebVid. As AnimateDiff has not released the weights for their LoRA version of the domain adapter, we directly fine-tune the entire UNet, transforming it into a "domain adapter" for WebVid.
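
For readers unfamiliar with the two options being contrasted, here is a rough, hedged sketch (not the repo's code) of an AnimateDiff-style LoRA domain adapter versus the full fine-tune used here:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA-style adapter: the base weight stays frozen, only the low-rank A/B matrices train."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # the LoRA delta starts at zero, so the adapter initially adds nothing

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

# The alternative described in this thread: since no LoRA weights were released,
# the entire image UNet is fine-tuned on WebVid and itself serves as the "domain adapter".
# for p in unet.parameters():
#     p.requires_grad = True
```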

ernestchu commented 1 month ago

@LeoXing1996 If you used the fine-tuned UNet, doesn't it mean that the generated videos inherit the low visual quality of the video dataset?