Tensor size mismatch in using clip-vit-large-patch14

zhenzhiwang commented 9 months ago

Hi,

Thanks for sharing your implementation; it really helps the community reproduce Animate Anyone. When I tried to train the network with your code, I found that in referencenet_attention the hidden state size of the Stable Diffusion UNet is 768, while the CLIP image feature extracted from clip-vit-large-patch14 is 1024, which causes a size mismatch in the network's forward pass (the hidden size of clip-vit-base-patch32, by contrast, is 768). Your config YAML file used clip-vit-base-patch32 and was recently changed to clip-vit-large-patch14, and you mentioned in another issue that you use clip-vit-large-patch14. Could you explain in more detail how your code works with clip-vit-large-patch14? I get errors when I run your training code directly with clip-vit-large-patch14.
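
For illustration, the mismatch described above can be reproduced in isolation. This is only a sketch and does not use the repo's code: `to_k` is a hypothetical stand-in for a cross-attention key projection that expects 768-dimensional context, the output width of 320 is arbitrary, and the random tensors stand in for CLIP image features (50 tokens of width 768 for clip-vit-base-patch32, 257 tokens of width 1024 for clip-vit-large-patch14 at 224x224 input).

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a cross-attention key projection whose
# input width is the UNet's cross-attention dimension (768).
to_k = nn.Linear(768, 320, bias=False)

# Random stand-ins for CLIP image features (not real encoder output).
base_features = torch.randn(1, 50, 768)     # clip-vit-base-patch32 width
large_features = torch.randn(1, 257, 1024)  # clip-vit-large-patch14 width

# The 768-wide features pass through without error.
base_keys = to_k(base_features)
print(base_keys.shape)

# The 1024-wide features do not match the layer's input width.
try:
    to_k(large_features)
    mismatch = False
except RuntimeError as err:
    mismatch = True
    print("shape error:", err)
```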

Looking forward to your reply! Thanks again for your effort.

guoqincode commented 9 months ago

I tried two Image CLIP Encoders:

clip-base
clip-large

If you use clip-large, you need to add a Linear layer. You can refer to: https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter/ip_adapter.py#L28
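
A minimal sketch of such a projection, assuming PyTorch, is shown here. It is not the code from this repo or from IP-Adapter: the class name `ClipFeatureProjector`, the default widths (1024 in, 768 out), and the added LayerNorm are illustrative assumptions, and the random tensor stands in for real CLIP output.

```python
import torch
import torch.nn as nn


class ClipFeatureProjector(nn.Module):
    """Hypothetical helper: maps CLIP image features to the UNet cross-attention width."""

    def __init__(self, clip_dim: int = 1024, cross_attention_dim: int = 768):
        super().__init__()
        # Linear layer that changes the feature width (assumed 1024 -> 768).
        self.proj = nn.Linear(clip_dim, cross_attention_dim)
        # Normalization after the projection is an assumption of this sketch.
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, tokens, clip_dim)
        return self.norm(self.proj(clip_features))


projector = ClipFeatureProjector()
# Random stand-in for clip-large hidden states: 257 tokens of width 1024.
dummy_features = torch.randn(2, 257, 1024)
projected = projector(dummy_features)
print(projected.shape)  # torch.Size([2, 257, 768])
```

The projected tensor could then be passed as the cross-attention context in place of the raw CLIP features. The projector's weights are new parameters, so they would need to be trained together with the rest of the network.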

zhenzhiwang commented 9 months ago

Thanks for your quick reply!

That matches my understanding. I was wondering why you use clip-large without any linear projection layer (as your code shows).

By the way, could you share the total number of iterations, learning rate, training batch size, and training image resolution in stage 1 that lead to satisfactory results? I used your code and training config directly (with minimal modifications, such as lr=1e-5 as the Animate Anyone paper says) and got meaningless images similar to https://github.com/guoqincode/AnimateAnyone-unofficial/issues/14#issuecomment-1855521920 on both the UBC and TikTok datasets. Could you update this repo with your latest code and config that lead to satisfactory results?

Thanks a lot!

guoqincode commented 9 months ago

You can email me at guoqin@stu.pku.edu.cn and we can add each other on WeChat. My current code is slightly different from that in the repo. My current machine cannot connect to the external network, so I cannot update the repo promptly.
