johndpope / Emote-hack

Emote Portrait Alive - using AI to reverse engineer code from the white paper. (abandoned)
https://github.com/johndpope/VASA-1-hack

I found some bugs and questions. #30

Open wangshiwen-ai opened 3 months ago

wangshiwen-ai commented 3 months ago

The key point of the dimension mismatch is that cross_attention_dim defaults to 768, but we do not need the text embedding. So I think we may modify the cross-attention layers and ignore the cross attention in the reference net.

In the denoising net, I think it should take the noisy image latents as input and cross-attend with the reference features to reconstruct the reference images. And the UNet in stage 1 may be all 2D.
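For example, a minimal sketch of what "noisy latents in, cross attention over reference features, no text embedding" could look like (this assumes diffusers' UNet2DConditionModel; the 77-token reference-feature tensor is just a placeholder, not the repo's actual code):

```python
# Sketch: condition a 2D UNet on reference features instead of CLIP text embeddings.
import torch
from diffusers import UNet2DConditionModel

# cross_attention_dim is set to the width of the reference features we feed in,
# so the text encoder (and its 768-dim text embeddings) is never needed.
unet = UNet2DConditionModel(
    sample_size=64,            # 512x512 images -> 64x64 latents with the SD VAE
    in_channels=4,
    out_channels=4,
    cross_attention_dim=768,   # keep 768 only if the reference features are 768-wide
)

noisy_latents = torch.randn(1, 4, 64, 64)        # noisy image latents
timesteps = torch.tensor([10])
reference_features = torch.randn(1, 77, 768)     # stand-in for ReferenceNet features

# The UNet cross-attends to whatever is passed as encoder_hidden_states,
# so reference features can simply replace the text embedding here.
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=reference_features).sample
print(noise_pred.shape)  # torch.Size([1, 4, 64, 64])
```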

johndpope commented 3 months ago

Hi @wangshiwen-ai

the referencenet is a mess at the moment. I just pushed some code and it's a problem.

the referencenet - I collapsed the training part so it just extracts feature maps. But now I'm having doubts.
from the image - there's no full pass through, but then reading this - https://blog.metaphysic.ai/plausible-stable-diffusion-video-from-a-single-image/ - the extraction is happening on their trained model (trained on millions of images), so we still need a training stage. They are operating in tandem.

I'm not clear on

https://github.com/johndpope/Emote-hack/blob/main/train_stage_1_0.py#L194 (this needs updating)
https://github.com/johndpope/Emote-hack/blob/main/Net.py#L53C4-L65C31

I had this function to extract them: https://github.com/johndpope/Emote-hack/issues/27

q) Do we need the motion frames to be black and white? (DiffusedHeads did this. I'm going through their forks to find a working remote branch.)
q) Should the images be resized to 256x256 from 512? I think not, right? That results in a much smaller VAE-encoded image - 32x32? Reading this - https://blog.metaphysic.ai/plausible-stable-diffusion-video-from-a-single-image/ - it seems like the videos are 512x512.
q) Concatenating the reference + motion frames - I drafted this (sketch below), but I need to successfully pass the whole tensor through the first layer of ReferenceNet to just spit out features - then it's on to the backbone and 'injection into the resolution stratum'.
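Rough sketch of that concatenation (assuming the SD VAE loaded from diffusers; the checkpoint name and the motion-frame count are placeholders). It also shows why 512x512 inputs give 64x64 latents and 256x256 would give 32x32:

```python
# Sketch: encode reference + motion frames with the SD VAE and concatenate latents
# along the channel dim so the first ReferenceNet conv sees a single tensor.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.requires_grad_(False)

reference = torch.randn(1, 3, 512, 512)   # reference frame, kept at 512x512
motion = torch.randn(4, 3, 512, 512)      # n previous motion frames (placeholder n=4)

with torch.no_grad():
    # SD's VAE downsamples by 8x: 512x512 -> 64x64 latents (256x256 -> 32x32)
    ref_latent = vae.encode(reference).latent_dist.sample() * vae.config.scaling_factor
    motion_latents = vae.encode(motion).latent_dist.sample() * vae.config.scaling_factor

# (1, 4, 64, 64) + (1, 4*4, 64, 64) -> (1, 20, 64, 64); ReferenceNet's first conv
# would need to accept 20 input channels for this layout.
x = torch.cat([ref_latent, motion_latents.flatten(0, 1).unsqueeze(0)], dim=1)
print(x.shape)  # torch.Size([1, 20, 64, 64])
```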

Related - this is saying to throw out the Reference Attention and use Efficient Multi-Head Self-Attention (EMSA): https://github.com/johndpope/Emote-hack/issues/16
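For reference, a rough sketch of the EMSA idea from the ResT paper (all dims and names here are illustrative, not the repo's code): keys/values are spatially reduced with a strided depthwise conv before attention, which cuts the attention cost.

```python
# Sketch of Efficient Multi-Head Self-Attention (EMSA): shrink K/V spatially
# with a strided depthwise conv, then do standard multi-head attention.
import torch
import torch.nn as nn

class EMSA(nn.Module):
    def __init__(self, dim=320, heads=8, reduction=2):
        super().__init__()
        self.heads = heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        # depthwise strided conv reduces the spatial size of keys/values
        self.sr = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) feature tokens from a UNet block
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.heads, c // self.heads).transpose(1, 2)

        x_ = x.transpose(1, 2).reshape(b, c, h, w)
        x_ = self.sr(x_).reshape(b, c, -1).transpose(1, 2)        # fewer K/V tokens
        k, v = self.kv(self.norm(x_)).chunk(2, dim=-1)
        k = k.reshape(b, -1, self.heads, c // self.heads).transpose(1, 2)
        v = v.reshape(b, -1, self.heads, c // self.heads).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) * (c // self.heads) ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

emsa = EMSA()
tokens = torch.randn(2, 64 * 64, 320)
print(emsa(tokens, 64, 64).shape)  # torch.Size([2, 4096, 320])
```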

johndpope commented 3 months ago

crazy day - some open-ended questions I don't have answers to. Following up on a query from @sarperkilic, I was focusing on ReferenceNet today. There are 3 different pathways - not sure which is the best route.

Some observations

architecture

Is the diagram wrong? Does ReferenceNet actually use self-attention,
or is that just the nature of the UNet? q) why didn't they just say UNet - SD instead of ReferenceNet?

There's a way to inject self-attention into the UNet - I pushed it as a new model. https://github.com/johndpope/Emote-hack/commit/908ad9ab59b9307390595388466659d2475e3a01

or there's a simpler way: upgrade ReferenceNet to use self-attention layers. https://github.com/johndpope/Emote-hack/blob/main/Net.py#L66

When I looked at the VideoNet by @jimmyl02 - it introduces reference attention when the model loads. https://github.com/jimmyl02/animate/blob/main/animate-anyone/models/videonet.py
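Roughly, that reference-attention injection looks like this (an illustrative sketch, not jimmyl02's actual code): the denoising UNet's self-attention attends over its own tokens concatenated with the cached ReferenceNet tokens from the matching block.

```python
# Sketch: self-attention that also attends over cached ReferenceNet hidden states.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceSelfAttention(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, hidden_states, reference_states=None):
        # hidden_states: (batch, tokens, dim) from the denoising UNet block
        # reference_states: (batch, tokens, dim) cached from the same ReferenceNet block
        kv_input = hidden_states
        if reference_states is not None:
            kv_input = torch.cat([hidden_states, reference_states], dim=1)

        b, n, c = hidden_states.shape
        d = c // self.heads
        q = self.to_q(hidden_states).view(b, -1, self.heads, d).transpose(1, 2)
        k = self.to_k(kv_input).view(b, -1, self.heads, d).transpose(1, 2)
        v = self.to_v(kv_input).view(b, -1, self.heads, d).transpose(1, 2)

        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, c)
        return self.to_out(out)

attn = ReferenceSelfAttention()
x = torch.randn(1, 4096, 320)      # denoising UNet tokens
ref = torch.randn(1, 4096, 320)    # ReferenceNet tokens at the same resolution
print(attn(x, ref).shape)          # torch.Size([1, 4096, 320])
```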

the other option from earlier is just to extract the features from the UNet - that is just a function call. https://github.com/johndpope/Emote-hack/blob/main/Net.py#L87

If ReferenceNet doesn't need updating, then it should be frozen too. This doc explains quite well that the ReferenceNet model was trained on a huge amount of data.

https://blog.metaphysic.ai/plausible-stable-diffusion-video-from-a-single-image/

so I introduced that as a preprocessing stage - stage 1 (but not matching up to the diagram): https://github.com/johndpope/Emote-hack/blob/main/train_stage_1_referencenet.py
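If it really is a frozen, pretrained feature extractor, the preprocessing view amounts to something like this (a sketch; `referencenet` and `reference_latents` stand in for whatever the repo actually loads):

```python
# Sketch: treat ReferenceNet as a fixed feature extractor - freeze it, run once, cache.
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze a pretrained module so no gradients flow into it during training."""
    module.requires_grad_(False)
    module.eval()
    return module

# usage (placeholders for the repo's objects):
# referencenet = freeze(referencenet)
# with torch.no_grad():
#     reference_features = referencenet(reference_latents)  # extract once, reuse per clip
```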

sarperkilic commented 3 months ago

Hi,

one thing I am not sure,

the way you pushed it, latent_representations is passed through the whole reference UNet: https://github.com/johndpope/Emote-hack/commit/908ad9ab59b9307390595388466659d2475e3a01

but the diagram from the paper shows that the input to the backbone network should come from the first down-block layer of the reference net, not from the end.
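One way to grab exactly that is a forward hook on the first down block (a sketch assuming diffusers' UNet2DConditionModel; the block index and shapes are illustrative):

```python
# Sketch: tap the first down block of the reference UNet instead of using its final output.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel(sample_size=64, in_channels=4, out_channels=4,
                            cross_attention_dim=768)

captured = {}

def grab_first_down_block(module, inputs, output):
    # CrossAttnDownBlock2D returns (hidden_states, residuals); keep the hidden states
    captured["first_down_block"] = output[0]

hook = unet.down_blocks[0].register_forward_hook(grab_first_down_block)

latents = torch.randn(1, 4, 64, 64)
cond = torch.randn(1, 77, 768)   # placeholder conditioning
with torch.no_grad():
    unet(latents, torch.tensor([0]), encoder_hidden_states=cond)

hook.remove()
print(captured["first_down_block"].shape)  # e.g. torch.Size([1, 320, 32, 32])
```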

johndpope commented 3 months ago

@sarperkilic that's this option: https://github.com/johndpope/Emote-hack/blob/main/Net.py#L123 - but then there's no need for self-attention? Maybe that's the right option. Will sleep on it.

johndpope commented 1 month ago

@wangshiwen-ai / @sarperkilic - I've been working on another paper recreation: https://arxiv.org/abs/2207.07621 https://github.com/johndpope/MegaPortrait-hack

N.B. - similar code is actually slated for release in a couple of months

I could use some extra eyes - the code is mostly complete, but I'm having trouble with the warp generator.