wangshiwen-ai opened 3 months ago
Hi @wangshiwen-ai
The ReferenceNet is a mess at the moment. I just pushed some code and it's a problem.
For the ReferenceNet I collapsed the training part, so it just extracts feature maps. But now I'm having doubts.
From the diagram there's no full forward pass - but reading this, the extraction happens on their trained model (trained on millions of images), so we still need a training stage...
https://blog.metaphysic.ai/plausible-stable-diffusion-video-from-a-single-image/ - they are operating in tandem.
I'm not clear on
https://github.com/johndpope/Emote-hack/blob/main/train_stage_1_0.py#L194 (this needs updating) https://github.com/johndpope/Emote-hack/blob/main/Net.py#L53C4-L65C31
I had this function to extract them: https://github.com/johndpope/Emote-hack/issues/27

q) Do the motion frames need to be black and white? (DiffusedHeads did this. I'm going through their forks to find a working remote branch.)

q) Should the images be resized to 256x256 from 512? I think not, right? That results in a much smaller VAE-encoded latent - 32x32? Reading this - https://blog.metaphysic.ai/plausible-stable-diffusion-video-from-a-single-image/ - it seems the videos are 512x512.

q) Concatenating the reference + motion frames - I drafted this, but I need to successfully pass the whole tensor through the first layer of ReferenceNet to just spit out features - then it's on to the backbone and 'injection into the resolution stratum'.
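On the resize question, a quick sanity check of the latent sizes, assuming the standard SD VAE (8x spatial downsampling, 4 latent channels - those numbers are an assumption about the stock SD autoencoder, not code from this repo):

```python
def vae_latent_shape(height, width, latent_channels=4, downsample=8):
    """Latent shape produced by a standard SD VAE encode.
    Assumes 8x spatial downsampling and 4 latent channels (stock SD)."""
    return (latent_channels, height // downsample, width // downsample)

# 512x512 frames encode to 64x64 latents; resizing to 256x256 gives 32x32.
print(vae_latent_shape(512, 512))  # (4, 64, 64)
print(vae_latent_shape(256, 256))  # (4, 32, 32)
```

So yes - dropping to 256x256 quarters the latent area, which is probably why the paper sticks with 512x512.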
Related - this suggests throwing out the Reference Attention and using Efficient Multi-Head Self-Attention (EMSA): https://github.com/johndpope/Emote-hack/issues/16
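For context, EMSA (from the ResT paper) cuts attention cost by spatially downsampling K/V with a strided depthwise conv before the attention. A minimal sketch - the hyperparameters (`heads`, `sr_ratio`) are illustrative, not taken from issue #16:

```python
import torch
import torch.nn as nn

class EMSA(nn.Module):
    """Sketch of Efficient Multi-Head Self-Attention (ResT-style):
    K and V are spatially reduced by a strided depthwise conv, so the
    attention matrix is N x (N / sr_ratio^2) instead of N x N."""
    def __init__(self, dim, heads=8, sr_ratio=2):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # strided depthwise conv: reduces the K/V token count by sr_ratio^2
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio, groups=dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):
        b, n, c = x.shape  # n == h * w spatial tokens
        q = self.q(x).view(b, n, self.heads, c // self.heads).transpose(1, 2)
        # reshape tokens back to a feature map to apply the reduction conv
        kv_in = x.transpose(1, 2).reshape(b, c, h, w)
        kv_in = self.sr(kv_in).flatten(2).transpose(1, 2)
        kv_in = self.norm(kv_in)
        k, v = self.kv(kv_in).chunk(2, dim=-1)
        k = k.view(b, -1, self.heads, c // self.heads).transpose(1, 2)
        v = v.view(b, -1, self.heads, c // self.heads).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```

Whether that's actually a win over plain reference attention here is an open question.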
Crazy day - some open-ended questions I don't have answers to. Following up on a query from @sarperkilic, I was focusing on ReferenceNet today. There are 3 different pathways - not sure which is the best route.
Some observations
Is the diagram wrong? Does ReferenceNet actually use self-attention?
Or is that just the nature of a UNet?
q) Why didn't they just say "UNet - SD" instead of ReferenceNet?
There's a way to inject self-attention into the UNet - I pushed this as a new model. https://github.com/johndpope/Emote-hack/commit/908ad9ab59b9307390595388466659d2475e3a01
Or there's a simpler way: upgrade ReferenceNet to use self-attention layers. https://github.com/johndpope/Emote-hack/blob/main/Net.py#L66
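For the record, the kind of self-attention layer that could be dropped into ReferenceNet might look like this - a minimal sketch with illustrative names, not the actual Net.py code:

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Self-attention over spatial positions of a feature map, with a
    residual connection - the building block being discussed for
    ReferenceNet (a sketch, not the repo's implementation)."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)  # assumes channels % 32 == 0
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        # (B, C, H, W) -> (B, H*W, C) token layout for attention
        tokens = self.norm(x).flatten(2).transpose(1, 2)
        out, _ = self.attn(tokens, tokens, tokens)
        return x + out.transpose(1, 2).view(b, c, h, w)  # residual add
```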
When I looked at the VideoNet by @jimmyl02 - it introduces reference attention when the model loads. https://github.com/jimmyl02/animate/blob/main/animate-anyone/models/videonet.py
The other option from earlier is to just extract the features from the UNet - that's just a function call. https://github.com/johndpope/Emote-hack/blob/main/Net.py#L87
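A sketch of that "just a function call" route using forward hooks, assuming a diffusers-style `down_blocks` ModuleList on the UNet; the `ToyUNet` below is a made-up stand-in just to show the mechanics:

```python
import torch
import torch.nn as nn

def collect_reference_features(unet, latents, timestep, encoder_hidden_states=None):
    """Run a frozen reference UNet once and capture the output of every
    down block via forward hooks. Assumes the model exposes a
    `down_blocks` ModuleList, like diffusers' UNet2DConditionModel."""
    features, handles = [], []
    for block in unet.down_blocks:
        handles.append(block.register_forward_hook(
            lambda mod, inp, out: features.append(out[0] if isinstance(out, tuple) else out)))
    with torch.no_grad():  # reference net stays frozen - no gradients needed
        unet(latents, timestep, encoder_hidden_states=encoder_hidden_states)
    for h in handles:
        h.remove()
    return features  # one feature map per resolution level

class ToyUNet(nn.Module):
    """Hypothetical stand-in for a real UNet, only to exercise the hooks."""
    def __init__(self):
        super().__init__()
        self.down_blocks = nn.ModuleList(
            [nn.Conv2d(4, 8, 3, 2, 1), nn.Conv2d(8, 16, 3, 2, 1)])
    def forward(self, x, t, encoder_hidden_states=None):
        for block in self.down_blocks:
            x = block(x)
        return x
```

The upside is you never have to modify the reference model itself - the hooks give you one feature map per resolution level for the "injection into the resolution stratum".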
If ReferenceNet doesn't need updating - then it should be frozen too. This doc explains quite well that the ReferenceNet model was trained on a huge amount of data.
https://blog.metaphysic.ai/plausible-stable-diffusion-video-from-a-single-image/
So I introduced that as a preprocessing stage 1 (though it doesn't match up to the diagram): https://github.com/johndpope/Emote-hack/blob/main/train_stage_1_referencenet.py
Hi,
one thing I am not sure about:
the way you pushed it passes latent_representations through the whole reference UNet https://github.com/johndpope/Emote-hack/commit/908ad9ab59b9307390595388466659d2475e3a01
but the diagram from the paper shows that the input to the backbone network should come from the first down-block layer of the reference net, not from the end.
@sarperkilic that's this option https://github.com/johndpope/Emote-hack/blob/main/Net.py#L123 - but then there's no need for self-attention? Maybe that's the right option. Will sleep on it.
@wangshiwen-ai / @sarperkilic - I've been working on another paper recreation - https://arxiv.org/abs/2207.07621 https://github.com/johndpope/MegaPortrait-hack
N.B. - similar code is actually slated for release in a couple of months
I could use some extra eyes - the code is mostly complete, but I'm having trouble with the warp generator.
The key point on the matrix dimensions is that cross_attention_dim defaults to 768, but we don't need the text embedding. So I think we could modify the cross-attention layers and ignore cross-attention in the reference net entirely.
In the denoising net, I think it should take the noisy image latents as input and cross-attend with the reference features to reconstruct the reference images. And the UNet in stage 1 may be all 2D.
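A sketch of that idea: flatten the reference feature map into tokens and project them to cross_attention_dim, so the denoising UNet can take them as `encoder_hidden_states` in place of text embeddings. `RefFeatureProjector` and the channel counts are hypothetical, not from the paper or the repo:

```python
import torch
import torch.nn as nn

class RefFeatureProjector(nn.Module):
    """Project a reference feature map (B, C, H, W) into a token sequence
    (B, H*W, cross_attention_dim) that the denoising UNet can cross-attend
    to instead of CLIP text embeddings. Hypothetical sketch."""
    def __init__(self, feat_channels, cross_attention_dim=768):
        super().__init__()
        self.proj = nn.Linear(feat_channels, cross_attention_dim)

    def forward(self, ref_feats):
        b, c, h, w = ref_feats.shape
        # (B, C, H, W) -> (B, H*W, C) -> (B, H*W, 768)
        tokens = ref_feats.flatten(2).transpose(1, 2)
        return self.proj(tokens)
```

The appeal is that the pretrained cross-attention layers stay structurally untouched - only what they attend to changes, from text tokens to reference tokens.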