johndpope / MegaPortrait-hack

Using Claude Opus to reverse engineer code from MegaPortraits: One-shot Megapixel Neural Head Avatars
https://arxiv.org/abs/2207.07621

PR key: zs&es vectors, change model outputs and structure #23

Closed JackAILab closed 4 weeks ago

JackAILab commented 4 weeks ago

We have recently been replicating this work and have obtained some preliminary results. As we understand it, Emtn and Eapp are encoders for motion and identity (ID) respectively: es and zs characterize motion features, and vs characterizes ID features. One very important parameter is COMPRESS_DIM (TODO 1). The paper does not state it explicitly; we tentatively consider 512 a reasonable compression dimension.

First (TODO 2, fc_layer): es is a global descriptor whose shape is aligned with the expression feature vector zs. The global descriptor should therefore also be a vector, with shape (bs, vector_dim).

However, the current version of the code always treats es and zs as feature maps with shape (bs, vector_dim, Height, Width). We have already changed this in the PR.
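The difference between the two shapes can be sketched as follows. This is a minimal illustration, not the code from the PR: the class name `GlobalDescriptorHead`, the use of global average pooling, and the 2048-channel input are assumptions made for the example; only COMPRESS_DIM = 512 and the target shape (bs, vector_dim) come from the discussion above.

```python
import torch
import torch.nn as nn

COMPRESS_DIM = 512  # tentative value from the discussion; not stated in the paper


class GlobalDescriptorHead(nn.Module):
    """Hypothetical head: maps a (bs, C, H, W) feature map to a (bs, COMPRESS_DIM) vector."""

    def __init__(self, in_channels: int = 2048, out_dim: int = COMPRESS_DIM):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # (bs, C, H, W) -> (bs, C, 1, 1)
        self.fc = nn.Linear(in_channels, out_dim)  # (bs, C) -> (bs, out_dim)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        pooled = self.pool(feature_map).flatten(1)  # drop the spatial dimensions
        return self.fc(pooled)


feature_map = torch.randn(4, 2048, 2, 2)  # stand-in for a backbone output
es = GlobalDescriptorHead()(feature_map)
print(es.shape)  # torch.Size([4, 512])
```

With this kind of head, es and zs are plain vectors and can be summed or multiplied with learned matrices without any spatial broadcasting.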

Second (TODO 3, zs_sum): according to the paper (page 11: "To generate adaptive parameters, we multiply the foregoing sums and additionally learned matrices for each pair of parameters."), adaptive_matrix_gamma should be retained. Its purpose is not to change the shape; it provides learnable parameters, which is more reasonable than using the sum alone. We have therefore restored adaptive_matrix_gamma at the correct position in the PR.
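One possible reading of that sentence from the paper is sketched below. The module name `AdaptiveParams`, the matrix shapes, the `adaptive_matrix_beta` counterpart, and the way gamma and beta are applied to a 5D feature volume are all assumptions for illustration; the paper only says that the sums are multiplied by learned matrices.

```python
import torch
import torch.nn as nn


class AdaptiveParams(nn.Module):
    """Hypothetical sketch: turn the summed descriptors into per-channel (gamma, beta)."""

    def __init__(self, descriptor_dim: int = 512, num_channels: int = 64):
        super().__init__()
        # Learned matrices, one per parameter of the (gamma, beta) pair.
        self.adaptive_matrix_gamma = nn.Parameter(0.01 * torch.randn(descriptor_dim, num_channels))
        self.adaptive_matrix_beta = nn.Parameter(0.01 * torch.randn(descriptor_dim, num_channels))

    def forward(self, es: torch.Tensor, zs: torch.Tensor):
        zs_sum = es + zs                             # (bs, descriptor_dim)
        gamma = zs_sum @ self.adaptive_matrix_gamma  # (bs, num_channels)
        beta = zs_sum @ self.adaptive_matrix_beta    # (bs, num_channels)
        return gamma, beta


es, zs = torch.randn(2, 512), torch.randn(2, 512)
gamma, beta = AdaptiveParams()(es, zs)

# Assumed usage: modulate a (bs, C, D, H, W) feature volume channel-wise.
x = torch.randn(2, 64, 4, 8, 8)
out = x * gamma.view(2, 64, 1, 1, 1) + beta.view(2, 64, 1, 1, 1)
print(gamma.shape, out.shape)
```

Because the matrices are `nn.Parameter` objects, they receive gradients during training, which is the point made above: they add learnable capacity rather than just reshaping the sum.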

By the way, your implementation of the 3D warping operator is very nice; we defined it the same way (see the apply_warping_field function, which uses F.grid_sample()). We appreciate the great contribution from your team and look forward to perfectly replicating this work together!
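For readers unfamiliar with 3D warping through `F.grid_sample()`, here is a minimal sketch. The signature and tensor layout of `apply_warping_field` below are assumptions and may differ from the function in the repo; the warp field is assumed to hold absolute sampling coordinates normalized to [-1, 1].

```python
import torch
import torch.nn.functional as F


def apply_warping_field(volume: torch.Tensor, warp_field: torch.Tensor) -> torch.Tensor:
    """Resample a feature volume at the locations given by a warp field.

    volume:     (bs, C, D, H, W)
    warp_field: (bs, 3, D, H, W), normalized (x, y, z) sampling coordinates (assumed layout)
    """
    grid = warp_field.permute(0, 2, 3, 4, 1)  # grid_sample expects (bs, D, H, W, 3)
    return F.grid_sample(volume, grid, mode="bilinear",
                         padding_mode="border", align_corners=False)


volume = torch.randn(2, 8, 4, 16, 16)

# Sanity check: an identity warp should return the input volume unchanged.
theta = torch.eye(3, 4).unsqueeze(0).repeat(2, 1, 1)  # (bs, 3, 4) identity affine
identity_grid = F.affine_grid(theta, size=volume.shape, align_corners=False)
identity_field = identity_grid.permute(0, 4, 1, 2, 3)  # back to (bs, 3, D, H, W)

warped = apply_warping_field(volume, identity_field)
print(torch.allclose(warped, volume, atol=1e-5))
```

The identity check is a cheap way to confirm that the coordinate order and the `align_corners` setting agree between the code that builds the field and the code that samples with it.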

johndpope commented 4 weeks ago

Hi Jiehui, thanks so much again. Can you please check the code? I merged the PR but it broke. Maybe something else is missing? I have to go out today, so I won't be online till later tonight. [Screenshot from 2024-05-29 07-04-20]

JackAILab commented 4 weeks ago

@johndpope No missing code submissions have been found yet, but I have added instructions for the data. You need to make sure that es_resnet/expression_resnet.shape is (bs, 512, 2, 2). You may also need to further check and merge your code (if necessary).

flyingshan commented 4 weeks ago

> @johndpope No missing code submissions have been found yet, but I have added instructions for the data. You need to make sure that es_resnet/expression_resnet.shape is (bs, 512, 2, 2). You may also need to further check and merge your code (if necessary).

May I ask whether you adopted the network structure of Eapp/G3D etc. from the current repo, or implemented them yourself according to the paper? I noticed some differences from the original paper in this repo (such as one fewer avgpool in Eapp, a different G3D structure, etc.). I wonder how these changes affect the results. Thank you!

johndpope commented 4 weeks ago

@JackAILab Were there any changes to the custom resnet.py code? Did you restore layer4?

johndpope commented 4 weeks ago

from your comment - when I run with 2 - instead of 4 - it's running. thanks.

    FEATURE_SIZE_AVG_POOL = 2  # 🤷 these should align
    FEATURE_SIZE = (2, 2)  # 🤷 1x1? 4x4? idk
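A small check like the one below can catch this kind of mismatch early. It is only a sketch: it assumes the two constants feed an adaptive average pool on top of a 512-channel feature map, which may not match how the repo actually uses them; the 8x8 input is a made-up stand-in.

```python
import torch
import torch.nn as nn

FEATURE_SIZE_AVG_POOL = 2
FEATURE_SIZE = (2, 2)

# Both constants describe the same spatial size, so fail fast if they drift apart.
assert FEATURE_SIZE == (FEATURE_SIZE_AVG_POOL, FEATURE_SIZE_AVG_POOL)

pool = nn.AdaptiveAvgPool2d(FEATURE_SIZE_AVG_POOL)
backbone_features = torch.randn(4, 512, 8, 8)  # hypothetical backbone feature map
expression_resnet = pool(backbone_features)

# Expected shape from the comment above: (bs, 512, 2, 2).
assert expression_resnet.shape[1:] == (512, *FEATURE_SIZE)
print(expression_resnet.shape)  # torch.Size([4, 512, 2, 2])
```

Deriving one constant from the other (instead of setting both by hand) would remove the need for the first assertion altogether.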

JackAILab commented 4 weeks ago

@johndpope Yes, I overlooked that detail and was just about to bring it up. Haha, your efficiency is truly impressive!

johndpope commented 1 week ago

Did you end up swapping in a VAE for this?

I was thinking of maybe using this:

    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")


        '''
        ### TODO 2: Change vs/es here for vector size
        According to the description in the paper (Page 11: predict the head pose and expression vector),
        zs should be a global descriptor, i.e. a vector. Otherwise, the existence of Emtn and Eapp is of little significance:
        if the output feature is a matrix, it is basically not compressed, and this encoder could be completely replaced by a VAE.
        '''
        filters = [64, 256, 512, 1024, 2048]  # channel widths of the backbone stages
        outputs = COMPRESS_DIM  # length of the global descriptor vector
        self.fc = torch.nn.Linear(filters[4], outputs)  # last-stage features -> descriptor