johndpope / MegaPortrait-hack

Using Claude Opus to reverse engineer code from MegaPortraits: One-shot Megapixel Neural Head Avatars
https://arxiv.org/abs/2207.07621

Roadmap #1

Closed johndpope closed 3 weeks ago

johndpope commented 1 month ago

Based on the provided code and the MegaPortraits paper, here are some suggestions to better align the code with the paper:

  1. In the Eapp class, the architecture generally matches the appearance encoder described in the paper. However, the paper mentions using weight standardization in the convolutional layers, which is not implemented in the current code. Consider adding weight standardization to the convolutional layers (a minimal sketch of a weight-standardized convolution is included after this list).

  2. The Emtn class should output the rotation parameters (Rs, Rd), translation parameters (ts, td), and expression vectors (zs, zd) for both the source and driving images. The current implementation seems to be missing the translation parameters. Update the Emtn class to output all the required parameters.

  3. The warping generators (Ws2c and Wc2d) in the paper take the rotation, translation, expression, and appearance features as separate inputs. The current implementation of WarpGenerator and WarpingGenerator doesn't seem to match this exactly. Update these classes to take the separate inputs as described in the paper.

  4. The warping process in the paper follows a specific order: first, the volumetric features (vs) are warped using ws2c to obtain the canonical volume (vc). Then, vc is processed by G3d to obtain vc2d. Finally, vc2d is warped using wc2d to impose the driving motion. Ensure that the warping process in the Gbase class follows this order.

  5. The orthographic projection (denoted as P in the paper) is currently implemented as a reshape operation followed by a 1x1 convolution in the Eapp class. However, the paper describes it as an operation that projects the volumetric features onto the image plane. Consider updating the projection operation to match the paper's description. A combined sketch covering items 2-5 is included after this list.

  6. The Gbase class should combine the components (Eapp, Emtn, Ws2c, Wc2d, G3d, G2d) as described in the paper. Ensure that the forward pass of Gbase follows the same flow as mentioned in the paper.

  7. The high-resolution model (Genh) and the student model (Student) architectures seem to be missing some details from the paper. Review the paper's description of these models and update the implementations accordingly.

  8. The training process in the train_base, train_hr, and train_student functions should be updated to match the training procedures described in the paper. This includes the specific loss functions used, the optimization techniques, and the training data preparation.

  9. The paper mentions using a pre-trained gaze estimation model for the gaze loss. The current implementation of GazeLoss and GazeModel seems to be a placeholder. Consider integrating the actual pre-trained gaze estimation model as described in the paper.

  10. Review the hyperparameters, loss weights, and training configurations mentioned in the paper and ensure that they are properly set in the code.
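For item 1, here is a minimal sketch of weight standardization that could be dropped into Eapp's convolutional layers, assuming they are plain `nn.Conv2d` today; the name `WSConv2d` is hypothetical and not taken from the repo:

```python
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with weight standardization: each output channel's kernel is
    normalized to zero mean and unit variance before the convolution runs."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# drop-in replacement for nn.Conv2d, usually paired with GroupNorm
conv = WSConv2d(64, 128, kernel_size=3, padding=1)
```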
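For items 2-5, a combined sketch of the forward pass the paper describes, written under stated assumptions: the module names (`eapp`, `emtn`, `ws2c`, `wc2d`, `g3d`, `g2d`) and their signatures follow the paper's notation rather than the current repo code, and the flow fields are assumed to be offsets in normalized [-1, 1] coordinates:

```python
import torch
import torch.nn.functional as F

def apply_warp(volume, flow):
    """Warp a 5D feature volume (B, C, D, H, W) with a dense 3D flow field
    (B, 3, D, H, W) via grid_sample; flow channels are assumed to be (x, y, z)
    offsets in normalized coordinates."""
    B, _, D, H, W = volume.shape
    zs, ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, D), torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
        indexing="ij")
    grid = torch.stack((xs, ys, zs), dim=-1).unsqueeze(0).expand(B, -1, -1, -1, -1)
    grid = grid.to(volume.device) + flow.permute(0, 2, 3, 4, 1)
    return F.grid_sample(volume, grid, align_corners=True)

def gbase_forward(eapp, emtn, ws2c, wc2d, g3d, g2d, x_s, x_d):
    """Hypothetical Gbase forward pass following the order in the paper:
    warp vs -> canonical volume vc -> G3D -> warp with driving motion -> project -> G2D."""
    # appearance encoder: volumetric features + global descriptor from the source image
    v_s, e_s = eapp(x_s)

    # motion encoder: rotation, translation, and expression for source AND driving images
    R_s, t_s, z_s = emtn(x_s)
    R_d, t_d, z_d = emtn(x_d)

    # source-to-canonical warping, driven by the source motion and the appearance descriptor
    w_s2c = ws2c(R_s, t_s, z_s, e_s)
    v_c = apply_warp(v_s, w_s2c)          # canonical volume

    # 3D convolutional network operating in canonical space
    v_c2d = g3d(v_c)

    # canonical-to-driving warping, driven by the driving motion
    w_c2d = wc2d(R_d, t_d, z_d, e_s)
    v_d = apply_warp(v_c2d, w_c2d)

    # orthographic projection P: collapse the depth axis onto the image plane
    # (here by folding depth into channels; averaging over depth is another option)
    B, C, D, H, W = v_d.shape
    f_2d = v_d.view(B, C * D, H, W)

    return g2d(f_2d)
```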

These are some high-level suggestions based on the provided code and the MegaPortraits paper. It's important to carefully review the paper and align the implementation details accordingly. Additionally, make sure to test the code thoroughly and verify that it produces the expected results as described in the paper.

Jiezju commented 1 month ago

@johndpope Hello, sorry for taking so long to reply! According to the paper's supplementary materials, "To generate adaptive parameters, we multiply the foregoing sums and additionally learned matrices for each pair of parameters" in the warping generators. Based on this, I think there may be a mismatch in your code: `warp = rt_warp + emotion_warp`. I guess it should perhaps be `emotion_warp * rot_mat + trans`. Just my view.
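A minimal sketch of the two combination rules being compared; the names `rt_warp`, `emotion_warp`, `rot_mat`, and `trans` follow the comment above and are assumptions about the shapes used in the actual code:

```python
import torch

def combine_additive(rt_warp, emotion_warp):
    # current code as described above: a plain sum of the two flow fields
    return rt_warp + emotion_warp

def combine_rigid(emotion_warp, rot_mat, trans):
    # suggested alternative: treat the head pose as a rigid transform applied to the
    # expression-driven flow, i.e. rotate each flow vector and then add the translation
    # emotion_warp: (B, 3, D, H, W), rot_mat: (B, 3, 3), trans: (B, 3)
    B, _, D, H, W = emotion_warp.shape
    flow = emotion_warp.reshape(B, 3, -1)      # (B, 3, D*H*W)
    flow = torch.bmm(rot_mat, flow)            # rotate each flow vector
    flow = flow + trans.view(B, 3, 1)          # broadcast the translation
    return flow.view(B, 3, D, H, W)
```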

johndpope commented 1 month ago

After auditing the supplementary materials (feeding the separate figures into Claude and asking it to align different parts of the code to the diagrams), I had to throw out the WarpingGenerator model.

There are other loose ends in this codebase.

Making progress:

I've thoroughly reviewed the code and compared it with the MegaPortraits paper. The code aligns well with the architectures and training stages described in the paper. Here are a few key points:

- The Eapp class in the code corresponds to the appearance encoder (Eapp) in the diagram. It accurately captures the two parts: producing volumetric features (vs) and producing a global descriptor (es).
- The WarpGenerator class represents the warping generators (Ws2c and Wc2d) in the diagram. It takes rotation, translation, expression, and appearance features as input and generates the warping field.
- The G3d class aligns with the 3D convolutional network (G3D) in the diagram, consisting of a downsampling path and an upsampling path.
- The G2d class corresponds to the 2D convolutional network (G2D) in the diagram, taking the projected features and generating the final output image.
- The Gbase class combines the components of the base model as described in the paper.
- The Genh class represents the high-resolution model, with an encoder, residual blocks, and a decoder.
- The GHR class combines the base model (Gbase) and the high-resolution model (Genh) as described in the paper.
- The Student class represents the student model, with an encoder, decoder, and SPADE blocks for avatar conditioning.
- The training stages (train_base, train_hr, and train_student) align with the training procedures described in the paper.
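As a small illustration of the GHR wrapper mentioned above, a sketch that composes the two models; the class and attribute names mirror this comment, while the constructor and forward signatures are assumptions:

```python
import torch.nn as nn

class GHR(nn.Module):
    """Full (teacher) model: the base model followed by the high-resolution enhancer."""
    def __init__(self, gbase: nn.Module, genh: nn.Module):
        super().__init__()
        self.gbase = gbase
        self.genh = genh

    def forward(self, x_source, x_driving):
        x_base = self.gbase(x_source, x_driving)  # low-resolution reenacted image
        return self.genh(x_base)                  # high-resolution refinement
```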

Regarding the A.3 Student model section, the code is consistent with the description in the paper. The student model consists of an encoder, a decoder with custom residual blocks, and SPADE (Spatially-Adaptive Normalization) blocks for conditioning on the avatar index. The architecture follows the structure shown in the diagram, with the encoder downsampling the input image and the decoder upsampling and generating the output image. However, there are a few minor points to consider:

- The paper mentions that the student model is trained to mimic the prediction of the full (teacher) model, which combines the base model and an enhancer. In the code, the student model is trained to mimic the high-resolution model (GHR) directly.
- The paper mentions that the student model is trained only in the cross-driving mode by generating pseudo-ground truth with the teacher model. In the code, the training data for the student model is not explicitly set up in the cross-driving mode.
- The specific details of the loss functions used for training the student model are not provided in the code snippet.
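On the second point, a minimal sketch of a cross-driving distillation step, assuming a frozen GHR teacher, a Student conditioned on an avatar index, and a set of pre-selected avatar source images; every name and signature here is an assumption, and the L1 term is only a placeholder for the paper's full objective:

```python
import torch
import torch.nn.functional as F

def student_distillation_step(teacher, student, avatar_sources, driver_image,
                              avatar_idx, optimizer):
    """One distillation step in cross-driving mode: the frozen teacher produces a
    pseudo-ground-truth image for the chosen avatar driven by an unrelated driver
    frame, and the student is trained to reproduce it."""
    source = avatar_sources[avatar_idx]            # appearance of the chosen avatar

    with torch.no_grad():                          # the teacher is not updated
        pseudo_gt = teacher(source, driver_image)  # cross-driving pseudo-ground truth

    pred = student(driver_image, avatar_idx)       # student sees only the driver + index

    loss = F.l1_loss(pred, pseudo_gt)              # placeholder reconstruction term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```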

Overall, the code aligns well with the architectures and training stages described in the MegaPortraits paper. The minor points mentioned above could be clarified or adjusted based on the specific implementation details provided in the paper.