johndpope / VASA-1-hack

Using Claude Opus to reverse engineer code from VASA white paper - WIP - (this is for La Raza 🎷)
https://www.microsoft.com/en-us/research/project/vasa-1/
MIT License

Aligning the disentanglement to the whitepaper - as done in MegaPortraits #8

Closed. johndpope closed this issue 4 months ago.

johndpope commented 5 months ago

Using the MegaPortraits code from @kevinfringe as context - https://github.com/johndpope/MegaPortrait (with Claude fixes).

[Screenshot from 2024-04-28 17-04-34]

**To achieve this, we base our model on the 3D-aided face reenactment framework from wang2021one; drobyshev2022megaportraits. The 3D appearance feature volume can better characterize the appearance details in 3D compared to 2D feature maps. The explicit 3D feature warping is also powerful in modeling 3D head and facial movements. Specifically, we decompose a facial image into a canonical 3D appearance volume 𝐕_app, an identity code 𝐳_id, a 3D head pose 𝐳_pose, and a facial dynamics code 𝐳_dyn. Each of them is extracted from a face image by an independent encoder, except that 𝐕_app is constructed by first extracting a posed 3D volume followed by rigid and non-rigid 3D warping to the canonical volume, as done in drobyshev2022megaportraits. A single decoder 𝒟 takes these latent variables as input and reconstructs the face image, where similar warping fields in the inverse direction are first applied to 𝐕_app to get the posed appearance volume. Readers are referred to drobyshev2022megaportraits for more details of this architecture.**



import torch.nn as nn


class Eapp1(nn.Module):
    def __init__(self):
        super().__init__()
        # ... (existing code)

    def forward(self, x):
        # ... (existing code)
        return v_can  # Return the canonical 3D appearance volume

class Eapp2(nn.Module):
    def __init__(self):
        super().__init__()
        # ... (existing code)

    def forward(self, x):
        # ... (existing code)
        return z_id  # Return the identity code

class HPE_EDE(nn.Module):
    def __init__(self):
        super().__init__()
        # ... (existing code)

    def forward(self, x):
        # ... (existing code)
        return z_pose, z_dyn  # Return the 3D head pose and facial dynamics code

class WarpGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # ... (implement warping generators)

    def forward(self, z_pose, z_dyn):
        # Generate warping fields w_s2c and w_c2d from the head pose and facial dynamics
        # ... (implement warping field generation)
        return w_s2c, w_c2d

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.warp_generator = WarpGenerator()
        # ... (existing code)

    def forward(self, v_can, z_id, z_pose, z_dyn):
        # Generate the warping fields from the head pose and facial dynamics
        w_s2c, w_c2d = self.warp_generator(z_pose, z_dyn)

        # Apply the warping fields to v_can to get the posed appearance volume
        # (warp is a placeholder for the 3D warping op, as in MegaPortraits)
        v_posed = warp(v_can, w_s2c, w_c2d)

        # Decode the posed appearance volume together with the identity code
        # ... (existing code)
        return face_image

The changes above would require modifications to several parts of the existing code to ensure proper integration. Here's an overview of how these components would interact with the rest of the codebase:

Training Loop:

- Update the training loop to pass the input images through the modified Eapp1, Eapp2, and HPE_EDE modules to extract the canonical 3D appearance volume (v_can), identity code (z_id), 3D head pose (z_pose), and facial dynamics code (z_dyn).
- Use the warping generators (WarpGenerator) to generate the warping fields (w_s2c and w_c2d) based on the head pose and facial dynamics.
- Pass the canonical volume, identity code, head pose, and facial dynamics to the updated decoder to reconstruct the face image.
- Calculate the loss functions based on the reconstructed face image and the ground truth.
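A rough sketch of one generator training step, assuming the module names from the skeleton above and using a plain L1 reconstruction term as a stand-in for the full loss set (training_step and its arguments are placeholders, not from the paper):

import torch.nn.functional as F

# Hypothetical wiring of one training step; the warping-field generation is assumed
# to happen inside the Decoder, as in the skeleton above.
def training_step(source, driving, e_app1, e_app2, hpe_ede, decoder, optimizer):
    v_can = e_app1(source)            # canonical 3D appearance volume from the source
    z_id = e_app2(source)             # identity code from the source
    z_pose, z_dyn = hpe_ede(driving)  # head pose and facial dynamics from the driver

    # Reconstruct the face image (the decoder internally generates w_s2c / w_c2d
    # from z_pose and z_dyn and warps v_can)
    recon = decoder(v_can, z_id, z_pose, z_dyn)

    # Stand-in reconstruction loss; the real setup adds perceptual, adversarial,
    # and disentanglement terms (see Losses below)
    loss = F.l1_loss(recon, driving)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()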

Losses:

- Modify the loss functions to account for the changes in the architecture.
- Update the perceptual loss (PerceptualLoss) to compare the reconstructed face image with the ground truth.
- Adapt the adversarial loss (GANLoss) to handle the reconstructed face image.
- Introduce additional losses specific to the 3D-aided face reenactment framework, such as losses for preserving identity, head pose, and facial dynamics.
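A hedged sketch of how the combined generator loss could be assembled. PerceptualLoss, GANLoss, and the encoders are assumed to exist as referenced above (the GANLoss interface shown is the common pix2pix-style one), and the cycle-style identity / pose / dynamics terms plus the weights are placeholders:

import torch.nn.functional as F

# Illustrative loss combination; interfaces and weights are assumptions.
def generator_losses(recon, target, discriminator, perceptual_loss, gan_loss,
                     e_app2, hpe_ede, z_id, z_pose, z_dyn):
    loss_perc = perceptual_loss(recon, target)                      # perceptual term
    loss_adv = gan_loss(discriminator(recon), target_is_real=True)  # fool the discriminator

    # Cycle-style consistency: re-extract the codes from the reconstruction and
    # compare them with the codes that were used to generate it
    z_id_rec = e_app2(recon)
    z_pose_rec, z_dyn_rec = hpe_ede(recon)
    loss_id = F.l1_loss(z_id_rec, z_id)
    loss_pose = F.l1_loss(z_pose_rec, z_pose)
    loss_dyn = F.l1_loss(z_dyn_rec, z_dyn)

    return loss_perc + 0.1 * loss_adv + loss_id + loss_pose + loss_dyn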

Discriminator:

- Update the discriminator (Discriminator) to take the reconstructed face image as input instead of the generated image from the previous architecture.
- Modify the discriminator's architecture if necessary to accommodate the changes in the generator.
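For the discriminator side, a minimal sketch (again assuming a pix2pix-style GANLoss); the important detail is detaching the reconstruction so generator gradients don't flow through the discriminator update:

# Sketch of the discriminator update with the reconstructed face image.
def discriminator_step(discriminator, gan_loss, recon, real, d_optimizer):
    loss_real = gan_loss(discriminator(real), target_is_real=True)
    loss_fake = gan_loss(discriminator(recon.detach()), target_is_real=False)
    d_loss = 0.5 * (loss_real + loss_fake)

    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()
    return d_loss.item()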

Dataset and Data Loading:

- Ensure that the dataset (FramesDataset) provides the necessary data for training the modified architecture, including the source and driving frames.
- Update the data loading process to handle any additional data requirements, such as 3D keypoints or annotations, if required by the 3D-aided face reenactment framework.
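As an illustration of the (source, driving) pairs the dataset needs to provide, here is a minimal hypothetical wrapper (FramePairDataset is not in the codebase; it just shows the sampling):

import random
from torch.utils.data import Dataset

class FramePairDataset(Dataset):
    """Hypothetical wrapper: samples a source and a driving frame from the same video."""

    def __init__(self, videos, transform=None):
        self.videos = videos          # list of per-video frame sequences
        self.transform = transform

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        frames = self.videos[idx]
        i, j = random.sample(range(len(frames)), 2)  # two distinct frame indices
        src, drv = frames[i], frames[j]
        if self.transform is not None:
            src, drv = self.transform(src), self.transform(drv)
        return {"source": src, "driving": drv}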

Inference and Evaluation:

- Modify the inference and evaluation scripts to use the updated architecture.
- Extract the canonical 3D appearance volume, identity code, head pose, and facial dynamics from the source and driving frames during inference.
- Use the warping generators to generate the warping fields based on the extracted information.
- Pass the extracted information and warping fields to the decoder to generate the final output frames.
- Evaluate the generated frames using appropriate metrics and compare them with the ground truth.
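A sketch of the inference side, following the same module names as above: appearance and identity come from the single source image, pose and dynamics from each driving frame.

import torch

@torch.no_grad()
def reenact(source, driving_frames, e_app1, e_app2, hpe_ede, decoder):
    v_can = e_app1(source)   # canonical appearance volume from the source image
    z_id = e_app2(source)    # identity code from the source image

    outputs = []
    for frame in driving_frames:
        z_pose, z_dyn = hpe_ede(frame)   # motion from the driving frame
        outputs.append(decoder(v_can, z_id, z_pose, z_dyn))
    return outputs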

These are the main areas where the modifications interact with the existing code. You'll need to carefully review and update the relevant parts to integrate the 3D-aided face reenactment framework into the VASA project; the specific implementation details may vary depending on its requirements and design choices. Refer to the original papers (wang2021one and drobyshev2022megaportraits) for more detail on the architectures, loss functions, and training procedures used in this framework.

johndpope commented 5 months ago

I started this branch - I think we have to get MegaPortraits definitively working before this code gets off the ground. https://github.com/johndpope/VASA-1-hack/tree/MegaPortraits

I've rebuilt the MegaPortraits codebase: https://github.com/johndpope/MegaPortrait-hack

I'm looking at training MegaPortraits - and am a tiny bit put off by the effort to get the data in shape. https://github.com/johndpope/MegaPortrait-hack/issues/2

trithucxx commented 5 months ago

> I started this branch - I think we have to get MegaPortraits definitively working before this code gets off the ground. https://github.com/johndpope/VASA-1-hack/tree/MegaPortraits
>
> I've rebuilt the MegaPortraits codebase: https://github.com/johndpope/MegaPortrait-hack
>
> I'm looking at training MegaPortraits - and am a tiny bit put off by the effort to get the data in shape. johndpope/MegaPortrait-hack#2

Glad to see your training progress updates.

johndpope commented 5 months ago

With some help from Aleksey @Kwentar this is progressing slowly. Very grateful for the extra eyes. https://github.com/johndpope/MegaPortrait-hack/issues/4

I may have to write a blog post about the process of using AI for help. It reminds me of a podcast - This Day in AI - where the guy explained that his Roomba robot vacuum "cleaned" his carpet - the only problem being the dog had shat and the poo went everywhere. https://podcasts.apple.com/sg/podcast/this-day-in-ai-podcast/id1671087656

Kinda the same thing here.

@JZArray has supposedly implemented the paper already... maybe he can help get this VASA-1 one over the line. @JZArray - name your price. I need this code to align to the Microsoft VASA paper, boot up, and start training.

@trithucxx - the keypoint detector in this codebase, taken from Real3dportraits, is erroneous. I guess it should use the same approach as MegaPortraits - ResNet-50 / ResNet-18 - though it would have been good if Microsoft had spelled this out more concretely. https://github.com/johndpope/VASA-1-hack/blob/main/Net.py#L90
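Something along these lines is what I have in mind - a hedged sketch of a ResNet-18 / ResNet-50 pose-and-dynamics encoder in the spirit of MegaPortraits (the head layers and output dimensions here are my assumptions; the VASA paper doesn't spell them out):

import torch.nn as nn
from torchvision.models import resnet18, resnet50

class HeadPoseExpressionEstimator(nn.Module):
    """Hypothetical sketch: ResNet-18 for head pose, ResNet-50 for facial dynamics."""

    def __init__(self, dyn_dim=256):
        super().__init__()
        self.pose_backbone = resnet18(weights=None)   # would normally load pretrained weights
        self.dyn_backbone = resnet50(weights=None)
        self.pose_backbone.fc = nn.Identity()         # expose 512-d features
        self.dyn_backbone.fc = nn.Identity()          # expose 2048-d features
        self.pose_head = nn.Linear(512, 6)            # rotation (3) + translation (3)
        self.dyn_head = nn.Linear(2048, dyn_dim)      # facial dynamics code

    def forward(self, x):
        z_pose = self.pose_head(self.pose_backbone(x))
        z_dyn = self.dyn_head(self.dyn_backbone(x))
        return z_pose, z_dyn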

[Screenshot 2024-05-20 at 7:04:48 am]

JZArray commented 5 months ago

@johndpope Sorry, I cannot share the code.

johndpope commented 4 months ago

@trithucxx - I got the MegaPortrait base model to boot up into the training loop with losses. I added breadcrumbs inside the model code marking where I'm unsure about the AI's / the paper's specifics 🤷. There's a debug flag to show all the tensor sizes. I'm also considering simplifying the warping code - there's a ticket with specifics - which may give better results. https://github.com/johndpope/MegaPortrait-hack/blob/main/model.py#L231
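The debug flag is nothing fancy - roughly this pattern (names are illustrative):

DEBUG = False  # flip to True to print tensor sizes through the forward pass

def debug_print(name, tensor):
    if DEBUG:
        print(f"{name}: {tuple(tensor.shape)}")

# e.g. inside a forward():
#   debug_print("v_can", v_can)
#   debug_print("w_s2c", w_s2c)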

Once EmoPortraits (with models / dataset) drops in less than 60 days, as an academic exercise I can circle back and cross-check where things went wrong. https://github.com/neeek2303/EMOPortraits

johndpope commented 4 months ago

My model / training is finally converging - https://github.com/johndpope/MegaPortrait-hack/issues/36

I just need to fix the gaze loss.

johndpope commented 4 months ago

spent some time on this today - work in progress. https://github.com/johndpope/VASA-1-hack/pull/13