johndpope / MegaPortrait-hack

Using Claude Opus to reverse engineer code from MegaPortraits: One-shot Megapixel Neural Head Avatars
https://arxiv.org/abs/2207.07621

ResBlock_Custom Mistake? #4

Closed Kwentar closed 1 month ago

Kwentar commented 1 month ago

https://github.com/johndpope/MegaPortrait-hack/blob/ff64b9421b1a37b652890ebb93a6303bb74baddb/model.py#L139

Hi, in the article, in the section describing Eapp, it says: "The architecture of our residual block can be seen in Figure 11 (c), where 𝑛 denotes the dimension of a convolutional layer (either 2D or 3D) and x denotes the number of output channels." However, your ResBlock_Custom looks like the block from Figure 10 (c), which is different. Is this a mistake, or am I misunderstanding you?

Btw, I have just started doing the same thing as you; I could make a PR on your repo if that works for you.

johndpope commented 1 month ago

thanks for pointing this out - I'll double check it. When I was hacking out the EMO paper, I asked the AI to add assertions to the code: https://github.com/search?q=repo%3Ajohndpope%2FEmote-hack%20assert&type=code

This helped force it to explain its inner thinking. It clogs up the code, but it can be very helpful. The amazing thing with the MegaPortraits paper is that they explain the architecture.
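
Here's the kind of thing I mean (an illustrative sketch only - the names are made up, not code from the repo):

# illustrative only: shape assertions that force the model (and the AI) to state
# its expectations explicitly at each step
def forward(self, x):
    assert x.dim() == 4, f"expected (B, C, H, W), got {tuple(x.shape)}"
    assert x.shape[1] == self.input_channels, \
        f"expected {self.input_channels} input channels, got {x.shape[1]}"
    out = self.conv1(x)
    assert out.shape[2:] == x.shape[2:], "3x3 conv with padding=1 should preserve H, W"
    return out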

BONUS: ask it to find improvements. https://github.com/johndpope/Emote-hack/issues/16

the training code will boot up and hit a dimension error on the models - there's gotta be a way to nurse this thing through a training loop.

BONUS BONUS: you can use this to dump all the Python classes into one file for context.


cat ~/dump.sh 
#!/bin/bash

# Target file to store all Python files' contents
output="all.py"

# Check if the output file already exists and delete it if it does
if [ -f "$output" ]; then
    rm "$output"
fi

# Find all Python files in current and subdirectories, then append their contents and a newline to the output file
find . -name '*.py' -exec sh -c 'cat "$1" >> "$2"; echo "" >> "$2"' _ {} "$output" \;

echo "All Python files have been concatenated into $output"

[Screenshot from 2024-05-16 08-32-40]

In this updated implementation:

The residual block consists of two convolutional layers (conv1 and conv2) with group normalization (norm1 and norm2) and ReLU activations. The residual connection is implemented using a shortcut convolutional layer when the input and output channels are different, and an identity mapping otherwise. The dimension parameter determines whether 2D or 3D convolutional layers are used.

These changes align the residual block implementation with the architecture described in the text and shown in Figure 11(c). Please note that I made these updates based on the information provided in the text and the discrepancy you pointed out. It's always recommended to refer to the original paper or codebase for the most accurate and authoritative implementation details.


import torch.nn as nn
import torch.nn.functional as F


class ResBlock_Custom(nn.Module):
    def __init__(self, dimension, input_channels, output_channels):
        super().__init__()
        self.dimension = dimension
        self.input_channels = input_channels
        self.output_channels = output_channels
        # 3x3 convolutions, 2D or 3D depending on `dimension`
        if dimension == 2:
            self.conv1 = nn.Conv2d(self.input_channels, self.output_channels, 3, padding=1)
            self.conv2 = nn.Conv2d(self.output_channels, self.output_channels, 3, padding=1)
        elif dimension == 3:
            self.conv1 = nn.Conv3d(self.input_channels, self.output_channels, 3, padding=1)
            self.conv2 = nn.Conv3d(self.output_channels, self.output_channels, 3, padding=1)
        else:
            raise ValueError(f"dimension must be 2 or 3, got {dimension}")

        self.norm1 = nn.GroupNorm(num_groups=32, num_channels=self.output_channels)
        self.norm2 = nn.GroupNorm(num_groups=32, num_channels=self.output_channels)

        # 1x1 shortcut when the channel count changes, identity mapping otherwise
        if self.input_channels != self.output_channels:
            if dimension == 2:
                self.shortcut = nn.Conv2d(self.input_channels, self.output_channels, 1)
            else:
                self.shortcut = nn.Conv3d(self.input_channels, self.output_channels, 1)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        residual = self.shortcut(x)
        out = self.conv1(x)
        out = self.norm1(out)
        out = F.relu(out, inplace=True)
        out = self.conv2(out)
        out = self.norm2(out)
        out = out + residual
        out = F.relu(out, inplace=True)
        return out
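
Quick shape check (illustrative only): a 3D block with matching channels keeps every dim.

import torch

block = ResBlock_Custom(dimension=3, input_channels=96, output_channels=96)
out = block(torch.randn(1, 96, 16, 64, 64))
print(out.shape)  # torch.Size([1, 96, 16, 64, 64])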

I just pushed some updated code to redo Conv2d_WS.
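
For reference, a common weight-standardization formulation looks like this (just a sketch - the Conv2d_WS in the repo may differ):

import torch.nn as nn
import torch.nn.functional as F

# sketch of a weight-standardized Conv2d: weights are normalized to zero mean /
# unit std per output filter before the convolution
class Conv2d_WS(nn.Conv2d):
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)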

Kwentar commented 1 month ago

Thank you for your answer. I rechecked and I guess you are right - it's a typo in the article. For example, there is no "n" in Figure 11 (c), but there is one in Figure 10 (c), which is the one you implemented. Thanks for the bonuses :)

johndpope commented 1 month ago

I started a branch using the SamsungLabs ROME code - it has some of these models (gaze loss / perceptual loss / groupnorm etc.) - but I got stuck, so I've put that direction on hold.

I just want to get past the initial model setup. Help me make sense of this - is this diagram an error?

I've been stuck on this all day and been through about 4 variations.

[warp generator diagram]

the code looks OK - https://github.com/johndpope/MegaPortrait-hack/blob/main/model.py#L297 - but it just blows up on the reshape at this same line: https://github.com/Kevinfringe/MegaPortrait/blob/master/model.py#L376
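
What helps me narrow these down is asserting the shape right before the reshape (illustrative only - this assumes the usual (B, C*D, H, W) -> (B, C, D, H, W) lift with C=96, D=16; the failing line in that repo may be doing something different):

import torch

# illustrative only: lift 2D features (B, C*D, H, W) to volumetric (B, C, D, H, W),
# with C=96, D=16 assumed from the paper's v_s; assert before reshaping so the
# error message tells you which dimension is off
def lift_2d_to_3d(x: torch.Tensor, c: int = 96, d: int = 16) -> torch.Tensor:
    b, ch, h, w = x.shape
    assert ch == c * d, f"expected {c * d} channels before reshape, got {ch}"
    return x.view(b, c, d, h, w)

print(lift_2d_to_3d(torch.randn(1, 1536, 64, 64)).shape)  # torch.Size([1, 96, 16, 64, 64])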

Let me know if your train.py runs. Here is a pip freeze from the env I'm running: https://gist.github.com/johndpope/8190f3e1a85d23125c7ac56ab5f301e4

Kwentar commented 1 month ago

I'm going through the article and your code and haven't dealt with the warp generator yet, so I'll answer about that later. But I have a few points of confusion right now:

1. I guess in Eapp there should be 3x(2xResBlock3D), but you have only 1x(2xResBlock3D): [image] With your solution, Vs has a dimension of 96x16x256x256, but in the article we have: [image] I suppose (not sure) there should be AvgPooling between ResBlock3D 1-2 and 3-4, and also between 3-4 and 5-6; thus we get 256-128-64 and end up with 96x16x64x64. For example (see also the shape check at the end of this comment):
    
    init:
    self.resblock3D_96_11 = ResBlock_Custom(dimension=3, input_channels=96, output_channels=96)
    self.resblock3D_96_12 = ResBlock_Custom(dimension=3, input_channels=96, output_channels=96)
    self.resblock3D_96_21 = ResBlock_Custom(dimension=3, input_channels=96, output_channels=96)
    self.resblock3D_96_22 = ResBlock_Custom(dimension=3, input_channels=96, output_channels=96)
    self.resblock3D_96_31 = ResBlock_Custom(dimension=3, input_channels=96, output_channels=96)
    self.resblock3D_96_32 = ResBlock_Custom(dimension=3, input_channels=96, output_channels=96)

    self.avgpool3d = nn.AvgPool3d(kernel_size=5, stride=(1, 2, 2), padding=2)

    forward:
    vs = self.resblock3D_96_11(vs)
    vs = self.resblock3D_96_12(vs)
    vs = self.avgpool3d(vs)
    vs = self.resblock3D_96_21(vs)
    vs = self.resblock3D_96_22(vs)
    vs = self.avgpool3d(vs)
    vs = self.resblock3D_96_31(vs)
    vs = self.resblock3D_96_32(vs)



It also doesn't look like an ideal solution (having so many resnet blocks with the same channel dim is unusual), but the dims work out here - what do you think?
2. You use [`self.conv = Conv2d_WS(3, 64, 7, stride=1, padding=3)`](https://github.com/johndpope/MegaPortrait-hack/blob/1e340971000a35dc3e17654ef811735cdd0eb246/model.py#L139) Conv2d_WS at the start of Eapp, but it looks like a plain Conv2d belongs here.
3. You use [50](https://github.com/johndpope/MegaPortrait-hack/blob/1e340971000a35dc3e17654ef811735cdd0eb246/model.py#L796) as the dimension of the expression vector, but the authors say:
![image](https://github.com/johndpope/MegaPortrait-hack/assets/910893/44226db0-7ae1-46eb-8d11-b31a9e51338d)
And in [3] we have:
![image](https://github.com/johndpope/MegaPortrait-hack/assets/910893/3bbb0660-e7b5-4efa-9ba4-f080b368f0f6)
I guess 256 would be more "canonical", or maybe you have some reason for 50? Also we have this:
![image](https://github.com/johndpope/MegaPortrait-hack/assets/910893/5866077c-6a97-4e0a-867c-5afb8af242de)
So, in order to sum these two vectors, I guess Zs should be the same size as Es (2048).

I'm not 100% sure about these points - it's more of a discussion. Your work really boosted my progress, thanks a lot!
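
Quick sanity check of the pooling schedule from point 1 (just verifying shapes with the AvgPool3d parameters above, nothing else):

import torch
import torch.nn as nn

# with kernel_size=5, stride=(1, 2, 2), padding=2 the depth stays at 16 while
# the spatial dims halve: 256 -> 128 -> 64
pool = nn.AvgPool3d(kernel_size=5, stride=(1, 2, 2), padding=2)
vs = torch.randn(1, 96, 16, 256, 256)
vs = pool(vs)      # (1, 96, 16, 128, 128)
vs = pool(vs)      # (1, 96, 16, 64, 64)
print(vs.shape)    # torch.Size([1, 96, 16, 64, 64])
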
johndpope commented 1 month ago

Commit 5b254ca3713e67da6c53da8d268def9571ce77ac had removed an assertion that divided the width by 4 (256 / 4 = 64); I restored it.

in supp. material - "Appearance encoder Eapp. The network consists of two parts. The first part produces a 4D tensor of volumetric features 𝑣𝑠 that represent the person’s appearance from the source image. It includes several residual blocks followed by average pooling. We reshape the resulting 2D features to 3D features and then apply several 3D residual blocks to compute the final volumetric representation. The scheme shown in Figure 9 (a). The second part produces a global descriptor e𝑠 that helps retain the appearance of the output image. We use a ResNet-50 architecture with custom residual blocks. The architecture of our residual block can be seen in Figure 11 (c), where 𝑛 denotes the dimension of a convolutional layer (either 2D or 3D) and x denotes the number of output channels."
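
My rough reading of that description as a skeleton (a sketch only, reusing ResBlock_Custom from above - the channel counts, the pooling schedule and the 96x16 split are my assumptions from this thread, not the authors' code):

import torch
import torch.nn as nn
import torchvision

# rough skeleton of Eapp as I read the quoted description (assumptions: two AvgPool2d,
# 1536 = 96 * 16 channels before the 2D->3D reshape, stock ResNet-50 standing in for
# the custom-block version, 2048-d e_s)
class EappSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # part 1: 2D residual blocks + average pooling, reshape to 3D, then 3D residual blocks
        self.conv_in = nn.Conv2d(3, 64, 7, stride=1, padding=3)
        self.res2d = nn.Sequential(
            ResBlock_Custom(dimension=2, input_channels=64, output_channels=256), nn.AvgPool2d(2),
            ResBlock_Custom(dimension=2, input_channels=256, output_channels=512), nn.AvgPool2d(2),
            ResBlock_Custom(dimension=2, input_channels=512, output_channels=1536),
        )
        self.res3d = nn.Sequential(
            ResBlock_Custom(dimension=3, input_channels=96, output_channels=96),
            ResBlock_Custom(dimension=3, input_channels=96, output_channels=96),
        )
        # part 2: global descriptor e_s from a ResNet-50 trunk (fc dropped -> 2048-d)
        trunk = torchvision.models.resnet50(weights=None)
        trunk.fc = nn.Identity()
        self.resnet50 = trunk

    def forward(self, x):                         # x: (B, 3, 256, 256)
        f = self.res2d(self.conv_in(x))           # (B, 1536, 64, 64)
        b, _, h, w = f.shape
        vs = self.res3d(f.view(b, 96, 16, h, w))  # (B, 96, 16, 64, 64)
        es = self.resnet50(x)                     # (B, 2048)
        return vs, es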

I'll check further later.

JZArray commented 1 month ago

Have you started training this model? I found that my generated results cannot close the eyes when the driving ID's eyes are closed.

Kwentar commented 1 month ago

@JZArray could you share your code? We still have some questions :)

johndpope commented 1 month ago

@JZArray - tracking the eye closure is probably the best bet - https://github.com/SamsungLabs/rome/blob/2ee06861f018f2ab3c0f28cb40b4633bf2e6d657/src/rome_full.py#L143

check my train.py

I've yet to test this out - I'm stuck on some of the architecture.


# vanilla gazeloss using mediapipe

from typing import Union, List
import cv2
import mediapipe as mp
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPGazeLoss(object):
    def __init__(self, device):
        self.device = device
        self.face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1, min_detection_confidence=0.5)

    def forward(self, predicted_gaze, target_gaze, face_image):
        # Convert face image from tensor (C, H, W, assumed RGB in [0, 1]) to a numpy array
        face_image = face_image.detach().cpu().numpy().transpose(1, 2, 0)
        face_image = (face_image * 255).astype(np.uint8)

        # Extract eye landmarks using MediaPipe (FaceMesh expects RGB, so no BGR conversion)
        results = self.face_mesh.process(face_image)
        if not results.multi_face_landmarks:
            return torch.tensor(0.0).to(self.device)

        # FACEMESH_LEFT_EYE / FACEMESH_RIGHT_EYE are sets of (start, end) index pairs,
        # so flatten them into unique landmark indices first
        left_eye_indices = sorted({i for pair in mp.solutions.face_mesh.FACEMESH_LEFT_EYE for i in pair})
        right_eye_indices = sorted({i for pair in mp.solutions.face_mesh.FACEMESH_RIGHT_EYE for i in pair})

        eye_landmarks = []
        for face_landmarks in results.multi_face_landmarks:
            left_eye_landmarks = [face_landmarks.landmark[idx] for idx in left_eye_indices]
            right_eye_landmarks = [face_landmarks.landmark[idx] for idx in right_eye_indices]
            eye_landmarks.append((left_eye_landmarks, right_eye_landmarks))

        # Compute loss for each eye
        loss = 0.0
        h, w = face_image.shape[:2]
        for left_eye, right_eye in eye_landmarks:
            # Convert landmarks to pixel coordinates
            left_eye_pixels = np.array([(int(lm.x * w), int(lm.y * h)) for lm in left_eye], dtype=np.int32)
            right_eye_pixels = np.array([(int(lm.x * w), int(lm.y * h)) for lm in right_eye], dtype=np.int32)

            # Create eye masks as numpy arrays (cv2.fillConvexPoly cannot draw into torch tensors);
            # the landmark indices are not in contour order, so take the convex hull first
            left_mask_np = np.zeros((h, w), dtype=np.float32)
            right_mask_np = np.zeros((h, w), dtype=np.float32)
            cv2.fillConvexPoly(left_mask_np, cv2.convexHull(left_eye_pixels), 1.0)
            cv2.fillConvexPoly(right_mask_np, cv2.convexHull(right_eye_pixels), 1.0)
            left_mask = torch.from_numpy(left_mask_np).unsqueeze(0).to(self.device)
            right_mask = torch.from_numpy(right_mask_np).unsqueeze(0).to(self.device)

            # Compute gaze loss for each eye
            left_gaze_loss = F.mse_loss(predicted_gaze * left_mask, target_gaze * left_mask)
            right_gaze_loss = F.mse_loss(predicted_gaze * right_mask, target_gaze * right_mask)
            loss += left_gaze_loss + right_gaze_loss

        return loss / len(eye_landmarks)

In this updated code:

- We define a GazeBlinkLoss module that combines the gaze and blink prediction tasks.
- The module consists of a backbone network (VGG-16), a keypoint network, a gaze prediction head, and a blink prediction head.
- The backbone network is used to extract features from the left and right eye images separately. The features are then summed to obtain a combined eye representation.
- The keypoint network takes the 2D keypoints as input and produces a latent vector of size 64.
- The gaze prediction head takes the concatenated eye features and keypoint features as input and predicts the gaze direction.
- The blink prediction head takes only the eye features as input and predicts the blink probability.
- The gaze loss is computed using both MAE and MSE losses, weighted by w_mae and w_mse, respectively.
- The blink loss is computed using binary cross-entropy loss.
- The total loss is the sum of the gaze loss and blink loss.

To train this model, you can follow the training procedure you described:

- Use the Adam optimizer with the specified hyperparameters.
- Train for 60 epochs with a batch size of 64.
- Use the one-cycle learning rate schedule.
- Treat the predictions from RT-GENE and RT-BENE as ground truth.
Note that you'll need to preprocess your data to provide the left eye image, right eye image, 2D keypoints, target gaze, and target blink for each sample during training.
This code provides a starting point for aligning the MediaPipe-based gaze and blink loss with the approach you described. You may need to make further adjustments based on your specific dataset and requirements. A minimal training-loop sketch following the procedure above is included after the class below.

import cv2
import mediapipe as mp
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class GazeBlinkLoss(nn.Module):
    def __init__(self, device, w_mae=15, w_mse=10):
        super(GazeBlinkLoss, self).__init__()
        self.device = device
        self.face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1, min_detection_confidence=0.5)
        self.w_mae = w_mae
        self.w_mse = w_mse

        self.backbone = self._create_backbone()
        self.keypoint_net = self._create_keypoint_net()
        self.gaze_head = self._create_gaze_head()
        self.blink_head = self._create_blink_head()

    def _create_backbone(self):
        # VGG-16 trunk with the classifier replaced by a 256-d projection, so the
        # feature sizes match the heads below (gaze head: 256 + 64 = 320, blink head: 256)
        model = torchvision.models.vgg16(pretrained=True)
        model.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 256),
            nn.ReLU()
        )
        return model

    def _create_keypoint_net(self):
        return nn.Sequential(
            nn.Linear(136, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU()
        )

    def _create_gaze_head(self):
        return nn.Sequential(
            nn.Linear(320, 256),
            nn.ReLU(),
            nn.Linear(256, 2)
        )

    def _create_blink_head(self):
        return nn.Sequential(
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 1)
        )

    def forward(self, left_eye, right_eye, keypoints, target_gaze, target_blink):
        # Extract eye features using the backbone
        left_features = self.backbone(left_eye)
        right_features = self.backbone(right_eye)
        eye_features = left_features + right_features

        # Extract keypoint features
        keypoint_features = self.keypoint_net(keypoints)

        # Predict gaze
        gaze_input = torch.cat((eye_features, keypoint_features), dim=1)
        predicted_gaze = self.gaze_head(gaze_input)

        # Predict blink
        predicted_blink = self.blink_head(eye_features)

        # Compute gaze loss
        gaze_mae_loss = nn.L1Loss()(predicted_gaze, target_gaze)
        gaze_mse_loss = nn.MSELoss()(predicted_gaze, target_gaze)
        gaze_loss = self.w_mae * gaze_mae_loss + self.w_mse * gaze_mse_loss

        # Compute blink loss
        blink_loss = nn.BCEWithLogitsLoss()(predicted_blink, target_blink)

        # Total loss
        total_loss = gaze_loss + blink_loss

        return total_loss, predicted_gaze, predicted_blink
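
And a minimal training-loop sketch for the procedure above (Adam + one-cycle LR, 60 epochs, batch size 64) - train_loader, its fields, and the learning rates are placeholders, not a tested pipeline:

import torch

# assumes a train_loader yielding (left_eye, right_eye, keypoints, target_gaze, target_blink)
# batches, with RT-GENE / RT-BENE predictions used as ground truth
model = GazeBlinkLoss(device="cuda").to("cuda")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # lr values are placeholders
epochs = 60
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=len(train_loader))

for epoch in range(epochs):
    for left_eye, right_eye, keypoints, target_gaze, target_blink in train_loader:
        optimizer.zero_grad()
        loss, _, _ = model(left_eye.to("cuda"), right_eye.to("cuda"), keypoints.to("cuda"),
                           target_gaze.to("cuda"), target_blink.to("cuda"))
        loss.backward()
        optimizer.step()
        scheduler.step()
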
JZArray commented 1 month ago

@johndpope I just followed the diagram in MegaPortraits to rebuild my warp model. I think the warp model, at least, is clear - not as problematic compared with the generators.