johndpope / MegaPortrait-hack

Using Claude Opus to reverse engineer code from MegaPortraits: One-shot Megapixel Neural Head Avatars
https://arxiv.org/abs/2207.07621

HeadPose net - with resnet18 backbone / pitch / yaw / roll #19

Closed johndpope closed 3 weeks ago

johndpope commented 4 weeks ago

so I did some digging and found this paper from 2022: https://arxiv.org/pdf/2210.13705

It spells out exactly how to do this - I recreated it here: https://github.com/johndpope/HPENet-hack

But the model needs training, and while looking for an eval set I was led here - frankly this looks much better: https://github.com/thohemp/6drepnet

So I will rewire the HeadPose net to just use this instead.


import cv2
from sixdrepnet import SixDRepNet

# Create model
# Weights are automatically downloaded
model = SixDRepNet()
img = cv2.imread('/path/to/image.jpg')
pitch, yaw, roll = model.predict(img)

UPDATE

but it doesn't support translations.... https://github.com/search?q=repo%3Athohemp%2F6DRepNet+translation&type=discussions

robinchm commented 4 weeks ago

I think we need to train the custom resnet18 in order to predict translation. Hopenet applies a series of augmentations during training that do not alter yaw, roll and pitch (except for flipping), but I think they do alter translation. Any idea whether some code applies the augmentations correctly for translation?
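
To make the question concrete, here is a minimal sketch (my own illustration, not hopenet's or this repo's actual code; the sign and normalisation conventions are assumptions) of how an augmentation step could keep the translation labels consistent: a horizontal flip negates tx along with yaw/roll, a shifted crop window moves tx/ty, and resizing a tighter crop rescales the apparent depth tz.

import random
import cv2

def augment_with_pose_labels(img, angles, translation, out_size=224):
    # Hypothetical helper: flip/crop/resize an image and update the
    # (yaw, pitch, roll) and (tx, ty, tz) labels so they stay consistent.
    yaw, pitch, roll = angles
    tx, ty, tz = translation
    h, w = img.shape[:2]

    # Horizontal flip: mirrors yaw and roll, negates horizontal translation.
    if random.random() < 0.5:
        img = cv2.flip(img, 1)
        yaw, roll, tx = -yaw, -roll, -tx

    # Random crop: shifting the crop window shifts tx/ty the opposite way.
    scale = random.uniform(0.7, 1.0)        # tighter crop -> head appears closer
    crop = int(min(h, w) * scale)
    x0 = random.randint(0, w - crop)
    y0 = random.randint(0, h - crop)
    img = img[y0:y0 + crop, x0:x0 + crop]
    cx_shift = (x0 + crop / 2) - w / 2      # crop-centre offset in pixels
    cy_shift = (y0 + crop / 2) - h / 2
    tx -= cx_shift / crop                   # normalised by crop size (assumption)
    ty -= cy_shift / crop

    # Resize to the network input: a tighter crop magnifies the head,
    # which is roughly equivalent to a smaller depth tz.
    img = cv2.resize(img, (out_size, out_size))
    tz *= scale

    return img, (yaw, pitch, roll), (tx, ty, tz)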

johndpope commented 4 weeks ago

I don't know yet about translation.

I did follow the paper and added a crop-and-warp function for augmenting training. I'll work on this some more tomorrow. https://github.com/johndpope/MegaPortrait-hack/blob/feat/14-training/train_base.py#L51

johndpope commented 3 weeks ago

When the warp and crop is applied it is center-aligned, according to the paper. That's how I've coded it most recently in the training fork - now merged. I think translation may not be a problem.

robinchm commented 3 weeks ago

@johndpope

In the MegaPortraits paper it says:

We use a pre-trained network to estimate head rotation data, but the latent expression vectors z s/d and the warpings to and from the canonical coordinate space are trained without direct supervision.

...

The head pose prediction network is pre-trained, while the expression prediction network is trained from scratch.

It seems the network should be pretrained and frozen during the training of Gbase. There is no mention of the architecture of this head pose estimator, but the paper says it's inspired by "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing".

In the referenced paper, the module is designed the same way as hopenet, except for the output heads. It also has a loss that uses the pretrained hopenet to generate ground truth for the rotation angles, but not for translation. I assume that in the referenced paper this module is trained from scratch.
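
For reference, that rotation supervision can be sketched like this (a minimal illustration under my own naming, not the referenced paper's actual code): the pretrained hopenet is frozen and its predicted angles act as pseudo ground truth for the new pose head, while translation gets no such teacher signal.

import torch
import torch.nn.functional as F

def pose_distillation_loss(pose_head, frozen_hopenet, images):
    # pose_head is assumed to return (angles [B, 3], translation [B, 3]).
    pred_angles, _pred_translation = pose_head(images)
    with torch.no_grad():
        teacher_angles = frozen_hopenet(images)   # (B, 3) pseudo ground truth
    # L1 on the angles only; translation has no teacher here.
    return F.l1_loss(pred_angles, teacher_angles)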

Now the problem is how to obtain the "pretrained" resnet18 module that predicts rotation and translation. We can:

johndpope commented 3 weeks ago

So for that particular part - after inspecting the bad results for yaw/pitch/roll - I replaced it with the off-the-shelf SixDRepNet.

It's possible to freeze this using .eval() after instantiation - but I'm not saving that model... https://github.com/johndpope/MegaPortrait-hack/blob/main/mysixdrepnet.py#L796
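
For reference, a minimal sketch of what freezing would look like (my own illustration, not the repo's exact code): .eval() only fixes batch-norm/dropout behaviour, so turning off gradients is the part that actually keeps the weights fixed during Gbase training.

from torchvision.models import resnet18

head_pose_net = resnet18(pretrained=True)
head_pose_net.eval()                      # fix batch-norm / dropout behaviour
for p in head_pose_net.parameters():
    p.requires_grad_(False)               # keep the pretrained weights frozen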

I attempted to extract the translation using this custom model, but failed... so there is still work to do. It apparently needs retraining. What I did was augment the 2 data sources to get proper rotation parameters.

import logging
import torch
import torch.nn as nn
from torchvision.models import resnet18
# SixDRepNet_Detector, device, FEATURE_SIZE and COMPRESS_DIM come from the
# surrounding repo code (mysixdrepnet.py and the model config).

class Emtn(nn.Module):
    def __init__(self):
        super().__init__()
        # https://github.com/johndpope/MegaPortrait-hack/issues/19
        # yaw/pitch/roll come from the off-the-shelf SixDRepNet instead
        self.head_pose_net = resnet18(pretrained=True).to(device)
        self.head_pose_net.fc = nn.Linear(self.head_pose_net.fc.in_features, 6).to(device)  # 6 = rotation + translation parameters
        self.rotation_net = SixDRepNet_Detector()

        model = resnet18(pretrained=False, num_classes=512).to(device)  # feature_maps = resnet18(input_image) -> should print torch.Size([1, 512, 7, 7])
        # Remove the fully connected layer and the adaptive average pooling layer
        self.expression_net = nn.Sequential(*list(model.children())[:-1])
        self.expression_net.adaptive_pool = nn.AdaptiveAvgPool2d(FEATURE_SIZE)  # https://github.com/neeek2303/MegaPortraits/issues/3
        # self.expression_net.adaptive_pool = nn.AdaptiveAvgPool2d((7, 7))  # OPTIONAL 🤷 - 16x16 is better?

        ## TODO 2
        outputs = COMPRESS_DIM  # 512, to make the later WarpS2C operation convenient (512 -> 2048 channels)
        self.fc = torch.nn.Linear(2048, outputs)

    def forward(self, x):
        # Forward pass through the off-the-shelf head pose network (rotation only)
        rotations, _ = self.rotation_net.predict(x)
        logging.debug(f"📐 rotation :{rotations}")

        head_pose = self.head_pose_net(x)

        # Split head pose into rotation and translation parameters
        # rotation = head_pose[:, :3]  # gave poor results - superseded by rotation_net above
        translation = head_pose[:, 3:]

        # Forward pass image through expression network
        expression_resnet = self.expression_net(x)
        ### TODO 2
        expression_flatten = torch.flatten(expression_resnet, start_dim=1)
        expression = self.fc(expression_flatten)  # (bs, 2048) -> (bs, COMPRESS_DIM)

        return rotations, translation, expression
    # This encoder outputs head rotations R_s/d, translations t_s/d, and latent expression descriptors z_s/d

Consider that while training is underway there's a mask that dictates where the head should be drawn... so the network must, to some degree, learn where to draw the head from the source.
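
If that is the case, the placement signal would come from something like a masked reconstruction term; a minimal sketch under my own naming (not the repo's actual loss code):

import torch

def masked_l1_loss(pred, target, fg_mask):
    # L1 only inside the foreground/head mask, so the generator is penalised
    # for drawing the head in the wrong place. fg_mask is (B, 1, H, W) in [0, 1].
    diff = (pred - target).abs() * fg_mask
    return diff.sum() / fg_mask.sum().clamp(min=1.0)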

johndpope commented 3 weeks ago

I contacted the author - @chientv99 - and he sent this: https://github.com/chientv99/maskpose

[Screenshot 2024-06-04 at 11 24 43 AM]

Unfortunately the pretrained weights are missing. :(

robinchm commented 3 weeks ago

> I contacted the author - @chientv99 - and he sent this: https://github.com/chientv99/maskpose [Screenshot 2024-06-04 at 11 24 43 AM]
>
> Unfortunately the pretrained weights are missing. :(

I browsed the code and paper a bit. If I am not wrong, this project does not address translation at all.

I am pretraining a hopenet on resnet18, using the 300W-LP dataset. My observation is that the angles converge easily, but translation is much harder. Translation along x/y still seems to be converging, though slowly. Translation along z does not converge at all. This is probably because my augmentation does not crop aggressively enough (translation along z can only be modified by cropping + resizing).
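
For what it's worth, the setup I'm describing amounts to a resnet18 with a 6-value regression head and separate angle/translation loss terms; a minimal sketch under my own naming (the weighting and label conventions are assumptions):

import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class PoseNet(nn.Module):
    # resnet18 backbone regressing yaw/pitch/roll plus tx/ty/tz.
    def __init__(self):
        super().__init__()
        self.backbone = resnet18(pretrained=True)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 6)

    def forward(self, x):
        out = self.backbone(x)
        return out[:, :3], out[:, 3:]          # angles, translation

def pose_loss(pred_angles, pred_trans, gt_angles, gt_trans, trans_weight=1.0):
    # Angles converge quickly; translation (especially tz) may need its own weight.
    return (F.l1_loss(pred_angles, gt_angles)
            + trans_weight * F.l1_loss(pred_trans, gt_trans))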

@johndpope If you can contact the author, do you mind asking them which dataset they used to pretrain the model and how translation is predicted?

johndpope commented 3 weeks ago

He forwarded a message to a lady to find the weights. I'm pretty sure from the train script it's the same one you're using. With the preprocessing steps to get the images, there are some caveats:

MegaPortraits (MP):

  1. They don’t do backgrounds
  2. Shoulders are also off
  3. They recreate videos with matting and third party libraries

I'm working on gaze loss - it's converging, though my 3090 GPU is getting cooked…

Need some cloud compute. https://github.com/johndpope/MegaPortrait-hack/issues/36

robinchm commented 3 weeks ago

> He forwarded a message to a lady to find the weights. I'm pretty sure from the train script it's the same one you're using. With the preprocessing steps to get the images, there are some caveats:
>
> MegaPortraits (MP):
>
> 1. They don't do backgrounds
>
> 2. Shoulders are also off
>
> 3. They recreate videos with matting and third party libraries
>
> I'm working on gaze loss - it's converging, though my 3090 GPU is getting cooked…
>
> Need some cloud compute. #36

I don't quite get it - are you saying that the author also seems to be pretraining a hopenet with the 300W-LP dataset? Any detail on the design of the prediction head - the same as the original hopenet, or 6drepnet like what you implemented?

I understand that MegaPortraits needs matting and doesn't do backgrounds, but that should not affect how this module is pretrained.

And my translation is indeed converging, with the z axis slowest; I should have waited a bit longer.


johndpope commented 3 weeks ago

In the paper they say the head is centered and cropped (if I recall correctly), so shifting left/right shouldn't matter. I can extend the frame augmentation to include both a zoomed-in and a sweet-spot crop; the model should learn to handle both. My interest now is to work on VASA, which will disentangle with a transformer. The portrait code is going to drop, so this will be a completely academic exercise.

johndpope commented 3 weeks ago

This has different cropping from v-express findings - https://github.com/johndpope/MegaPortrait-hack/issues/36

johndpope commented 3 weeks ago

FYI - https://github.com/johndpope/MegaPortrait-hack/issues/36 - this part of the architecture may be redundant.

robinchm commented 2 weeks ago

Thanks a lot! The results in the referenced issue gave me a lot of confidence that warping is not a critical component. We can retrain with the module plugged back in later if it turns out that explicit control of the pose is needed (it's sort of good to have in my case, but not absolutely necessary).