johndpope / MegaPortrait-hack

Using Claude Opus to reverse engineer code from MegaPortraits: One-shot Megapixel Neural Head Avatars
https://arxiv.org/abs/2207.07621

Current results of training - epoch 4 #36

Open johndpope opened 3 weeks ago

johndpope commented 3 weeks ago

I used another of the videos as the driving video - and it's (almost) obviously not rotating the head past the point that the original movie reached - see below.

Screenshot from 2024-06-04 22-45-21

cross_reenacted_image_57

pred_frame_191

Tomorrow I plug in a bigger dataset.

UPDATE - https://github.com/johndpope/MegaPortrait-hack/pull/37

When I normalize the images I end up with this - it looks bad - so I added some code in train.py to un-normalize - happy with the current results....

1

FYI - this is the frame dump from the mp4 - head cropped / maybe some warping. Screenshot from 2024-06-04 23-11-08

johndpope commented 3 weeks ago

Screenshot from 2024-06-05 07-35-58 https://github.com/tencent-ailab/V-Express/blob/main/assets/crop_example.jpeg I adjusted the cropping to find a sweet spot.

https://github.com/johndpope/MegaPortrait-hack/pull/37

cross_reenacted_image_350

pred_frame_356

Jie-zju commented 3 weeks ago

I trained on data like VoxCeleb. Looking forward to more results!

johndpope commented 3 weeks ago

pred_frame_361 epoch 21 - it's converging.....

Jie-zju commented 3 weeks ago

So, as I mentioned before: loss on the face foreground?

johndpope commented 3 weeks ago

There are like 6-7 different losses - https://github.com/johndpope/MegaPortrait-hack/blob/main/train.py - I didn't do the gaze loss yet (I drafted it - hit a snag - need to take another look)

https://github.com/johndpope/MegaPortrait-hack/blob/ff9cf22b1be4093e63d9e9c02fcb83d77cca8c1d/model.py#L1885

johndpope commented 3 weeks ago

epoch 50 pred_frame_355

JZArray commented 3 weeks ago

epoch 50 pred_frame_355

Is this the self-reconstruction result?

JZArray commented 3 weeks ago

How do your reenactment results look on the eval dataset? BTW, how many IDs have you used to train this model?

johndpope commented 3 weeks ago

I need 2 years of GPU time to actually complete 200,000 epochs - I have a dataset here with 35,000 videos https://github.com/johndpope/MegaPortrait-hack/pull/37

For the augmentation I'm rendering every frame - but I hit a snag with different video lengths, so I'm only overfitting to 1 source video - 1 driving - 1 star source - 1 star driving. I don't really want to burn out my 3090 card - I'm looking at Vertex AI - the preprocessing to warp and crop is a significant time sink.

I'm exploring cheaper GPU training hacks to collapse the training timeline. My 3090 can spit out high-res, high-fidelity images with Stable Diffusion - so this code is useless to me if I can't train it.

https://github.com/johndpope/LadaGAN-pytorch

JZArray commented 3 weeks ago

OK, I see. When increasing the number of IDs, in my case ID leakage appears; not sure whether you also have the same problem with your code.

Kwentar commented 3 weeks ago

Hi, congratulations! I decided to change their pipeline and am currently far from the paper; in my experience:

  • Two losses are important: perceptual (I use the awesome LPIPS) and cross entropy on the Z vectors (the way to fix ID leakage)
  • We don't need Es at all
  • We don't need the rotation/translation warping operation; the result of the warping generator is enough
  • One grid_sample is enough (before g3d)

I am currently training on VoxCeleb2. Results: Drive: image Predicted drive: image Predicted S* based on drive: image

If you have questions or need details feel free to ask
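As a rough illustration of those two losses' mechanics (the lpips call is the library's documented usage; the CosFace-style head, label choice, and loss weighting are a sketch, not Kwentar's actual code):

import torch
import torch.nn as nn
import torch.nn.functional as F
import lpips  # pip install lpips

# LPIPS perceptual loss; the library expects image batches scaled to [-1, 1]
perceptual = lpips.LPIPS(net='vgg')

class CosFaceHead(nn.Module):
    """Cosine-margin cross entropy over class labels, applied to the flattened Z latents."""
    def __init__(self, dim, num_classes, s=30.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim))
        self.s, self.m = s, m

    def forward(self, z, labels):
        # cosine similarity between normalized latents and normalized class weights
        cos = F.linear(F.normalize(z, dim=-1), F.normalize(self.weight, dim=-1))
        margin = F.one_hot(labels, cos.size(1)) * self.m   # subtract the margin at the true class
        return F.cross_entropy(self.s * (cos - margin), labels)

def total_loss(pred, target, z, labels, cos_head, w_lpips=1.0, w_z=0.1):
    # z: (B, D) flattened motion latent; labels: (B,) class indices
    l_perc = perceptual(pred, target).mean()   # perceptual term on generated vs. ground-truth frames
    l_z = cos_head(z, labels)                  # cross entropy on the Z vectors
    return w_lpips * l_perc + w_z * l_z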

JZArray commented 3 weeks ago

@Kwentar Nice work! Could you tell us more about how you do the warp in your case? Or could you share your code?

johndpope commented 3 weeks ago

You may have more luck cherry-picking my warp and crop / remove-background code in EmoDataset - I'm saving out an npz file for faster iterations next run. They say in the paper they don't do backgrounds. Did you look at gaze loss? With epochs - my code is saying cycle through a short video of 90+ frames. Is that fair? Or is that 90 epochs?? They mention batch size 16 - is that 16 frames of video in total? The more training I do on a single video, the better it gets. What amount of training did you get to? What GPU compute do you have? Are you training in the cloud? Azure / AWS? The video dataset / code from EMOPortraits is going to render this codebase obsolete - what are your motives? Academic or commercial? I have some videos from VoxCeleb2 -

How big are your checkpoints? Presumably you're saving the discriminator? This is PatchGAN / CycleGAN based - would my training results dramatically improve if you just shared that? I was only able to get where I am thanks to @kevinFringe - in his cycle consistency loss he used a concatenation of 2 images - not sure this is necessary / desirable. It's in the main branch.

I'm interested in plugging in novel architectures - the VASA stuff - but there are also others. How about you? This architecture can't do audio - is that important to you?

Share code if you can.

Regarding throwing out the rotation / translation - I kinda see why this would still work (being more like a face swapper) - but have a look at this video - https://github.com/michaildoukas/headGAN/issues/10 - where the control of the target's head pose is completely disentangled. The MS VASA I think had this capability too. There are many libraries / repos doing this (video A drives video B) - AniPortrait is outstanding - but the magic with this architecture is the high frame rate - real-time control - I'll switch back to VASA (which needs to be completely rebuilt) in a little bit.

https://github.com/johndpope/VASA-1-hack

Kwentar commented 3 weeks ago

@johndpope I am not ready to share code because it is unreadable :D

  • I didn't do anything with the data yet, I just use the VoxCeleb2 academic dataset without any processing or augmentation (closest plans: remove backgrounds and add losses on eyes and mouths)
  • Currently I have only two losses: LPIPS and CosFace on the Zs; I will add more losses later
  • About epochs -- don't worry, it really doesn't matter, it is just convention; my "epoch" means one pair from each video, so my batch is 12*8 (I have 8 GPUs) and an epoch is 11000 iters. It is still training; the images above are at 34000 iters (3+ epochs)
  • What are your motives? -- I do it for commercial purposes, and as I said I used MegaPortraits only as a start
  • How big are your checkpoints? Presumably you're saving the discriminator? -- I still have no success with the discriminator and I don't use it; checkpoints are around 500MB for all networks (g2d is the biggest at 300MB)
  • Novel architectures / audio -- yes, I came to MegaPortraits after VASA, so audio is important, and it is quite easy -- all we need is an Eaudio with the same output as Emtn. Currently I am trying to move this task to a diffusion network, but have no success here yet

Warp Generator:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock3D(nn.Module):
    def __init__(self, input_channels, output_channels, padding=1):
        super().__init__()
        self.conv1 = nn.Conv3d(input_channels, output_channels, 3, padding=padding)
        self.conv2 = nn.Conv3d(output_channels, output_channels, 3, padding=padding)
        if input_channels != output_channels:
            # 1x1x1 projection so the residual matches the output channel count
            self.shortcut = nn.Sequential(
                nn.Conv3d(input_channels, output_channels, kernel_size=1),
                nn.GroupNorm(num_channels=output_channels, num_groups=32)
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        residual = self.shortcut(x)
        out = self.conv1(x)
        out = F.group_norm(out, num_groups=32)  # parameter-free group norm
        out = F.relu(out)
        out = self.conv2(out)
        out = F.group_norm(out, num_groups=32)
        out = out + residual
        out = F.relu(out)
        return out

class WarpGenerator(nn.Module):
    def __init__(self, input_channels):
        super(WarpGenerator, self).__init__()

        # 1x1 conv lifts the concatenated motion latents to 2048 channels,
        # which are then reshaped into a 512-channel volume with depth 4
        self.conv1 = nn.Conv2d(in_channels=input_channels, out_channels=2048, kernel_size=1, padding=0, stride=1)
        self.resblock1 = ResBlock3D(512, 256)
        self.resblock2 = ResBlock3D(256, 128)
        self.resblock3 = ResBlock3D(128, 64)
        self.resblock4 = ResBlock3D(64, 32)

        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)
        self.conv2 = nn.Conv3d(32, 3, kernel_size=3, padding=1)

    def forward(self, zs_es):
        x = self.conv1(zs_es)
        x = x.view(x.size(0), 512, 4, x.size(2), x.size(3))  # (B, 2048, H, W) -> (B, 512, 4, H, W)

        # progressively upsample: depth x4 and spatial dims x16 overall
        x = self.resblock1(x)
        x = F.interpolate(x, scale_factor=(2, 2, 2), mode='trilinear', align_corners=True)
        x = self.resblock2(x)
        x = F.interpolate(x, scale_factor=(2, 2, 2), mode='trilinear', align_corners=True)
        x = self.resblock3(x)
        x = F.interpolate(x, scale_factor=(1, 2, 2), mode='trilinear', align_corners=True)
        x = self.resblock4(x)
        x = F.interpolate(x, scale_factor=(1, 2, 2), mode='trilinear', align_corners=True)

        x = self.gn(x)
        x = F.relu(x, inplace=True)
        x = self.conv2(x)
        x = torch.tanh(x)  # 3-channel flow field in [-1, 1] for grid_sample
        return x

GBase Inference:

motion_latent_source = z_network(source)   # z_s: motion/expression latent of the source image
motion_latent_drive = z_network(drive)     # z_d: motion/expression latent of the driver image
volume_source = model_app(source)          # Eapp: 3D appearance volume of the source

# warp field predicted from the negated source latent concatenated with the drive latent
Wem_source = warping_generator(torch.cat([-motion_latent_source, motion_latent_drive], dim=1))
g3d_input = F.grid_sample(volume_source, Wem_source.permute(0, 2, 3, 4, 1), align_corners=True)

volume_generated = g3d(g3d_input)          # 3D refinement of the warped volume
x_generated = g2d(volume_generated)        # project the volume to the output image
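For reference, a minimal shape check of the snippet above, reusing the WarpGenerator class as posted and assuming each motion latent has shape [B, 512, 2, 2] (the layout Kwentar mentions further down in the thread):

import torch
import torch.nn.functional as F

B = 2
z_s = torch.randn(B, 512, 2, 2)   # assumed source motion latent
z_d = torch.randn(B, 512, 2, 2)   # assumed driver motion latent

warp_gen = WarpGenerator(input_channels=1024)           # 512 + 512 concatenated channels
w_em = warp_gen(torch.cat([-z_s, z_d], dim=1))
print(w_em.shape)                                       # torch.Size([2, 3, 16, 32, 32])

# grid_sample takes its output D/H/W from the grid, so the appearance volume
# needs to already live on a 16x32x32 grid to keep its shape after warping
volume = torch.randn(B, 96, 16, 32, 32)
warped = F.grid_sample(volume, w_em.permute(0, 2, 3, 4, 1), align_corners=True)
print(warped.shape)                                     # torch.Size([2, 96, 16, 32, 32])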
JZArray commented 3 weeks ago

@Kwentar Thanks for sharing the information. Could you explain the motivation for removing the rotation and translation and warping only once? It loses the ability to control the head pose explicitly. Have you tried following the paper exactly, and how were the results?

Kwentar commented 3 weeks ago

@JZArray

hazard-10 commented 2 weeks ago

Judging from these preliminary results, it seems like RT-BENE for gaze loss isn't necessary at all?

hazard-10 commented 2 weeks ago

@Kwentar Hey great work! Do you mind sharing a bit more on hardware usage - GPU spec (SXM/PCIe), VRAM, and training time per iter / epoch? And roughly how many epochs are you planning to train with 8 GPUs for convergence before transferring that to VASA?

johndpope commented 2 weeks ago

There's a cross re-enactment image that gets spat out. Quality is low at 50 epochs. The eyes aren't lining up - this code has mpgazeloss using MediaPipe, which may do the job. But in the other ticket I describe preparing the data differently for eye blinking - not trivial. Also I want the preprocessing of videos to happen 1000x faster - it's taking 5 mins per video https://github.com/dmlc/decord/issues/302

The main branch should run with training as-is - let me know if it doesn't. There's a feature branch I'm stabilising to get more videos / IDs - though I hit a bump.

UPDATE - fyi @Kwentar - diffusion + talking - https://github.com/tencent-ailab/V-Express/issues/6 ( no training code)

JZArray commented 2 weeks ago

@Kwentar hallo, a quick question about F.grid_sample: don't you need to first interpolate Wem_source [batch, 3, 4, 16, 16] to the same spatial shape as volume_source [batch, 96, 16, 32, 32]? Otherwise, after the F.grid_sample operation, the shape of volume_source is changed to [batch, 96, 4, 16, 16] and cannot be correctly processed by the following g3d/g2d modules, which expect input of shape [batch, 96, 16, 32, 32].
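For illustration, one way to reconcile those shapes is to trilinearly resize the predicted flow field to the appearance volume's grid before grid_sample (a sketch of that workaround; Kwentar's answer below instead enlarges the latent so the warp field already comes out at [batch, 3, 16, 32, 32]):

import torch
import torch.nn.functional as F

B = 2
volume_source = torch.randn(B, 96, 16, 32, 32)   # appearance volume
w_em = torch.randn(B, 3, 4, 16, 16)              # warp field predicted on a smaller grid

# resize the 3-channel flow volume to the appearance volume's (D, H, W) before sampling
w_em = F.interpolate(w_em, size=volume_source.shape[2:], mode='trilinear', align_corners=True)

warped = F.grid_sample(volume_source, w_em.permute(0, 2, 3, 4, 1), align_corners=True)
print(warped.shape)   # torch.Size([2, 96, 16, 32, 32]) -- shape preserved for g3d/g2d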

johndpope commented 2 weeks ago

This may not be the specific case here, but I found this discrepancy creeps in with different image input sizes, 256 vs 512. My code in main only handles 512 atm.

JackAILab commented 2 weeks ago

epoch 50 pred_frame_355

hi @johndpope, may I ask if your current results are the in-domain results saved during training? Have you tried testing at inference time with some out-of-domain data?

My current model structure is mostly consistent with yours. Unfortunately, the visualization results saved during my training process are very good, but the results of the inference process (one epoch) are very bad. It may be that the number of epochs I trained is not enough, or there is a problem with the model structure described in the paper. @Kwentar can you share more experience regarding your current results?

Currently, I use 2024 song videos and 2880 speech videos from the RAVDESS data (as S* frames). After one epoch, loss_perceptual converges from 150 to 73.3125.

epoch0 -> epoch1 in training process (bs is set to 6, and 6 images are output side by side.) cross_reenacted_0

cross_reenacted_40

cross_reenacted_79

cross_reenacted_40_0

cross_reenacted_83_55

epoch 1 in the inference process (the first is the original source image, the second is the driving image, and the last is the output result; it has not converged.)

output_25_epoch3_0 output_51_epoch3_0

JZArray commented 2 weeks ago

@JackAILab Are the 4th and 5th rows the self-reconstruction results on the training set?

JackAILab commented 2 weeks ago

@JZArray yes

JZArray commented 2 weeks ago

Kind of weird - your self-reconstruction results on the training dataset look very good (not sure whether your model is overfitting, but it seems impossible for overfitting to happen in one epoch). Is your loss still decreasing, or has it already converged?

johndpope commented 2 weeks ago

Screenshot from 2024-06-06 18-52-11 if you run the train.py on main - you should see these images in output_images folder.

there's a cross_reenacted_image_1 - this is outside the training set (using the source* / driving*) - it only got to 50 epochs. My PR had some smarts to load up the 35,000 videos from ffhq - but it's not working - it may be a trivial fix to just align all the frames from the 4 movies to, say, 100 frames; at the moment it's accommodating every movie length. https://github.com/johndpope/MegaPortrait-hack/pull/37 cross_reenacted_image_1

for clarity, there are a couple of json files in this repo - the overfitting one just hard-codes things to Selena Gomez - this is the first video from this torrent - https://academictorrents.com/details/843b5adb0358124d388c4e9836654c246b988ff4

https://github.com/johndpope/Emote-hack/issues/1

Screenshot from 2024-06-06 18-58-37

just point to json_file: './data/celebvhq_info.json' <- 35,000 videos. Screenshot from 2024-06-06 18-58-19 But there is a bug with this.

in my EmoDataset code I'm using warp and crop to achieve the effect described in the paper; it will also remove the backgrounds.

I document here https://github.com/johndpope/MegaPortrait-hack/blob/main/EmoDataset.md

Screenshot from 2024-06-06 19-02-17

you will see -> an npz file that will quickly reload the numpy arrays for the video without needing to reload images / run transforms (speeding things up). Screenshot from 2024-06-06 19-02-32

I would be happy to accept a PR for this to work with VoxCeleb2 - I wasn't able to get my hands on the dataset.
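For anyone reimplementing that caching step, a minimal sketch of the npz idea (the real code is in EmoDataset in the repo; the function name and paths below are illustrative):

import os
import numpy as np
from PIL import Image

def load_video_frames(video_id, frames_dir, cache_dir="./cache"):
    """Load preprocessed frames for one video, caching them as a single .npz file.

    First call: reads the cropped / background-removed PNGs and writes the npz.
    Later calls: reload the numpy arrays directly, skipping image decoding and transforms.
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, f"{video_id}.npz")

    if os.path.exists(cache_path):
        return np.load(cache_path)["frames"]

    frame_files = sorted(f for f in os.listdir(frames_dir) if f.endswith(".png"))
    frames = np.stack([
        np.asarray(Image.open(os.path.join(frames_dir, f)).convert("RGB"))
        for f in frame_files
    ])
    np.savez_compressed(cache_path, frames=frames)
    return frames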

JackAILab commented 2 weeks ago

Thanks! @johndpope I used multiple driving* videos, not a single one: 2880 speech videos from the RAVDESS data. Using a single video may lead to overfitting.

As a result, I think it has not converged yet @JZArray . I think loss_perceptual should converge to less than 20. I am using 8 A100 GPUs to continuously train and optimize the model structure.

I want to ensure that after loading the trained model, the inference process uses a source IMG and a driving IMG to obtain relatively ideal results.

At least it proves that Emtn is an effective module.

Kwentar commented 2 weeks ago

@Kwentar hallo, a quick question about F.grid_sample: don't you need to first interpolate Wem_source to the same shape as volume_source? [...]

@JZArray Yes, this is one of the problems with the source article; to fix it I give Z the dimensions [512, 2, 2], and as a result Wem_source is [batch, 3, 16, 32, 32].

@Kwentar Hey great work! Do you mind sharing a bit more on hardware usage and roughly how many epochs you are planning to train with 8 GPUs? [...]

@hazard-10 I have a cloud devbox with 8xA100; epoch time is ~6 hours (an epoch is ~11k iters). I have not finished the experiments yet because I'm changing a lot of things, but I guess 100k iters is more than enough.

@JackAILab I guess you don't have enough data; for example, VoxCeleb has more than 100k videos. Also, your loss is too big (or is it weighted?). Mine is lower than 1.

johndpope commented 2 weeks ago

I'll probably rip out decord tomorrow - got a few alternatives from someone doing videos at scale: https://github.com/dmlc/decord/issues/283

JZArray commented 2 weeks ago

@Kwentar Could you please make the code available?

JZArray commented 2 weeks ago

@Kwentar Hi, did you make any other modifications compared with the original paper? Currently we have followed your instructions for the warping module and implemented the other modules following the paper. But the predicted results cannot capture small movements in local areas, especially lip movement and eye blinks. I am wondering if the resolution of the appearance feature, which we use at 32x32, is too small; any suggestions? BTW, we train on 256x256 input.

johndpope commented 2 weeks ago

@JZArray - I built a lip loss function for VASA some months back - it hasn't been tested - https://github.com/johndpope/VASA-1-hack/blob/main/train.py#L146 - but it's built on top of MediaPipe, so I expect good results. I will be plugging it in soon - just wait till early next week. I've been focusing on preprocessing - found some multicore code to quickly process through frames.
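For context, a rough sketch of a MediaPipe-based lip-landmark loss of the kind described (this is not the VASA-1-hack code; the landmark selection and weighting are illustrative, and MediaPipe's landmark extraction is not differentiable, so in practice it serves as a monitoring metric or needs a differentiable landmark proxy):

import numpy as np
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh
# lip-region vertex indices, taken from MediaPipe's FACEMESH_LIPS connectivity
LIP_IDX = sorted({i for edge in mp_face_mesh.FACEMESH_LIPS for i in edge})

def lip_landmarks(rgb_image, face_mesh):
    """Return (N, 2) normalized lip landmark coordinates, or None if no face is found."""
    result = face_mesh.process(rgb_image)  # rgb_image: HxWx3 uint8 RGB array
    if not result.multi_face_landmarks:
        return None
    lms = result.multi_face_landmarks[0].landmark
    return np.array([[lms[i].x, lms[i].y] for i in LIP_IDX])

def lip_loss(pred_rgb, drive_rgb):
    """Mean L2 distance between lip landmarks of the predicted and driving frames."""
    with mp_face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as fm:
        pred = lip_landmarks(pred_rgb, fm)
        drive = lip_landmarks(drive_rgb, fm)
    if pred is None or drive is None:
        return 0.0  # skip frames where detection fails
    return float(np.mean(np.linalg.norm(pred - drive, axis=-1)))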

@JackAILab - can you confirm which code is not working for you? Are you using the most recent main code?

fyi - in my latest branch I turned on a green screen for the background - https://github.com/johndpope/MegaPortrait-hack/tree/feat/38-multicore - what does everyone think about this? It's going to make it easier to switch out the background once the model is trained - if we use white / black it will be harder to quickly remove the background - with green you can key it out on the fly.
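For illustration, a minimal chroma-key sketch of that on-the-fly green removal (the threshold and masking are illustrative, not code from the repo):

import numpy as np

def key_out_green(rgb, tol=60):
    """Rough chroma key: mask pixels where green clearly dominates red and blue.

    rgb: HxWx3 uint8 frame rendered on the green background.
    Returns the frame with the keyed background zeroed out, plus the foreground mask.
    """
    r, g, b = rgb[..., 0].astype(int), rgb[..., 1].astype(int), rgb[..., 2].astype(int)
    background = (g - np.maximum(r, b)) > tol
    out = rgb.copy()
    out[background] = 0
    return out, ~background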

ChenyangWang95 commented 2 weeks ago

I used 1000 videos from celebaVox2 to train the model at 256*256 resolution, but I find that the losses don't converge. Does this seem reasonable? image

The results below are at 142,840 iterations. The face seems blurry and the details are lost. image

Following @Kwentar, I added LPIPS loss to train the model to get a clearer facial image. Not sure if it works yet.

johndpope commented 2 weeks ago

@ChenyangWang95 make sure your branch is up to date - https://github.com/ChenyangWang95/MegaPortrait-hack/blob/main/train.py#L217 - this training code was overhauled completely (a huge chunk of logic was not implemented around the cycle consistency loss / negative / positive pairs - the paper mentions that this is important - https://github.com/johndpope/MegaPortrait-hack/issues/32) https://github.com/johndpope/MegaPortrait-hack/blob/main/train.py#L204
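For reference, one generic way to write that kind of positive/negative pair constraint on the latent descriptors (a cosine-margin hinge sketch; the repo's train.py and the paper remain the source of truth for the exact formulation):

import torch
import torch.nn.functional as F

def pairwise_cosine_loss(anchor, positive, negatives, margin=0.2):
    """Pull the anchor descriptor towards its positive and push it away from negatives.

    anchor, positive: (B, D) descriptors that should match.
    negatives: (B, K, D) descriptors that should not match.
    """
    pos = F.cosine_similarity(anchor, positive, dim=-1)                 # (B,)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1)   # (B, K)
    # hinge: every negative similarity should sit at least `margin` below the positive one
    return F.relu(neg - pos.unsqueeze(1) + margin).mean()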

the paper also says they don't do backgrounds - the loss is ONLY computed on foregrounds - so it's necessary to preprocess the images IMO to get the same results - they actually say they basically do post-processing to create videos. My EmoDataset loader does some extra steps to rip out the background and zoom in on the face. From the Selena picture above - and the Elephant Man-like recreated image (is this epoch 50 or 200,000?) - I'm confident that this will actually converge with longer training - it's just that I don't have the GPU bandwidth (it will take 2 years on a 3090). The other observation is that maybe there's a faster way to leak a secret pseudo ground truth video - e.g. if I just run the driving video through ReActor with Selena Gomez's face swapped in https://github.com/Gourieff/sd-webui-reactor/tree/main - it would have better geometry - so maybe that could collapse the training time.

UPDATE - I added LPIPS loss on this branch https://github.com/johndpope/MegaPortrait-hack/tree/feat/26-auditflops - it takes 10 mins to cut out the backgrounds / crop faces - and 8 x A100 ain't gonna speed that up - https://github.com/danielgatis/rembg/issues/632 https://github.com/johndpope/MegaPortrait-hack/issues/38

Running from main, you should see these folders populated with images + an npz file - once we have the npz, subsequent training runs will be fast.

Screenshot 2024-06-09 at 7 52 44 am

UPDATE: maybe we can use CuPy to get images to the GPU - but we still need to get the libraries compatible with this. https://cupy.dev/ I cut out Pillow here (WIP) - https://github.com/johndpope/MegaPortrait-hack/tree/feat/38-multicore-no-pil

johndpope commented 2 weeks ago

fyi - anyone actually training might want to check this out https://github.com/johndpope/MegaPortrait-hack/issues/41

flyingshan commented 2 weeks ago

@JZArray Have you solved this? I encountered the same problem: the eyes won't close (but closed eyes can open). Did you train the model end to end as Kwentar did, or control the pose explicitly?

JZArray commented 2 weeks ago

@flyingshan Not yet; we are still sticking with our own model and haven't tried his. Also, may I ask how your model's generalization looks: can it generalize well to unseen IDs, i.e. does ID leakage appear?

flyingshan commented 2 weeks ago

I have implemented the cycle loss mentioned in the paper, but I want to solve the expression-fitting problem first, so this loss is not included in training yet. To generalize to unseen IDs, I think it is important to have a large dataset with various IDs; I think VoxCeleb2 with 6k IDs will do.

JZArray commented 2 weeks ago

@flyingshan OK. Did you use the model provided by @johndpope, or implement it yourself? I also implemented the cycle loss but haven't used it yet, because I first want the model to generalize well to unseen IDs on self-reenactment, and then pay more attention to cross-ID reenactment. Otherwise I think it makes no sense to use the cycle loss when the model cannot even do self-reenactment on seen/unseen IDs.

flyingshan commented 2 weeks ago

I implemented G3d myself according to the paper. The warping module is implemented according to Kwentar. For Eapp/G2d, I adopted the structure from face-vid2vid and modified some parts to make them look like the structure in MegaPortraits.

JZArray commented 2 weeks ago

@flyingshan can you share some visual results here if possible?

flyingshan commented 2 weeks ago

I only got some visualizations in the evaluation process. [source/prediction/drive]

JZArray commented 2 weeks ago

@flyingshan we got similar results hhhh

JZArray commented 2 weeks ago

@flyingshan Forgot to ask: how many IDs are you using now? And in your evaluation results, did those IDs also appear in the training dataset?

flyingshan commented 2 weeks ago

@JZArray I have not really counted yet; it may be around hundreds. The IDs in evaluation may appear in the training set.

coachqiao2018 commented 2 weeks ago

The results look like mine. The lip and mouth areas are not driven well. I followed the paper to implement it. image

JZArray commented 2 weeks ago

Your background looks good! How many IDs did you use for training to get these results, and in your evaluation results, did those IDs also appear in the training dataset? Also, have you tried reenactment between different IDs - does ID leakage appear? (Could you share your code if possible?)

coachqiao2018 commented 2 weeks ago

These results are sampled during training; I didn't perform quantitative evaluations for now, because the mouth and eye areas are bad and the code needs more improvements. About 40,000 IDs for training, not from the VoxCeleb dataset. I plan to reprocess our dataset and remove backgrounds to train the model.

JZArray commented 2 weeks ago

How long did you train to get these results? Could you share some details about how you implement the warp operations?

coachqiao2018 commented 2 weeks ago

About 200 epochs, and one week. The warp operation corresponding to the above results is similar to johndpope's, but I predict a 6D pose and use tanh to constrain the rotation and translation. image (copied from pytorch3d). This is my attempt, following EMOPortrait.
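For reference, the 6D-to-rotation-matrix conversion being referred to (equivalent to pytorch3d.transforms.rotation_6d_to_matrix; reproduced here as a sketch, with the tanh-constrained translation shown only as a comment):

import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """Convert a (..., 6) continuous 6D rotation representation to (..., 3, 3) matrices
    via Gram-Schmidt (Zhou et al.), matching pytorch3d.transforms.rotation_6d_to_matrix."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack((b1, b2, b3), dim=-2)

# translation head, constrained with tanh as described, e.g.:
# t = max_shift * torch.tanh(raw_translation)   # max_shift is a chosen bound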