johndpope / IMF

Implicit Motion Function - (unofficial) Microsoft recreation
https://openaccess.thecvf.com/content/CVPR2024/papers/Gao_Implicit_Motion_Function_CVPR_2024_paper.pdf

Is the paper reproducible? #25

Closed andyl-flwls closed 1 month ago

andyl-flwls commented 2 months ago

Appreciate your hard work. I have checked some of the running samples, but I am not quite sure whether IMF is reproducible. Would you share some more details?

johndpope commented 2 months ago

Hi Andy,

You may be able to help with fresh eyes. I was in limbo for a week waiting for confirmation on the model architecture from Microsoft and lost momentum. https://github.com/hologerry/IMF/issues/4

The main branch has an error in the model; this branch has critical fixes - https://github.com/johndpope/IMF/blob/laced-yogurt-1708/model.py#L243

https://github.com/johndpope/IMF/pull/26 - this is the model working, but... https://wandb.ai/snoozie/IMF/runs/l0hko9nn?nw=nwusersnoozie

(screenshot)

I did all this extra work to get the codebase to train via StyleGAN2-ADA / EMA - but according to the author, this is not necessary. There's no noise injection. If I had known this before, it could have saved me a thousand test runs - but you live and learn. There's also no gradient clipping in their version.

When training finishes the first video, it falls over and training stops.

Following this, I looked at the codebase and attempted to bypass GAN training altogether (StyleGAN training would be necessary when introducing the token manipulation https://github.com/johndpope/IMF/blob/9009deafd9443a5df57d7b51ba355ce7533eee94/model.py#L358). You will see here that this has both a GAN version and another version with no discriminator that just optimises an LPIPS perceptual loss: https://github.com/johndpope/IMF/blob/laced-yogurt-1708/train.py
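For illustration, a minimal sketch of what the discriminator-free path boils down to - reconstruct the current frame and optimise LPIPS plus a pixel loss only (the model call and frame names here are placeholders, not the actual train.py API):

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net='vgg').cuda()  # expects images roughly in [-1, 1]

def no_gan_step(model, optimizer, ref_frame, cur_frame,
                lambda_lpips=10.0, lambda_pixel=10.0):
    optimizer.zero_grad()
    recon = model(ref_frame, cur_frame)               # reconstruct current frame from reference + latent
    loss_lpips = perceptual(recon, cur_frame).mean()  # perceptual distance
    loss_pixel = F.l1_loss(recon, cur_frame)          # plain pixel loss
    loss = lambda_lpips * loss_lpips + lambda_pixel * loss_pixel
    loss.backward()
    optimizer.step()
    return loss.item()
```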

The crazy thing is, either way (GAN or no-GAN), when I train on the one video it cycles 5 loops and quality goes way up - but when the video changes, the model collapses: there's no gradient flow and no new images can be reproduced. I started monitoring images each loop to inspect the training. I think this could be fixed by moving on to another video sooner, but I just hit a wall. I don't know why it breaks, and the gradient clipping that was helping before didn't help here. My contract the other week came to an end and I've been working on this in my spare time, so I need to focus on getting another job.

I rebuilt this paper at least 2 times. I'm not convinced redoing it with the rosinality stylegan2 codebase will fix things - that code is only 70kb, very small. It's possible, but I'm experiencing the same problem with or without StyleGAN.

@tanshuai0219 has been looking to recreate this paper using LIA.

Side note - when I was looking at this model and working with my 3090 GPU / VRAM requirements, it occurred to me that there could be an opportunity to create a patch grid at the front of the architecture - without changing the model at all, just superficially sticking a 2x2 grid (4 patches) in front of the entire architecture.
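A minimal sketch of that patch-grid idea - split each frame into a 2x2 grid, push each patch pair through the untouched model, and stitch the outputs back together (assumes the model preserves spatial resolution; `model` is a placeholder):

```python
import torch

def patch_grid_forward(model, ref_frame, cur_frame, grid=2):
    b, c, h, w = ref_frame.shape
    ph, pw = h // grid, w // grid
    rows = []
    for i in range(grid):
        cols = []
        for j in range(grid):
            sl = (slice(None), slice(None),
                  slice(i * ph, (i + 1) * ph),
                  slice(j * pw, (j + 1) * pw))
            cols.append(model(ref_frame[sl], cur_frame[sl]))  # unchanged model, smaller input
        rows.append(torch.cat(cols, dim=3))   # stitch columns (width)
    return torch.cat(rows, dim=2)             # stitch rows (height)
```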

johndpope commented 2 months ago

So I throw out a week or two's work and switch back to some foundational code.

https://github.com/johndpope/IMF/pull/27

The other branches kept falling into a hole after the video changed.

I don't know why - but anyway - this is working (for now).

(screenshot)

The other problem I had was that after about 60 epochs the model had been disintegrating...

BUT - I update the mixed precision to be off, and hopefully that helps, as it was blowing the losses up into the 20,000s to start with - now the losses are contained... Epoch 1/1000: 239it [01:34, 2.54it/s, G Loss=1.9744, D Loss=0.1657]

this is the latest run here - https://wandb.ai/snoozie/IMF/runs/l4u8rtii?nw=nwusersnoozie
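For reference to the mixed-precision change above, this is roughly the switch involved - gating AMP behind a flag so it can be turned off when it inflates the losses (a sketch; the names are placeholders, not the actual trainer):

```python
import torch

use_amp = False  # mixed precision off - it was blowing the losses up early in training
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(model, optimizer, ref_frame, cur_frame, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        recon = model(ref_frame, cur_frame)
        loss = loss_fn(recon, cur_frame)
    scaler.scale(loss).backward()  # no-op scaling when AMP is disabled
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```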

UPDATE - The jury is out on whether or not it's working... (screenshot)

I thought it was broken (media 10), but the next frame, 12, is working... (screenshot)

I've seen runs go downhill after a few hours...

UPDATE

In that branch I restore another file, resblocks.py (the imports are not included).

These residual network blocks pretty much EXACTLY match the document / specifications from Microsoft. The training is not using them, but I was faffing around switching them in - it's only after 60 epochs that the model was falling apart.

At 42 here, it looks good without them. (screenshot)

I want the quality to be better. (screenshot)

andypinxinliu commented 2 months ago

hi,

based on the current result, I still feel like it is not working properly. The reconstruction is not based on the driving source but more likely on the source image. The difference is very small, and it seems that if the difference between the source and driving is large, the result gets more blurry. Recently I have been working on TPS, and TPS can train very fast for the explicit motion change given the driving image, though it can be a little blurry for some facial regions. Also, I am not sure if the batch size is the thing that matters here. In general, StyleGAN is a model that needs a batch size of at least 32 for training.

johndpope commented 2 months ago

I mostly agree - there's no driving image per se; it's not MegaPortraits https://github.com/johndpope/MegaPortrait-hack

It's using the reference image, then interpreting the current image as a tiny compressed version (32 bytes), and then recreating that current image. It's more of a codec-style compression - only you can then hot-swap in different latent codes.
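Very roughly, and purely as a sketch of that description (these class names are hypothetical, not the repo's actual modules):

```python
import torch.nn as nn

class IMFSketch(nn.Module):
    """Codec-style flow: rich reference features + tiny per-frame latent -> reconstruction."""
    def __init__(self, appearance_enc, latent_enc, motion_fn, decoder):
        super().__init__()
        self.appearance_enc = appearance_enc  # dense features from the reference frame
        self.latent_enc = latent_enc          # compresses a frame to a tiny latent token
        self.motion_fn = motion_fn            # implicit motion function conditioned on both latents
        self.decoder = decoder                # recreates the current frame

    def forward(self, ref_frame, cur_frame):
        feats = self.appearance_enc(ref_frame)
        t_ref = self.latent_enc(ref_frame)    # reference latent
        t_cur = self.latent_enc(cur_frame)    # the tiny "codec payload" for the current frame
        aligned = self.motion_fn(feats, t_ref, t_cur)
        return self.decoder(aligned)          # swap t_cur for an edited latent to manipulate the output
```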

I have it with batch size 2, training on a 3090. I have access to Vertex and can fire up an A100 40GB and throw some compute at it - I just want the model finalized...

UPDATE - have a look at the supplementary material of the paper - they showcase different latents being swapped in.

UPDATE - it blows up from 64 onwards... it must be a resnet thing... I don't know where, I don't know why... (screenshot)

FYI - I introduce a script, run.sh, that commits changes to align with the wandb test run (which records the git commit) - it's kind of an atomic transaction so the test can be recreated. This was from experience where I would get a good run, attempt to redo it, and it would fail. The other thing I found when chopping and changing code / switching branches is to freshly clear out the cache as a sanity check:

    echo "Clearing __pycache__"
    rm -rf __pycache__

(screenshot)

Then if you do a git checkout, it should align to the test run (but sometimes it doesn't - I don't know if it's my GPU failing me?). (screenshot)

UPDATE - I thought maybe my GPU was flaky, but I rerun on an A100 this morning and the model trips up. I introduced some gradient monitoring in one of the branches to abort if there's no gradient flow. I don't know why it's doing this... the first image is training, then the next sample is a black image / etc... https://wandb.ai/snoozie/IMF/runs/z0e2thv6?nw=nwusersnoozie
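A minimal sketch of that kind of gradient check - scan parameter gradients after backward() and abort when they vanish or go non-finite (the threshold is illustrative):

```python
import torch

def check_gradient_flow(model, min_norm=1e-8):
    """Return False if gradients have vanished or gone non-finite."""
    total_sq = 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.detach()
        if not torch.isfinite(g).all():
            print(f"non-finite gradient in {name}")
            return False
        total_sq += g.norm().item() ** 2
    if total_sq ** 0.5 < min_norm:
        print(f"gradient flow has collapsed (total norm {total_sq ** 0.5:.2e})")
        return False
    return True

# after loss.backward():
#     if not check_gradient_flow(generator):
#         raise RuntimeError("aborting run - no gradient flow")
```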

There was some crazy thing going on in train.py with the losses, so I redo this - it's not giving me the clearest pictures, but it's not collapsing to a black image on the next video... https://github.com/johndpope/IMF/pull/28

johndpope commented 2 months ago

Good news -

I finally found a problem with my video dataset where the source images were blank, so some batches were contaminating the training - that would help explain the random red / blue / black images propagating above...

https://wandb.ai/snoozie/IMF/runs/zvj8lbuu?nw=nwusersnoozie so this is good news - I was going crazy... The other thing is that in previous runs the model fell apart after 60 steps; that doesn't seem to be happening here.
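A minimal sketch of the sanity check that catches those blank source frames before they reach a batch (the threshold and resampling strategy are arbitrary):

```python
import torch

def is_blank_frame(frame: torch.Tensor, std_threshold=0.01) -> bool:
    """Flag frames that are (near-)constant - black, solid red/blue, etc."""
    return frame.float().std().item() < std_threshold

# in the dataset's __getitem__, skip contaminated samples:
#     if is_blank_frame(source) or is_blank_frame(target):
#         return self.__getitem__((idx + 1) % len(self))
```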

I'm still seeing some defects in the eyes - I read this can be a side effect of the GAN loss. From the paper, the weights are set to 10 perceptual / 10 pixel loss / 1 GAN loss.

(screenshot)
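For reference, those paper weights combine roughly like this (a sketch; the individual loss terms are assumed to be computed elsewhere):

```python
# loss weights from the paper: 10 * perceptual + 10 * pixel + 1 * adversarial
lambda_perceptual, lambda_pixel, lambda_adv = 10.0, 10.0, 1.0

def weighted_generator_loss(loss_perceptual, loss_pixel, loss_adv):
    # later in this thread these get re-weighted (perceptual 10, pixel 5, adversarial 0.5)
    return (lambda_perceptual * loss_perceptual
            + lambda_pixel * loss_pixel
            + lambda_adv * loss_adv)
```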

To increase quality, you can push out the video repeats - understanding that this is overfitting the model.

I had num_frames set to 200 frames for the above test run - it will just repeat frames if there aren't enough in the mp4.


from torch.utils.data import Dataset

class WebVid10M(Dataset):
    def __init__(
            self,
            video_folder,
            sample_size=256,
            num_frames=300):  # fixed number of frames per video; frames repeat if the mp4 is shorter
        self.video_folder = video_folder
        self.sample_size = sample_size
        self.num_frames = num_frames

Then add the video editing functionality... upload some mp4s to Google storage and throw some compute at it.

UPDATE

I ramp up the LPIPS perceptual loss 10x, set pixel loss = 5 and adversarial GAN loss = 0.5, and increase the video repeat to 2.

UPDATE - I cancel training; it looks OK, but quality is still missing. This is in the main branch and should be stable.

I play around in the resnet branch....

(I think the resblocks updates will make a difference - yet to drop these in.)

So ReduceLROnPlateau is not the best scheduler for a GAN - I throw in CosineAnnealingLR... and see what happens.


        self.scheduler_g = ReduceLROnPlateau(self.optimizer_g, mode='min', factor=0.5, patience=5, verbose=True)
        self.scheduler_d = ReduceLROnPlateau(self.optimizer_d, mode='min', factor=0.5, patience=5, verbose=True)
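A minimal sketch of the swap (T_max / eta_min are placeholder values, not tuned):

```python
from torch.optim.lr_scheduler import CosineAnnealingLR

# replace the plateau schedulers with cosine annealing for both optimizers
self.scheduler_g = CosineAnnealingLR(self.optimizer_g, T_max=100, eta_min=1e-6)
self.scheduler_d = CosineAnnealingLR(self.optimizer_d, T_max=100, eta_min=1e-6)

# note: CosineAnnealingLR.step() takes no metric, unlike ReduceLROnPlateau.step(val_loss)
```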

Can also turn off gradient clipping - they supposedly don't use it...

(This might fall over - yet to successfully do a test run with the new resnet code: https://wandb.ai/snoozie/IMF/runs/d8jdtcjh) https://github.com/johndpope/IMF/tree/fix/resnet

UPDATE - so the resnet changes caused a blue screen on image 55. I switch this back to use the original resnet code.

In the StyledConv, I noticed I was using an inferior modulation - I update it to align with the stylegan2 implementation by lucidrains.

https://wandb.ai/snoozie/IMF/runs/li4m8pc7?nw=nwusersnoozie
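For context, StyleGAN2-style modulation/demodulation boils down to something like this (a generic sketch, not the exact lucidrains code):

```python
import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, style, demodulate=True, eps=1e-8):
    # x: (B, C_in, H, W), weight: (C_out, C_in, k, k), style: (B, C_in)
    b, c_in, h, w = x.shape
    c_out, _, kh, kw = weight.shape
    w1 = weight.unsqueeze(0) * style.view(b, 1, c_in, 1, 1)  # modulate input channels per sample
    if demodulate:
        # normalise so each output channel has unit expected variance
        d = torch.rsqrt(w1.pow(2).sum(dim=(2, 3, 4), keepdim=True) + eps)
        w1 = w1 * d
    # grouped-conv trick: fold the batch into groups so each sample gets its own weights
    x = x.reshape(1, b * c_in, h, w)
    w1 = w1.reshape(b * c_out, c_in, kh, kw)
    out = F.conv2d(x, w1, padding=kh // 2, groups=b)
    return out.reshape(b, c_out, *out.shape[2:])
```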

N.B. I switch off the gradient clipping and then suddenly start seeing the blue screen - it may be that, not the new resnet code.

UPDATE - I push the iteration to 5 watches / cycles per video to see how much better quality is possible. As the training progresses it gets better at recreating scenes - new ones throw it. https://wandb.ai/snoozie/IMF/runs/x2tkyhhf

https://github.com/johndpope/IMF/pull/30

This normalizing plays with the light gray of the image: transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], inplace=True)

There's some code to unnormalize this when saving images to wandb, but it's sometimes playing up...
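A minimal sketch of what that unnormalize step needs to do before logging, assuming the ImageNet mean/std above:

```python
import torch

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def unnormalize(x: torch.Tensor) -> torch.Tensor:
    """Invert transforms.Normalize so sample images render correctly in wandb."""
    x = x.detach().cpu() * IMAGENET_STD + IMAGENET_MEAN
    return x.clamp(0, 1)
```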

UPDATE

I introduce this branch to help overcome problematic training steps / exploding gradients. https://wandb.ai/snoozie/IMF/runs/fyk2cfti?nw=nwusersnoozie

https://github.com/johndpope/IMF/tree/fix/profile-step re-running training now.

johndpope commented 1 month ago

FYI - I switch in a bunch of resblocks / conv layers from LIA.

https://github.com/johndpope/IMF/pull/31

https://raw.githubusercontent.com/wyhsirius/LIA/db7a2e974a177eb8470e500527f940987199ad76/networks/encoder.py

These directly align with comments made here by hologerry: https://github.com/hologerry/IMF/issues/4

The ModulatedConv2d / StyledConv inside resblocks.py comes from the lucidrains PyTorch stylegan2 implementation.

My wandb is not correctly sampling images (not sure why), but after a few minutes I'm seeing promising results... (screenshot)

This should 100% align now. (screenshot)

There's also some fused LeakyReLU code that's been added (from the LIA code).

UPDATE - may have fixed the sampling here... testing now: https://wandb.ai/snoozie/IMF/table?nw=nwusersnoozie

UPDATE - Sep 2: looking more closely at the LIA code, they have their own StyledConv / modulated conv, so I replace this code as well: https://wandb.ai/snoozie/IMF/runs/01vvfows?nw=nwusersnoozie

this looks encouraging at 26 - https://wandb.ai/snoozie/IMF/runs/01vvfows?nw=nwusersnoozie

johndpope commented 1 month ago

I think from my testing the results are actually mint - it needs more training / compute / different videos... I close this ticket and open this other one around mode collapse / the blue recreated image...

https://github.com/johndpope/IMF/issues/34

Also, let me know what you want to see next: https://github.com/johndpope/IMF/issues/32

I was thinking of doing the 3DMM alignment - may be able to hot-wire this into the training: https://github.com/PowerHouseMan/ComfyUI-AdvancedLivePortrait

But actually I'm kinda excited to play with SAM2 in conjunction with this codebase / architecture. The idea would be to take the masks and augment the image generation with these higher-level latents - like a lazy implicit motion function, where it's only redrawing the moving part...

UPDATE - I was seeing the mode collapse issue this morning. I refactor some things to use the original discriminator from the paper - still breaking... https://wandb.ai/snoozie/IMF/runs/nh3zc28s?nw=nwusersnoozie

UPDATE - September 4th: @andyl-flwls, a bit of a breakthrough - I had some code that was adjusting the learning rate manually, and this was conflicting with the automatic learning rate scheduling. I switch this off and also adjust where the steppers step - so far so good. That, plus some detection for problematic images in the dataset.
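Roughly, the fix amounts to letting one mechanism own the learning rate and stepping things in the standard order (a generic sketch, not the exact train.py):

```python
# no manual edits to optimizer.param_groups - the scheduler owns the learning rate
for epoch in range(num_epochs):
    for ref_frame, cur_frame in dataloader:
        optimizer_g.zero_grad()
        loss_g = generator_loss(model, ref_frame, cur_frame)  # placeholder loss fn
        loss_g.backward()
        optimizer_g.step()   # optimizer steps every batch
    scheduler_g.step()       # scheduler steps once per epoch, after the optimizer
```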

OK - this is the latest training. In the previous code I had it iterating each sample x5 to boost the quality; in this one I opt for using epochs instead, so it should do the same thing. It's currently set to 1000 epochs on 35,000 videos - could take years... but by epoch 5 it should amount to the same 5 repeats per video. https://wandb.ai/snoozie/IMF/runs/892ufzr3?nw=nwusersnoozie

Looking good to fire up proper GPU compute now... could take a week or two on an A100 cluster.

johndpope commented 1 month ago

Unless this training blows up, I'm calling the paper (partially) reproduced. Again, the quality should boost on the second / third / fourth epoch. https://wandb.ai/snoozie/IMF/runs/892ufzr3?nw=nwusersnoozie

I'll do the token editing once we choose what to do... age / pose / style transfer? https://github.com/johndpope/IMF/issues/32

Kinda interested to make it like Emote - maybe audio?

https://www.youtube.com/watch?v=lR3YwRMuaYQ

Just realized the above current video dataset is a subset of 100 videos, not the full 35,000 from CelebV-HQ.

UPDATE - The training did blow up at 20,000 steps - the log is in the above wandb run. Strangely, the recreations start off on the right path recreating the current image but then inexplicably skew back to the reference image. I'll check tomorrow.

UPDATE - 6th September: I run some overfitting training on 1 video of Selena Gomez (she's the first video in my celebhq dataset) - still running... https://wandb.ai/snoozie/IMF/runs/xscj3hjo?nw=nwusersnoozie

When I inspect, sometimes it's (gradually) bang on and it's recreating the face successfully. (screenshot)

But sometimes - here the middle image is looking right (the reference image is also looking right), yet the recreated image is completely off. (screenshot)

I attempt to override this with a face loss bias (factoring in eye / mouth / nose positions): https://github.com/johndpope/IMF/tree/feat/face_loss_bias

It helped me with the eye loss the other day, boosting some images while I was troubleshooting resnet. The authors didn't design this model specifically for talking heads - they focus on image recreation - but that's with endless compute.
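A minimal sketch of the idea - up-weighting the reconstruction loss inside a face-region mask built from the eye / mouth / nose positions (the mask source is left abstract here; the actual branch may differ):

```python
import torch

def face_biased_l1(recon, target, face_mask, face_weight=5.0):
    """face_mask: (B, 1, H, W) in [0, 1], e.g. rendered from facial landmarks."""
    per_pixel = (recon - target).abs()
    weights = 1.0 + (face_weight - 1.0) * face_mask  # background weight 1, face region weight face_weight
    return (per_pixel * weights).mean()
```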

It may make sense to do some image flipping in the image augmentations to speed up convergence for talking heads...

The other thought is redoing frames where there's a sort of miss and it gets the recreation completely wrong... The existing losses should eventually fix things, but if her head is facing right, the recreated image should follow suit... so redo that frame 1000x to fix it (sketched below)...

(screenshot)
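Sketching that frame-redo idea as a simple replay buffer over per-frame reconstruction losses (the names and heap size are hypothetical):

```python
import heapq

class WorstFrameBuffer:
    """Keep the k frames with the highest reconstruction loss so they can be replayed."""
    def __init__(self, k=32):
        self.k = k
        self.heap = []  # min-heap of (loss, frame_idx); smallest loss sits at the top

    def update(self, frame_idx, loss):
        item = (loss, frame_idx)
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, item)
        elif loss > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)

    def worst_frames(self):
        return [idx for _, idx in sorted(self.heap, reverse=True)]
```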

(Because the training above is overfitting to the one video, the latent encoder may not be getting the best signature of the reference... it should perform better when trained across more videos.)

UPDATE - this seems fine to me - going to stop training... (screenshot)

I need to figure out what went wrong at 20,000 steps that broke training before...

Using that trained (overfitted) checkpoint, I switch back to 1 video and kick off this training... https://wandb.ai/snoozie/IMF/runs/bzqo89ha

I push some code to redo the worst frames during training based off face reconstruction loss: https://github.com/johndpope/IMF/tree/feat/face_loss_bias_worst_frames