johndpope / SPEAK-hack

Using Claude Sonnet to reverse engineer paper Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation
https://arxiv.org/pdf/2405.07257

training checkpoint - 5500 (1 hour on 3090) #1

Open johndpope opened 5 months ago

johndpope commented 5 months ago


Screenshot from 2024-06-28 00-43-05

I had to rework the generator to use fewer layers and to resize images to 64x64.
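The downscaling step can be sketched with a plain `F.interpolate` call (a minimal sketch, not the repo's actual preprocessing; tensor names and sizes are illustrative):

```python
# Sketch: shrink a full-resolution batch to 64x64 before the generator
# to cut memory and compute. Illustrative only, not the repo's code.
import torch
import torch.nn.functional as F

batch = torch.rand(4, 3, 512, 512)   # dummy 512x512 RGB batch
small = F.interpolate(batch, size=(64, 64),
                      mode="bilinear", align_corners=False)
print(small.shape)  # torch.Size([4, 3, 64, 64])
```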

Screenshot from 2024-06-28 00-45-17

johndpope commented 5 months ago

@fenghe12 / @JaLnYn / @ChenyangWang95
this might actually work.

In MegaPortraits I use a custom ResNet-50; it's probably safer to switch that in here, because otherwise the model may just discard the updates. I'll check in the morning.

francqz31 commented 5 months ago

@johndpope is it just me, or is Sonnet 3.5's machine learning code output actually way more readable than Opus's? Feels like actual working code this time!!

johndpope commented 5 months ago

Something may not be quite right. I trained overnight and this is still epoch 0:

checkpoint-86500 recon_step_86500

Screenshot from 2024-06-28 07-49-49

I changed the code back to use 512x512, resumed training, and got this.

recon_step_87000

I'm seeing newer, clearer images advancing in epoch 1, even after a few more cycles; will update here later. I think by epoch 4 it's probably going to be fairly decent.

I added some TensorBoard logging to surface the losses.
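Surfacing losses in TensorBoard usually amounts to an `add_scalar` call per loss each step. A minimal sketch (tag names, log dir, and loss values here are made up, not taken from the repo):

```python
# Sketch: log per-step losses to TensorBoard via SummaryWriter.
# Tags and values are illustrative.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/speak")

def log_losses(step, recon_loss, gan_loss):
    writer.add_scalar("loss/recon", recon_loss, step)
    writer.add_scalar("loss/gan", gan_loss, step)

log_losses(100, 0.42, 1.3)   # then: tensorboard --logdir runs
writer.close()
```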

recon_step_126000.png

UPDATE - my bad, I was overfitting to one image. I just pushed an updated dataloader. New debug image: debug_step_164000

Starting training again. I was seeing OOM errors; check your num_workers. debug_step_168000
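Each DataLoader worker is a separate process with its own copy of the dataset pipeline, so too many workers can exhaust RAM. A minimal sketch (dataset and sizes are dummies, not the repo's):

```python
# Sketch: num_workers controls how many loader processes run in
# parallel; lower it if you hit OOM. Dataset here is a dummy.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.rand(100, 3, 64, 64))

loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=2,    # reduce this if memory runs out
    pin_memory=True,  # speeds up host-to-GPU copies
)

(batch,) = next(iter(loader))
print(batch.shape)  # torch.Size([8, 3, 64, 64])
```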

UPDATE - I restarted training and changed the generator to use resblocks; maybe that will help it recreate the image better.
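A residual block keeps the identity path intact and learns only a correction, which tends to help reconstruction. A minimal sketch (channel counts and norm choice are assumptions, not taken from the repo):

```python
# Sketch of a residual block for a generator: two convs with a skip
# connection. Channels/norm here are illustrative choices.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        # identity is preserved; the block learns the residual
        return torch.relu(x + self.body(x))

x = torch.rand(1, 64, 32, 32)
print(ResBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```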

Screenshot from 2024-06-28 16-40-33

debug_step_4000

UPDATE - Sunday. I rebuilt the code to do progressive training with resolution upscaling (64, 128, 256, 512) and added TensorBoard losses. Screenshot from 2024-06-30 05-43-40
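The progressive schedule can be sketched as a loop over growing resolutions, downscaling each full-resolution batch to the current stage (a sketch under assumed stage sizes; the real code's schedule and step counts may differ):

```python
# Sketch of progressive training: train at each resolution in turn,
# resizing batches to the current stage. Stages are illustrative.
import torch
import torch.nn.functional as F

resolutions = [64, 128, 256, 512]
full = torch.rand(2, 3, 512, 512)   # dummy full-resolution batch

for res in resolutions:
    x = F.interpolate(full, size=(res, res),
                      mode="bilinear", align_corners=False)
    # ... forward pass / loss / optimizer step at this resolution ...
    print(res, tuple(x.shape))      # e.g. 64 (2, 3, 64, 64)
```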

I gave up training across CelebA and overfit to one pair of images instead...

Training progress so far: Screenshot from 2024-06-30 14-10-09

UPDATE - Sunday night

So I had some battle with gradient explosions.

I ended up having to add some gradient accumulation steps, which helped stabilize things: https://github.com/johndpope/SPEAK-hack/pull/3
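Gradient accumulation averages the loss over several micro-batches before each optimizer step, which smooths updates; clipping is a common companion fix for explosions. A minimal sketch (model, data, and the clip value are dummies, not the PR's actual code):

```python
# Sketch: accumulate gradients over accum_steps micro-batches, then
# (optionally) clip and step. Model/data are illustrative dummies.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
accum_steps = 4

opt.zero_grad()
for i in range(8):
    x, y = torch.rand(16, 10), torch.rand(16, 1)
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                  # gradients accumulate across calls
    if (i + 1) % accum_steps == 0:
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # guard vs. explosion
        opt.step()
        opt.zero_grad()
```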

Looks like the learning rate is getting things into a minimum... debug_step_3500_resolution_64

UPDATE - I switched to 256x256 because ResNet-50 can't return rich (2048, 7, 7) features for images smaller than 224x224.

Screenshot from 2024-07-01 09-50-51