fenghe12 opened 2 months ago
Hi @fenghe12
I'm actively building out the IMF neural video codec - it's another Microsoft paper (not Stable Diffusion) https://github.com/johndpope/IMF/branches. I got the model working / training - in a way - it's superior to MegaPortraits - no keypoints / no warping. It's decoder-centric. Have a read of the paper - it's quite intriguing - and I've been able to plug in / upgrade different modules to make it better.
Here's the training for IMF: https://wandb.ai/snoozie/IMF/runs/xscj3hjo?nw=nwusersnoozie
The reconstructed image is driven by 32 floats with some StyleGAN modulation. It's very lightweight. It's working - but I'm struggling to get it onto any client (WASM / iOS / ONNX ...) without breaking the model or degrading it to the point of being unusable.
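For context, the export attempt itself is the easy part - it's the ops inside the model that break on each target. A minimal sketch of the ONNX route, assuming the decoder is a plain nn.Module taking the 32-float latent plus a reference feature map (module names and shapes here are hypothetical placeholders, not the actual IMF code):

```python
import torch

# Sketch only - names and shapes are hypothetical placeholders.
def export_decoder(decoder: torch.nn.Module, out_path: str = "imf_decoder.onnx"):
    decoder.eval()
    latent = torch.randn(1, 32)              # the 32-float code driving reconstruction
    ref_feats = torch.randn(1, 512, 64, 64)  # assumed reference feature map
    torch.onnx.export(
        decoder,
        (latent, ref_feats),
        out_path,
        input_names=["latent", "ref_feats"],
        output_names=["reconstruction"],
        opset_version=17,
        dynamic_axes={"latent": {0: "batch"}, "ref_feats": {0: "batch"}},
    )
```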
Hopefully Google can fix this - https://github.com/google-ai-edge/ai-edge-torch/issues/305
https://github.com/AlexanderLutsenko/nobuco - there's this library to convert PyTorch to TensorFlow.js - but it's a real headache because TensorFlow uses BHWC and PyTorch uses BCHW, so all the logic gets flip-flopped around.
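The layout mismatch itself is just a permute - the headache is that every conv / reshape / channel-dim slice downstream has to agree with the new order. For reference:

```python
import torch

x_bchw = torch.randn(1, 3, 256, 256)              # PyTorch layout: (B, C, H, W)
x_bhwc = x_bchw.permute(0, 2, 3, 1)               # TensorFlow/TF.js layout: (B, H, W, C)
x_back = x_bhwc.permute(0, 3, 1, 2).contiguous()  # and back again
assert torch.equal(x_bchw, x_back)
```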
Working through this paper - the performance of VASA is kinda unique. I've almost exhausted IMF now - and will circle back to take another look at this paper with fresh eyes.
UPDATE
I dumped a bunch of fresh code from Claude - the plan is to get the dataset working / validating... and then wire up the training. https://github.com/johndpope/VASA-1-hack/blob/main/dataset_testing.py
There's some flux here in the code / models - I need to adjust the code to use YAML configs / Accelerate. https://github.com/johndpope/VASA-1-hack/blob/main/train.py#L746
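The target shape is roughly this - a sketch only, assuming OmegaConf for the YAML side and HF Accelerate for device placement; the model / dataset below are toy stand-ins, not the real VASA modules:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from omegaconf import OmegaConf

# Placeholder config - the real one would be loaded from a YAML file, e.g.
# cfg = OmegaConf.load("configs/vasa.yaml")
cfg = OmegaConf.create({"lr": 1e-4, "batch_size": 4, "mixed_precision": "no"})

accelerator = Accelerator(mixed_precision=cfg.mixed_precision)

# Toy stand-ins for the real motion-generator model and dataset.
model = torch.nn.Linear(32, 32)
dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 32))
loader = DataLoader(dataset, batch_size=cfg.batch_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.lr)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)   # Accelerate handles scaling / device placement
    optimizer.step()
    optimizer.zero_grad()
```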
UPDATE - I added a VASADatasetTester - the dataset / emotion detection testing is running and passing.
UPDATE
I cherry-picked the models from MegaPortraits to do the encoding for stage 1:
python train_stage_1.py
But I'm hitting OOM - I don't remember this being broken - will have to debug https://github.com/johndpope/MegaPortrait-hack
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.59 GiB of which 173.75 MiB is free. Process 4093651 has 262.20 MiB memory in use. Including non-PyTorch memory, this process has 20.43 GiB memory in use. Of the allocated memory 20.02 GiB is allocated by PyTorch, and 38.79 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
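The first lever per that message is the allocator config; the rest are the usual generic OOM levers (not a confirmed fix for this repo, just what I'd try first):

```python
import os
# Must be set before the first CUDA allocation (easiest: before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# Other generic levers while debugging: smaller batch size / gradient accumulation,
# mixed precision (torch.autocast), gradient checkpointing on the heavy encoders.
if torch.cuda.is_available():
    print(torch.cuda.memory_summary())
```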
Update - Monday 5th November - I partially solved the memory problem https://wandb.ai/snoozie/megaportraits/overview
It's training stage 1 - here's an overfitting example. The code has been updated to do warping.
Next step - attempt to import this into stage 2. There are a million more things to do for stage 1 to match the MegaPortraits paper - they have high-res / distillation training (teacher / student), which is notably missing. EmoPortraits has many losses that could also be added.
But I'm more interested in testing out my latest VASA motion generator code.
Update Nov 19th
So I abandoned the MegaPortraits code / logic and cherry-picked the EmoPortraits volumetric avatar - this is SOTA, albeit crippled with a Creative Commons license - I have some code that is not released (I don't want my code tainted with CC).
Canonical Volume: [B, T, C, D, H, W] = [1, 50, 96, 16, 64, 64]
Size = 1 × 50 × 96 × 16 × 64 × 64 × 4 bytes (float32) ≈ 100 MB per window
ID Embed: We only save one per video, so this is negligible
For a 5-second video at 30 fps = 150 frames:
Number of windows = (150 - 50) / 25 + 1 = 5 windows (due to 50% overlap)
Total data per video = 5 windows × 100 MB ≈ 500 MB
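The windowing arithmetic as a quick sketch (window of 50 frames, 25-frame stride for the 50% overlap):

```python
def window_starts(num_frames: int, window: int = 50, stride: int = 25):
    """Start indices of sliding windows with 50% overlap."""
    return list(range(0, num_frames - window + 1, stride))

starts = window_starts(150)   # 5-second clip at 30 fps
print(len(starts), starts)    # 5 [0, 25, 50, 75, 100]
```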
To get this into the diffusion transformer I hit OOM errors - and basically hit a wall with the 3090 GPU. I had a rethink: extract the stage 1 features up front and save them to an h5 file. I'm gobsmacked at how much data is necessary to store this. Looking to tweak this somehow before attempting stage 2 training again.
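The cache layout is roughly this - a sketch assuming h5py, one dataset per window plus a single ID embed; casting to float16 and gzip compression are the obvious tweaks to try (lossy / slower respectively, not yet verified against reconstruction quality):

```python
import h5py
import numpy as np

def save_video_features(path, volumes, id_embed):
    """Cache stage-1 outputs: one canonical volume per window, one ID embed per video."""
    with h5py.File(path, "w") as f:
        f.create_dataset("id_embed", data=id_embed)
        grp = f.create_group("canonical_volumes")
        for i, vol in enumerate(volumes):        # vol: (50, 96, 16, 64, 64) float32
            grp.create_dataset(
                f"window_{i:03d}",
                data=vol.astype(np.float16),     # halves the on-disk footprint
                compression="gzip",
            )
```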
We collected some face video data and constructed a new dataset (named FaceVid-1K) - it may be available next month.