johndpope / VASA-1-hack

Using Claude Opus to reverse engineer code from VASA white paper - WIP - (this is for La Raza 🎷)
https://www.microsoft.com/en-us/research/project/vasa-1/
MIT License
206 stars 24 forks

Are you planning on training? #5

Open zsxkib opened 5 months ago

zsxkib commented 5 months ago

Lmk if you're planning on training - I could maybe help.

johndpope commented 5 months ago

I want the VoxCeleb2 dataset - it doesn't seem to be available anymore and the torrents are dead. I did write a data loader in my Emote-hack repo - I will wire it up in a few days.

Realistically, it's better to press for the VOODOO3D code to be released. This is kind of just an academic exercise.

francqz31 commented 5 months ago

Hey, I just found your repo - tell me if these work for you. This should be the full VoxCeleb2 dataset. 1 - URLs and timestamps: https://fex.net/s/lmaobde

2 - Audio files: dev parts A through H (download links), a concatenated dev archive, and the test set. Download all parts and concatenate the files using `cat vox2_dev_aac* > vox2_aac.zip`.

Video files: dev parts A through I (download links), a concatenated dev archive, and the test set. Download all parts and concatenate the files using `cat vox2_dev_mp4* > vox2_mp4.zip`.
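
If `cat` isn't available (e.g. on Windows), here's a minimal Python equivalent of that concatenation step, assuming the part files sit in the current directory and match the patterns above:

```python
# Concatenate the downloaded VoxCeleb2 parts into a single zip,
# equivalent to: cat vox2_dev_mp4* > vox2_mp4.zip
import glob
import shutil

with open("vox2_mp4.zip", "wb") as out:
    for part in sorted(glob.glob("vox2_dev_mp4*")):
        with open(part, "rb") as src:
            shutil.copyfileobj(src, out)
```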

johndpope commented 5 months ago

Actually on holiday - away from my workstation with CUDA - so I can't run this.

These edits were from ChatGPT - nowadays I'm almost exclusively using Claude:

```python
#         # Generate holistic facial dynamics using the diffusion transformer
#         audio_features = batch['audio']
#         gaze_direction = batch['gaze']
#         head_distance = batch['distance']
#         emotion_offset = batch['emotion']
```

If you look here, Claude spat this out, and it seems more closely aligned to the VASA paper.

https://github.com/johndpope/VASA-1-hack/blob/main/train.py
```python
# # Extract keypoints from the generated dynamics
# kp_s = generated_dynamics[:, :, :3]  # Source keypoints
# kp_d = generated_dynamics[:, :, 3:]  # Driving keypoints

# # Compute the rotation matrices
# Rs = torch.eye(3).unsqueeze(0).repeat(kp_s.shape[0], 1, 1)  # Source rotation matrix
# Rd = torch.eye(3).unsqueeze(0).repeat(kp_d.shape[0], 1, 1)  # Driving rotation matrix

# # Call the MotionFieldEstimator
# deformation, occlusion, occlusion_2 = motion_field_estimator(appearance_volume, kp_s, kp_d, Rs, Rd)
```

I have to plug this back in as context for Claude. https://github.com/johndpope/VASA-1-hack/blob/5532d1d2324053900b3a2f73ba2ed9e160fd8b0d/modules/real3d/facev2v_warp/model.py#L137
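
For reference, here is a minimal sketch of one way the conditioning signals above (audio, gaze, distance, emotion) could be fed into a diffusion-transformer denoiser. All class and argument names here are hypothetical illustrations, not the repo's or VASA's actual API, and the diffusion timestep embedding is omitted for brevity:

```python
# Hypothetical sketch only - names, dims and shapes are assumptions, not this repo's API.
import torch
import torch.nn as nn

class ConditionedMotionDiT(nn.Module):
    def __init__(self, motion_dim=70, cond_dim=512 + 3 + 1 + 8, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.cond_in = nn.Linear(cond_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.motion_out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio_features, gaze_direction, head_distance, emotion_offset):
        # noisy_motion: [N, T, motion_dim]; conditioning signals are per-clip vectors here.
        cond = torch.cat([audio_features, gaze_direction, head_distance, emotion_offset], dim=-1)
        h = self.motion_in(noisy_motion) + self.cond_in(cond).unsqueeze(1)
        return self.motion_out(self.blocks(h))  # predicted denoised motion latents
```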

trithucxx commented 5 months ago

> I want the VoxCeleb2 dataset - it doesn't seem to be available anymore and the torrents are dead. I did write a data loader in my Emote-hack repo - I will wire it up in a few days.
>
> In the diffusion transformer architecture, from what I understand, they use patches - and I don't see those in the code (spat out by Claude).
>
> Realistically, it's better to press for the VOODOO3D code to be released. This is kind of just an academic exercise.

I have downloaded it; it still works. Anyway, how can I start/run your project, bro?

johndpope commented 5 months ago

Hi @trithucxx -

I'm looking at booting up MegaPortrait by upgrading the training for this repo - https://github.com/johndpope/MegaPortrait/ . @kevinFringe had used a couple of directories, but I have some code in the works with decord / mp4s: https://github.com/johndpope/Emote-hack/blob/main/Net.py#L1085

For now, this model Eapp1 needs to be 100% right - otherwise everything else isn't going to work. Or maybe this volumetric part can be sourced from another repo? Can this do it? IDK - https://real3dportrait.github.io/

This is the first part of the Appearance Encoder, which generates a 4D tensor of volumetric features - compare https://github.com/johndpope/MegaPortrait/blob/master/model.py#L82

UPDATE: I'm pretty sure we can piggyback off the VOODOO3D paper (code in June).

trithucxx commented 5 months ago


I tested Real3DPortrait; it seems to be inaccurate, and it takes about 3 hours to complete a 2-minute talking video (too long). What about the torrent you could not download? Hope to see your project run.

johndpope commented 4 months ago

So a few days ago I was looking at some other code. Basically, Claude thinks there's enough to avoid needing the MegaPortrait code - specifically, for the 4D tensor of volumetric features, this supposedly handles it:

```python
# In the trainer: self.appearance_extractor = AppearanceFeatureExtractor()

import torch.nn as nn
# ConvBlock2D, DownBlock2D and ResBlock3D are the repo's facev2v_warp
# building blocks (import path assumed).

class AppearanceFeatureExtractor(nn.Module):
    # 3D appearance feature extractor. Shape progression:
    #   input        [N, 3, 256, 256]
    #   in_conv      [N, 64, 256, 256]
    #   down (x2)    [N, 128, 128, 128] -> [N, 256, 64, 64]
    #   mid_conv     [N, 512, 64, 64]
    #   reshape/res  [N, 32, 16, 64, 64]
    def __init__(self, model_scale='standard'):
        super().__init__()
        use_weight_norm = False
        down_seq = [64, 128, 256]
        n_res = 6
        C = 32   # channels of the volumetric features
        D = 16   # depth of the volumetric features
        self.in_conv = ConvBlock2D("CNA", 3, down_seq[0], 7, 1, 3, use_weight_norm)
        self.down = nn.Sequential(*[DownBlock2D(down_seq[i], down_seq[i + 1], use_weight_norm)
                                    for i in range(len(down_seq) - 1)])
        self.mid_conv = nn.Conv2d(down_seq[-1], C * D, 1, 1, 0)
        self.res = nn.Sequential(*[ResBlock3D(C, use_weight_norm) for _ in range(n_res)])

        self.C, self.D = C, D

    def forward(self, x):
        x = self.in_conv(x)                  # [N, 64, 256, 256]
        x = self.down(x)                     # [N, 256, 64, 64]
        x = self.mid_conv(x)                 # [N, C*D, 64, 64]
        N, _, H, W = x.shape
        x = x.view(N, self.C, self.D, H, W)  # [N, 32, 16, 64, 64]
        x = self.res(x)                      # 3D residual refinement
        return x
```
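
A quick shape smoke test, assuming the repo's ConvBlock2D / DownBlock2D / ResBlock3D building blocks import cleanly:

```python
# Verify the shape progression documented above.
import torch

extractor = AppearanceFeatureExtractor()
frames = torch.randn(2, 3, 256, 256)   # [N, 3, 256, 256] RGB crops
volume = extractor(frames)
print(volume.shape)                    # expected: torch.Size([2, 32, 16, 64, 64])
```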


francqz31 commented 4 months ago

With all due respect, I don't actually believe that Opus or even current LLMs (GPT-4 Turbo, Opus, Google's latest thing whatever the name, Llama 3 400B, etc.) can accurately implement machine learning papers. I tried it multiple times and it just misses so many points, and makes really simple mistakes, as if it doesn't even have a clue what it is writing. Good thing, Mr John, that you document every step. I think your best shot will be with GPT-5. In order to have an advanced LLM implement a machine learning paper, you've got to have some kind of agentic thing like Devin but with the reasoning of GPT-5, for example: you provide the paper plus code similar to the paper you want to implement (for example, you upload the VASA-1 paper and make it fully read the Audio2Head code) and then it develops off of that, just like professional software engineers. What do you think, Mr John?

francqz31 commented 4 months ago

If GPT-5 can't do that, then good luck having any kind of LLM implement any machine learning paper before 2026.

johndpope commented 4 months ago

@francqz31 I agree with most of your thoughts. The world will be a different place when GPT-5 drops. I'd add: don't use ChatGPT-4, use Opus - and if the code it's spitting out is (or feels) off, discard the chat and start afresh with updates. E.g. base code + paper / increment the logic / the LLM goes off on a wrong tangent / discard the chat / feed it the updated code and give it even more context - header files or relevant code from other repos, etc.

I completely rebuilt the MegaPortrait codebase - https://github.com/johndpope/megaPortrait-hack - still need to wire up the dataloaders; can't decide on the best approach. https://github.com/johndpope/MegaPortrait-hack/issues/2
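
As a placeholder while deciding, here's a minimal sketch of the kind of decord-backed video dataset mentioned earlier (hypothetical names, not the repo's actual loader):

```python
# Hypothetical sketch of a decord-backed video dataset.
import torch
from torch.utils.data import Dataset
from decord import VideoReader, cpu

class VideoFrameDataset(Dataset):
    def __init__(self, video_paths):
        self.video_paths = video_paths

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        vr = VideoReader(self.video_paths[idx], ctx=cpu(0))
        # Grab the first frame as an example; real training would sample source/driving pairs.
        frame = torch.from_numpy(vr[0].asnumpy()).permute(2, 0, 1).float() / 255.0
        return frame  # [3, H, W]
```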

UPDATE - I found some loss functions from SamsungLabs in the ROME repo.

This work at SamsungLabs would flow on from MegaPortraits.

UPDATE @francqz31 - maybe too early to call it, but I've just started training MegaPortrait: https://github.com/johndpope/MegaPortrait-hack


johndpope commented 4 months ago

OK - so it took me a month, but I believe I've got the dependent paper, MegaPortraits, implemented: https://github.com/johndpope/MegaPortrait-hack/tree/main. There's actually going to be a new code upgrade with video data from FB dropping in July '24 - https://github.com/neeek2303/EMOPortraits

I am running local training on a couple of videos: https://github.com/johndpope/MegaPortrait-hack/pull/21

The interesting thing with this paper is that there are no keypoints - it's all ResNet feature maps with warping. UPDATE: running some numbers past ChatGPT - at 250 seconds/epoch, 200,000 epochs will take roughly 2 years on a 3090, or about 2 months on an H100.
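
Rough arithmetic behind that estimate (the ~10x H100 speedup is an assumption, not a benchmark):

```python
# Back-of-the-envelope check of the epoch-time numbers above.
seconds_per_epoch = 250
epochs = 200_000

days_3090 = seconds_per_epoch * epochs / 86_400
print(f"3090: ~{days_3090:.0f} days (~{days_3090 / 365:.1f} years)")

h100_speedup = 10  # assumed, not benchmarked
print(f"H100: ~{days_3090 / h100_speedup:.0f} days (~{days_3090 / h100_speedup / 30:.1f} months)")
```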

UPDATE 2 - some warping code is taking a long time, so I've chopped it out for now.

https://github.com/johndpope/MegaPortrait-hack/pull/28

fenghe12 commented 3 months ago

Do you still need a talking-head video dataset? We collected some.

johndpope commented 3 months ago

Hi @fenghe12 - sorry for the late reply - I've been distracted recreating the code for this paper: https://arxiv.org/pdf/2405.07257 https://github.com/johndpope/SPEAK-hack

I would appreciate any help cross-checking the code against the paper. I've included some test inference code.

If you want to share a link to the videos, I'm happy to grab them.

johndpope commented 2 months ago

this paper by Microsoft - Implicit Motion Function https://openaccess.thecvf.com/content/CVPR2024/papers/Gao_Implicit_Motion_Function_CVPR_2024_paper.pdf

I've recreated it here: https://github.com/johndpope/IMF

(Assume it's all wrong - I had to switch in ResNets as the feature extractor (not mentioned in the paper), yet it seems to be converging.) https://wandb.ai/snoozie/IMF/runs/f9o9vvje?nw=nwusersnoozie
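
For illustration, a minimal sketch of that kind of ResNet feature-extractor swap: a truncated torchvision backbone kept spatial (this is not the IMF repo's actual code):

```python
# Sketch: use a truncated torchvision ResNet as a spatial feature extractor.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ResNetFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep everything up to layer3; drop layer4/avgpool/fc to retain spatial maps.
        self.features = nn.Sequential(*list(backbone.children())[:-3])

    def forward(self, x):
        # x: [N, 3, 256, 256] -> [N, 256, 16, 16] feature maps
        return self.features(x)
```

Truncating before the pooling/classifier head keeps spatial feature maps that downstream warping or decoding stages can consume.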

UPDATE - sorry - this needs completely redoing - https://github.com/johndpope/IMF/tree/v1