zsxkib opened 5 months ago
I want the VoxCeleb2 dataset - it doesn't seem to be available anymore - the torrents are dead. I did write a data loader in my Emote-hack repo - I will wire it up in a few days.
In the diffusion transformer architecture - from what I understand - they use patches, and I don't see those in the code (spat out by Claude).
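For reference, DiT-style patching can be sketched in a few lines of numpy - this is my own illustration of what "patches" means here (function name and shapes are mine, not from the paper or repo):

```python
import numpy as np

def patchify(img, patch):
    """Split an [H, W, C] image into non-overlapping patch tokens,
    the way a DiT-style transformer tokenises inputs before embedding."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    gh, gw = H // patch, W // patch
    x = img.reshape(gh, patch, gw, patch, C)       # split both spatial axes
    x = x.transpose(0, 2, 1, 3, 4)                 # [gh, gw, patch, patch, C]
    return x.reshape(gh * gw, patch * patch * C)   # one flat token per patch

tokens = patchify(np.zeros((256, 256, 3)), 16)
print(tokens.shape)  # (256, 768)
```

Each token then gets a linear embedding plus positional information - if the generated code has no step like this, it's probably not implementing the transformer the way the paper describes.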
Realistically, it's better to press for the VOODOO3D code to be released. This is kind of just an academic exercise.
Hey, I just found your repo - tell me if these work for you. This should be the full VoxCeleb2 dataset.
1 - URLs and timestamps: https://fex.net/s/lmaobde
2 - Audio files:
Dev A: Download
Dev B: Download
Dev C: Download
Dev D: Download
Dev E: Download
Dev F: Download
Dev G: Download
Dev H: Download
Dev: Concatenated
Test: Download
Download all parts and concatenate the files using the command `cat vox2_dev_aac* > vox2_aac.zip`.
Video files:
Dev A: Download
Dev B: Download
Dev C: Download
Dev D: Download
Dev E: Download
Dev F: Download
Dev G: Download
Dev H: Download
Dev I: Download
Dev: Concatenated
Test: Download
Download all parts and concatenate the files using the command `cat vox2_dev_mp4* > vox2_mp4.zip`.
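The parts have to land in order for the resulting zip to be valid; the shell glob expands in lexical order, which matches the a..i part suffixes. A quick sanity check of the pattern with tiny stand-in files (not the real downloads):

```shell
# stand-ins for the real vox2_dev_mp4part* downloads
printf 'partA' > vox2_dev_mp4parta
printf 'partB' > vox2_dev_mp4partb
# the glob expands lexically, so parts concatenate in a..b order
cat vox2_dev_mp4* > vox2_mp4.zip
cat vox2_mp4.zip   # prints: partApartB
```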
Actually on holiday - away from my workstation with CUDA - so I can't run this.
These edits were from ChatGPT - nowadays I'm almost exclusively using Claude.
```python
# # Generate holistic facial dynamics using the diffusion transformer
# audio_features = batch['audio']
# gaze_direction = batch['gaze']
# head_distance = batch['distance']
# emotion_offset = batch['emotion']
```
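Assuming those batch keys, a dummy batch for shape-testing the conditioning path might look like this - the feature dimensions here are my guesses for illustration, not values from the VASA-1 paper:

```python
import numpy as np

B, T = 4, 100  # batch size and frame count (illustrative only)
batch = {
    'audio':    np.zeros((B, T, 768), dtype=np.float32),  # e.g. wav2vec-style features
    'gaze':     np.zeros((B, T, 2),   dtype=np.float32),  # yaw/pitch direction
    'distance': np.zeros((B, T, 1),   dtype=np.float32),  # head-to-camera scale
    'emotion':  np.zeros((B, T, 8),   dtype=np.float32),  # emotion offset embedding
}
for k, v in batch.items():
    print(k, v.shape)
```

Useful mainly for wiring up the dataloader before the real feature extractors exist.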
If you look here - Claude spat this out - and it seems more closely aligned with the VASA paper.
https://github.com/johndpope/VASA-1-hack/blob/main/train.py
```python
# # Extract keypoints from the generated dynamics
# kp_s = generated_dynamics[:, :, :3] # Source keypoints
# kp_d = generated_dynamics[:, :, 3:] # Driving keypoints
# # Compute the rotation matrices
# Rs = torch.eye(3).unsqueeze(0).repeat(kp_s.shape[0], 1, 1) # Source rotation matrix
# Rd = torch.eye(3).unsqueeze(0).repeat(kp_d.shape[0], 1, 1) # Driving rotation matrix
# # Call the MotionFieldEstimator
# deformation, occlusion, occlusion_2 = motion_field_estimator(appearance_volume, kp_s, kp_d, Rs, Rd)
```
Have to plug this back in as context to Claude. https://github.com/johndpope/VASA-1-hack/blob/5532d1d2324053900b3a2f73ba2ed9e160fd8b0d/modules/real3d/facev2v_warp/model.py#L137
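The identity `Rs`/`Rd` above are just placeholders - the warp needs real per-frame rotations. A numpy sketch of batched keypoint rotation (my own illustration of the math, not code from the repo):

```python
import numpy as np

def yaw_matrix(theta):
    """Rotation about the vertical axis, returns [3, 3]."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

B, K = 2, 15                           # batch size, keypoints per face
kp = np.random.randn(B, K, 3)          # [B, K, 3] canonical keypoints
R = np.stack([yaw_matrix(0.1)] * B)    # [B, 3, 3] per-sample rotations
kp_rot = np.einsum('bij,bkj->bki', R, kp)  # rotate every keypoint
print(kp_rot.shape)  # (2, 15, 3)
```

A quick invariant worth testing when wiring this up: a proper rotation preserves each keypoint's norm.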
I have downloaded them and they still work. Anyway, how can I start/run your project, bro?
Hi @trithucxx -
I'm looking at booting up MegaPortrait by upgrading the training for this repo - https://github.com/johndpope/MegaPortrait/ - @kevinFringe had used a couple of directories, but I have some code in the works with decord / mp4s: https://github.com/johndpope/Emote-hack/blob/main/Net.py#L1085
For now, this model Eapp1 needs to be 100% right - otherwise everything else isn't going to work. Or maybe the volumetric features can be sourced from another repo? Can this do it? IDK - https://real3dportrait.github.io/
This is the first part of the Appearance Encoder, which generates a 4D tensor of volumetric features - vs. https://github.com/johndpope/MegaPortrait/blob/master/model.py#L82
UPDATE: I'm pretty sure we can piggyback off the VOODOO3D paper (code in June)
- even though it's for NeRF. https://arxiv.org/pdf/2312.04651
I tested Real3DPortrait; it seems to be inaccurate, and it takes 3 hours to complete a 2-minute talking video (too long). As for the torrents, I could not download them. Hope to see your project run.
So a few days ago I was looking at some other code - basically Claude thinks there's enough to avoid needing the MegaPortrait code; specifically, this supposedly handles the 4D tensor of volumetric features.
```python
# usage inside the parent module:
self.appearance_extractor = AppearanceFeatureExtractor()

class AppearanceFeatureExtractor(nn.Module):
    # 3D appearance features extractor
    # Shape trace: [N,3,256,256] -> [N,64,256,256] -> [N,128,128,128]
    #           -> [N,256,64,64] -> [N,512,64,64] -> [N,32,16,64,64]
    def __init__(self, model_scale='standard'):
        super().__init__()
        use_weight_norm = False
        down_seq = [64, 128, 256]
        n_res = 6
        C = 32
        D = 16
        self.in_conv = ConvBlock2D("CNA", 3, down_seq[0], 7, 1, 3, use_weight_norm)
        self.down = nn.Sequential(*[DownBlock2D(down_seq[i], down_seq[i + 1], use_weight_norm)
                                    for i in range(len(down_seq) - 1)])
        self.mid_conv = nn.Conv2d(down_seq[-1], C * D, 1, 1, 0)
        self.res = nn.Sequential(*[ResBlock3D(C, use_weight_norm) for _ in range(n_res)])
        self.C, self.D = C, D

    def forward(self, x):
        x = self.in_conv(x)    # [N,3,256,256] -> [N,64,256,256]
        x = self.down(x)       # -> [N,256,64,64]
        x = self.mid_conv(x)   # -> [N,512,64,64] (C*D channels)
        N, _, H, W = x.shape
        x = x.view(N, self.C, self.D, H, W)  # lift to volume [N,32,16,64,64]
        x = self.res(x)
        return x
```
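The key trick there is `mid_conv` lifting 2D features to a volume: a 1x1 conv outputs `C*D` channels, which `view()` then splits into separate channel (C=32) and depth (D=16) axes. A numpy check of that reshape (my own sanity-check, not repo code):

```python
import numpy as np

N, C, D, H, W = 2, 32, 16, 64, 64
x2d = np.arange(N * C * D * H * W, dtype=np.float32).reshape(N, C * D, H, W)
vol = x2d.reshape(N, C, D, H, W)  # same split torch's view() performs
# channel c, depth d of the volume is 2D channel c*D + d
assert np.array_equal(vol[:, 3, 5], x2d[:, 3 * D + 5])
print(vol.shape)  # (2, 32, 16, 64, 64)
```

So each group of 16 consecutive conv channels becomes one channel's depth column - no learned 3D lifting, just a reinterpretation of the channel axis.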
With all due respect, I don't actually believe that Opus or even current LLMs (GPT-4 Turbo, Opus, Google's latest thing whatever the name, Llama 3 400B, etc.) can accurately implement machine learning papers. I tried it multiple times and it just misses so many points, and makes really simple mistakes, as if it doesn't even have a clue what it is writing. Good thing, Mr John, that you document every step. I think your best shot will be with GPT-5. In order to have an advanced LLM implement a machine learning paper, you've got to have some kind of agentic thing like Devin, but with the reasoning of GPT-5 for example: you provide the paper and code similar to the paper you want to implement (for example, you upload the VASA-1 paper and make it fully read the Audio2Head code), and then it develops off of that, just like a professional software engineer. What do you think, Mr John?
If GPT-5 can't do that, then good luck having any kind of LLM implement any machine learning paper before 2026.
@francqz31 I agree with most of your thoughts. The world will be a different place when GPT-5 drops. I'd add: don't use ChatGPT-4, use Opus - and if the code it's spitting out is, or feels, off, discard the chat and start afresh with updates. E.g. base code + paper / increment logic / LLM goes off on a wrong tangent / discard chat / feed it updated code and even give it more context - header files or relevant code from other repos, etc.
I completely rebuilt the MegaPortrait codebase - https://github.com/johndpope/megaPortrait-hack - and need to wire up the dataloaders. Can't decide on the best approach: https://github.com/johndpope/MegaPortrait-hack/issues/2
UPDATE: I found some loss functions from SamsungLabs in the ROME repo.
That work at SamsungLabs would flow on from MegaPortraits.
UPDATE @francqz31 - maybe too early to call it, but I just started training MegaPortrait: https://github.com/johndpope/MegaPortrait-hack
OK - so it took me a month, but I believe I got the dependent paper, MegaPortraits, implemented: https://github.com/johndpope/MegaPortrait-hack/tree/main. There's actually going to be a new code upgrade, with video data from FB, dropping in July '24 - https://github.com/neeek2303/EMOPortraits
I am running local training on a couple of videos: https://github.com/johndpope/MegaPortrait-hack/pull/21
The interesting thing with this paper is that there are no keypoints - it's all ResNet feature maps with warping. UPDATE: running some numbers past ChatGPT - at 250 seconds/epoch, 200,000 epochs will take like 2 years on a 3090 / 2 months on an H100.
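Sanity-checking that 3090 estimate (just the arithmetic - the epoch time and epoch count are his figures):

```python
secs_per_epoch = 250
epochs = 200_000
total_s = secs_per_epoch * epochs   # 50,000,000 seconds
days = total_s / 86_400             # seconds per day
print(f"{days:.0f} days (~{days / 365:.1f} years) on a 3090")
# → 579 days (~1.6 years) on a 3090
```

So "about 2 years" is in the right ballpark, assuming a constant 250 s/epoch.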
UPDATE 2: some warping code is taking a long time - I've chopped it out for now.
Do you still need a talking-head video dataset? We collected some.
Hi @fenghe12 - sorry for the late reply - I've been distracted recreating the code for this paper: https://arxiv.org/pdf/2405.07257 https://github.com/johndpope/SPEAK-hack
I would appreciate any help cross-checking the code against the paper. I've included some test inference code.
If you want to share a link to the videos, I'm happy to grab them.
This paper by Microsoft - Implicit Motion Function: https://openaccess.thecvf.com/content/CVPR2024/papers/Gao_Implicit_Motion_Function_CVPR_2024_paper.pdf
I recreated it here: https://github.com/johndpope/IMF
(Assume it's all wrong - I had to swap in ResNets as the feature extractor (it's not mentioned in the paper), yet it seems to be converging.) https://wandb.ai/snoozie/IMF/runs/f9o9vvje?nw=nwusersnoozie
UPDATE - sorry, this needs completely redoing: https://github.com/johndpope/IMF/tree/v1
Let me know if you're planning on training - I could maybe help.