johndpope / VASA-1-hack

Using Claude Opus to reverse engineer code from VASA white paper - WIP - (this is for La Raza 🎷)
https://www.microsoft.com/en-us/research/project/vasa-1/
MIT License

Roadmap #3

Open johndpope opened 3 months ago

johndpope commented 3 months ago

Expanding the provided code to fully recreate the VASA-1 system as described in the research paper would require a significant amount of additional code and architectural changes. Here's a high-level outline of the key components and changes needed:

Expressive and Disentangled Face Latent Space:
- Implement the 3D-aided face representation using a canonical 3D appearance volume, identity code, 3D head pose, and facial dynamics code.
- Design and implement the losses for learning the disentangled latent space, including the pairwise head pose and facial dynamics transfer loss and the face identity similarity loss for cross-identity motion transfer (see the sketch after this outline).
- Train the face encoder and decoder on the VoxCeleb2 dataset to learn the disentangled latent space.

Holistic Facial Dynamics Generation with Diffusion Transformer:
- Implement the diffusion transformer model for generating holistic facial dynamics and head motion sequences conditioned on audio and other control signals.
- Modify the training pipeline to use the denoising score matching objective for the diffusion model.
- Implement the conditioning signals, including the audio features, main eye gaze direction, head-to-camera distance, and emotion offset.
- Apply classifier-free guidance during inference for controllable generation.

Talking Face Video Generation:
- Modify the generator architecture to take the generated motion latent codes and the appearance and identity features from the face encoder as input.
- Implement the sliding-window approach for generating long video sequences efficiently.

Training and Evaluation:
- Prepare the training datasets, including the preprocessed VoxCeleb2 dataset and the high-resolution talk video dataset.
- Implement the training loop with the appropriate losses and optimization techniques.
- Evaluate the generated videos using metrics like audio-lip synchronization (SyncNet), audio-pose alignment (CAPP score), pose variation intensity, and Fréchet Video Distance (FVD).

Real-time Optimization:
- Optimize the model architectures and inference pipeline to achieve real-time generation of high-resolution videos (e.g., 512x512 at 40 FPS).
- Implement techniques like model compression, efficient architectures, and parallel processing to reduce computational overhead.

Additional Features:
- Incorporate support for controllable generation using the optional conditioning signals like main gaze direction, head distance, and emotion offset.
- Implement the capability to handle out-of-distribution images and audio inputs, such as artistic photos, singing audio clips, and non-English speech.

Please note that implementing all these components and changes would require a significant amount of code, as well as access to the necessary datasets and computational resources. The provided code serves as a starting point, but substantial modifications and additions would be needed to fully recreate the VASA-1 system.
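As a rough illustration of the latent-space items above, here is a minimal sketch of cross-identity motion transfer and an identity-similarity loss. The factor names mirror the paper's description, but the encoder/decoder signatures and the `face_embedder` are my assumptions, not this repo's actual API:

```python
import torch.nn.functional as F

# Hypothetical interfaces; factor names follow the paper's description
# (appearance volume, identity code, head pose, facial dynamics), but the
# encoder/decoder signatures are assumptions, not this repo's actual API.
def transfer_motion(encoder, decoder, source_img, driver_img):
    """Cross-identity motion transfer: keep the source's appearance and identity,
    borrow the driver's head pose and facial dynamics."""
    src = encoder(source_img)   # assumed to return a dict of disentangled factors
    drv = encoder(driver_img)
    return decoder(
        appearance=src["appearance_volume"],   # canonical 3D appearance volume
        identity=src["identity_code"],         # identity code
        head_pose=drv["head_pose"],            # 3D head pose taken from the driver
        dynamics=drv["facial_dynamics"],       # facial dynamics taken from the driver
    )

def identity_similarity_loss(face_embedder, generated, source):
    """Cosine-similarity identity loss for cross-identity transfer (sketch only)."""
    e_gen = F.normalize(face_embedder(generated), dim=-1)
    e_src = F.normalize(face_embedder(source), dim=-1)
    return 1.0 - (e_gen * e_src).sum(dim=-1).mean()
```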

It's important to refer to the original research paper for more detailed information on the architectures, training procedures, and specific implementation details to ensure an accurate reproduction of the VASA-1 model.
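For the classifier-free guidance item, a minimal inference-time sketch could look like the following. This is not the VASA-1 code; the model call signature is an assumption, and the unconditional pass assumes the model substitutes learned null embeddings when conditions are dropped:

```python
import torch

@torch.no_grad()
def cfg_denoise(model, x_t, t, audio_feat, controls, guidance_scale=1.5):
    """Classifier-free guidance at inference time (sketch, not the VASA-1 code).
    `controls` bundles the optional signals (main gaze direction, head-to-camera
    distance, emotion offset); the unconditional pass is assumed to accept None
    and substitute learned null embeddings internally."""
    eps_cond = model(x_t, t, audio=audio_feat, **controls)
    eps_uncond = model(x_t, t, audio=None)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```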

samsara-ku commented 3 months ago

@johndpope

Hi! I've been looking around your nice project lately and have a question about it, specifically about the diffusion_transformer.

In the paper, they apply a transformer architecture for their diffusion process and cite some papers like DiT.

In particular, we apply a transformer architecture [55, 36, 50] for our sequence generation task.

So my question is this: is it okay to implement the diffusion_transformer architecture without DiT details like adaLN-Zero? I think the authors may have implemented their own tricks of that kind to get their high-quality results.

I just want to know your opinion about this issue.
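For context, adaLN-Zero in DiT regresses per-block shift/scale/gate parameters from the conditioning vector, with the gates zero-initialized so each block starts as an identity mapping. A minimal sketch of such a block (my own illustration, not anything from VASA-1 or this repo):

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Sketch of a DiT-style adaLN-Zero transformer block (illustrative only).
    The conditioning vector c (e.g. timestep + audio embedding) regresses
    shift/scale/gate parameters; gates are zero-initialized so each block
    starts as an identity mapping."""
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size), nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        # 6 modulation params: shift/scale/gate for the attention and MLP branches.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(hidden_size, 6 * hidden_size))
        nn.init.zeros_(self.ada[-1].weight)
        nn.init.zeros_(self.ada[-1].bias)

    def forward(self, x, c):
        # x: (B, T, hidden_size), c: (B, hidden_size)
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = self.ada(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale_a.unsqueeze(1)) + shift_a.unsqueeze(1)
        x = x + gate_a.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale_m.unsqueeze(1)) + shift_m.unsqueeze(1)
        x = x + gate_m.unsqueeze(1) * self.mlp(h)
        return x
```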

johndpope commented 3 months ago

To be upfront, I'm not 100% confident in the transformer code. I pressed Claude to justify why there's no patching / patchify step in the diffusion transformer, and it was like, "oh, there's no patching mentioned in the VASA paper."

This guy does a good job explaining things - DiT: Scalable Diffusion Models with Transformers

The training code aligns with the paper: diffusion_transformer = DiffusionTransformer(num_layers=6, num_heads=8, hidden_size=512) https://github.com/johndpope/VASA-1-hack/blob/main/Net.py#L157 - and the paper does say that the diffusion transformer is simple.
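For what it's worth, a patch-free, sequence-level diffusion transformer with that constructor signature could look roughly like the sketch below. The motion and audio dimensions are my assumptions, and this is not the actual Net.py code:

```python
import torch
import torch.nn as nn

class SimpleDiffusionTransformer(nn.Module):
    """Hypothetical sketch of a patch-free, sequence-level diffusion transformer
    with the same constructor signature as the repo's DiffusionTransformer
    (num_layers=6, num_heads=8, hidden_size=512). Not the actual Net.py code."""
    def __init__(self, num_layers=6, num_heads=8, hidden_size=512,
                 motion_dim=70, audio_dim=768, max_len=512):  # dims are assumptions
        super().__init__()
        self.in_proj = nn.Linear(motion_dim, hidden_size)     # motion latents, not image patches
        self.audio_proj = nn.Linear(audio_dim, hidden_size)   # per-frame audio features
        self.time_embed = nn.Sequential(nn.Linear(1, hidden_size), nn.SiLU(),
                                        nn.Linear(hidden_size, hidden_size))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, hidden_size))
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads,
                                           dim_feedforward=4 * hidden_size,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(hidden_size, motion_dim)

    def forward(self, x_t, t, audio):
        # x_t: (B, T, motion_dim) noisy motion sequence; audio: (B, T, audio_dim); t: (B,)
        h = self.in_proj(x_t) + self.audio_proj(audio)
        h = h + self.pos_embed[:, : h.size(1)]
        h = h + self.time_embed(t.float().unsqueeze(-1)).unsqueeze(1)
        return self.out_proj(self.encoder(h))
```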

If there's specific code from references [55, 36, 50] - white papers or git repos - chuck the concatenated code into cat *.py > all.py and throw it into the references, and I can ask Claude to align the code to explicitly reference them.

I'm currently looking to boot up https://github.com/johndpope/MegaPortrait/

samsara-ku commented 3 months ago

Thanks for your opinion.

I just wondered why they cite the Diffusion Transformer (DiT), which is a ViT-like transformer, when it seems like they apply a fairly simple architecture.

If I get any new insight from the paper, I'll share it with you.

johndpope commented 3 months ago

Doing some more digging: if I search Google for "diffusion transformer", I get this result: https://github.com/real-stanford/diffusion_policy/blob/main/diffusion_policy/model/diffusion/transformer_for_diffusion.py

While this is more complicated, it does have TransformerEncoderLayer.

That is implemented here, and it contains the QKV projections out of the box: https://github.com/johndpope/VASA-1-hack/blob/main/Net.py#L176C16-L176C39
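Quick check of that "QKV out of the box" point - nn.TransformerEncoderLayer bundles the stacked Q/K/V projection inside its self_attn module:

```python
import torch.nn as nn

# nn.TransformerEncoderLayer carries its own MultiheadAttention, so the QKV
# projections come for free rather than needing a hand-rolled attention layer.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
print(layer.self_attn)                       # MultiheadAttention(...)
print(layer.self_attn.in_proj_weight.shape)  # torch.Size([1536, 512]) -> stacked Q, K, V
```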

johndpope commented 2 months ago

I just realized the reference [55, 36, 50] is DiffPoseTalk https://github.com/DiffPoseTalk/DiffPoseTalk/issues/3 - hopefully they release code soon so I can cherry-pick.

samsara-ku commented 2 months ago

A few days after I asked you that question, I found some information about this problem. Here is my opinion:

  1. The authors likely cited the DiT paper because they have a similar architecture in the sense of a "diffusion network using a transformer". I scanned both the DiT and DiffPoseTalk papers and concluded that VASA-1 is more similar to DiffPoseTalk.

  2. DiffPoseTalk is probably related to the FaceFormer paper, given this clause:

A notable design is an alignment mask between the encoder and the decoder, similar to that in Fan et al. (2022), which ensures proper alignment of the speech and motion modalities.

So I think it is better to start with the FaceFormer code, because there is some delay in releasing the DiffPoseTalk code (and some authors never release code despite a "coming soon" README :( ).
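For anyone starting from FaceFormer, the alignment mask it describes between the audio encoder and the motion decoder can be sketched roughly like this. The exact alignment rule in FaceFormer/DiffPoseTalk may differ; the causal window here is my assumption:

```python
import torch

def alignment_mask(num_motion_frames, num_audio_frames, device=None):
    """Sketch of a FaceFormer-style alignment mask for encoder-decoder cross-attention:
    motion frame i may only attend to audio frames temporally aligned with (or before)
    frame i. Assumes audio features were resampled to an integer number of audio frames
    per motion frame. Returns a boolean mask where True marks disallowed positions, as
    expected by torch.nn.MultiheadAttention's attn_mask / nn.Transformer's memory_mask."""
    k = num_audio_frames // num_motion_frames                                  # audio frames per motion frame (assumption)
    motion_idx = torch.arange(num_motion_frames, device=device).unsqueeze(1)   # (T_motion, 1)
    audio_idx = torch.arange(num_audio_frames, device=device).unsqueeze(0)     # (1, T_audio)
    return audio_idx >= (motion_idx + 1) * k                                   # (T_motion, T_audio): True = masked out

# Example: 4 motion frames, 8 audio frames -> each motion frame sees 2 more audio frames.
print(alignment_mask(4, 8).int())
```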

johndpope commented 2 months ago

FYI - I created a new repo for MegaPortrait: https://github.com/johndpope/MegaPortrait-hack

The code was originally forked from Kevin Fringe, but as it's progressed it has become a complete rebuild.