this project needs a complete rewrite - in progress....

johndpope commented 5 days ago

https://github.com/neeek2303/EMOPortraits/issues/28

Screenshot from 2024-11-20 10-57-03

This is 1 frame of the 50 in the frame window. It's 1.2 GB per window. (This is the volumetric disentanglement - then with the identity embedding it will switch.)

Screenshot from 2024-11-20 13-50-53

Screenshot from 2024-11-20 14-08-49

I rebased the work around EMOPortraits. The above is the 96x16x64x64 volume - the data is huge. I was attempting to do the volumetric feature extraction from EMO on the fly, but kept running into OOM problems on a 3090.

I have since separated this out into complete saved-out window frames, to bypass loading gbase / the volumetric avatar.

This might make it trainable on a 24 GB GPU.

If anyone has ideas on tensor compression (for the volumes in the images above): https://gist.github.com/johndpope/7b3b437c90352b16548f57c679538d94
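One cheap baseline to measure any learned compression against is a float16 downcast followed by plain deflate. The sketch below uses only the standard library and a made-up sparse slice as a stand-in. Whether the real volumes are sparse enough for deflate to help is an assumption, and the helper names are hypothetical.

```python
import random
import struct
import zlib


def pack_float16_zlib(values, level=6):
    """Downcast floats to IEEE half precision, then deflate the raw bytes."""
    raw = struct.pack(f"<{len(values)}e", *values)  # 'e' = 2-byte float16
    return zlib.compress(raw, level)


def unpack_float16_zlib(blob, count):
    """Inverse of pack_float16_zlib; returns a list of Python floats."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{count}e", raw))


# Hypothetical stand-in for one 64x64 slice: mostly zeros, a few activations.
# Real feature volumes may look nothing like this.
random.seed(0)
slice_values = [random.random() if random.random() < 0.1 else 0.0
                for _ in range(64 * 64)]

blob = pack_float16_zlib(slice_values)
restored = unpack_float16_zlib(blob, len(slice_values))

float32_bytes = len(slice_values) * 4
max_error = max(abs(a - b) for a, b in zip(slice_values, restored))
print(f"float32: {float32_bytes} B, float16+zlib: {len(blob)} B")
print(f"max abs error after round trip: {max_error:.5f}")
```

The downcast is lossy (the printed error shows by how much), so it only makes sense if the downstream model tolerates half precision.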

I have 4 videos of 1-2 MB each, and the extracted features are taking up 151 GB. This is untenable.

Then I have to jam it through the diffusion transformer ENCODER (with conditions).

Screenshot from 2024-11-20 14-21-31

sultanchamberlain commented 5 days ago

if the output of your EMO is 16GB, something is wrong because

Screenshot 2024-11-19 195808

it's supposed to only be 255MB

johndpope commented 5 days ago

(Preface: I'm using Claude all the time and I get sensible answers.) But yeah, bizarrely, I quizzed Claude on why it is 1.2 GB (after thinking it was similar to your answer, and smaller) and it said: oh yeah, sorry, I was wrong...

running python - the calculation is 10x on that..... (again we're talking 50 frames / 1.6 seconds)


size = 1 * 50 * 96 * 16 * 64 * 64  # number of elements
bytes_float32 = size * 4  # each float32 is 4 bytes
bytes_float16 = size * 2  # each float16 is 2 bytes

print(f"Number of elements: {size:,}")
print(f"Size in float32: {bytes_float32/1e9:.2f} GB")
print(f"Size in float16: {bytes_float16/1e9:.2f} GB")

Number of elements: 314,572,800
Size in float32: 1.26 GB
Size in float16: 0.63 GB

I'm uploading video 3 (saved from the VASA website, 1 MB), which inflates to 18.9 GB - what a joke.... 4 hrs left....
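To put the 18.9 GB in context, here is a minimal sketch that turns the per-frame volume shape from this thread into bytes per frame and then asks how many frame volumes a given file size can hold. It assumes float32 storage with no container overhead and no compression. The function names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4
VOLUME_SHAPE = (96, 16, 64, 64)  # per-frame feature volume quoted in the thread


def bytes_per_frame(shape=VOLUME_SHAPE, itemsize=BYTES_PER_FLOAT32):
    """Raw size of one frame's volume in bytes."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize


def frames_for_bytes(total_bytes, shape=VOLUME_SHAPE, itemsize=BYTES_PER_FLOAT32):
    """How many frame volumes fit in total_bytes (fractional)."""
    return total_bytes / bytes_per_frame(shape, itemsize)


per_frame = bytes_per_frame()
frames = frames_for_bytes(18.9e9)
print(f"per frame: {per_frame / 1e6:.1f} MB")
print(f"18.9 GB holds about {frames:.0f} frame volumes")
```

That works out to about 25 MB per frame and roughly 750 stored frame volumes. Whether that matches the clip depends on its length and frame rate, which aren't stated here, and on how many frames are duplicated across saved windows.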

ChatGPT (preview) suggested some compression on the h5 file (down to 250 MB) - https://gist.github.com/johndpope/ecf1ec0722c5b7a8c584b5cfa8f07658

my new code based off emo - (unreleased)


INFO     Initialized TensorMemoryManager singleton                         mem.py:416
           INFO     Initialized VASA with context_size=10                      vasa_model.py:821
           INFO     Initializing HolisticMotionTransformer...                  vasa_model.py:390
           INFO                                                                vasa_model.py:400
                    Temporal Dimensions:                                                        
           INFO       Window size (T): 50                                      vasa_model.py:401
           INFO       Context size (K): 10                                     vasa_model.py:402
           INFO       Total temporal dim: 60                                   vasa_model.py:403
           INFO                                                                vasa_model.py:418
                    Model Dimensions:                                                           
           INFO       Projection dim: 512                                      vasa_model.py:419
           INFO       Transformer dim: 512                                     vasa_model.py:420
           INFO                                                                vasa_model.py:110
                    Initialized EfficientConditionEmbedding:                                    
           INFO       Model dimension: 512                                     vasa_model.py:111
           INFO       Max sequence length: 60                                  vasa_model.py:112
           INFO       Channel layout: {'audio': (0, 256), 'motion': (256,      vasa_model.py:113
                    384), 'gaze': (384, 386), 'distance': (386, 387),                           
                    'emotion': (387, 389), 'speed': (389, 390), 'learned':                      
                    (390, 512)}                                                                 
           INFO       Learned features size: 122                               vasa_model.py:114
           INFO       Total channels: 512                                      vasa_model.py:115
           INFO                                                                vasa_model.py:702
                    === Parameter Count Analysis ===                                            

╭────────────────────────────────── Parameter Count Analysis ──────────────────────────────────╮
│               Model Parameter Summary                                                        │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓                                          │
│ ┃ Component             ┃ Parameters ┃ % of Total ┃                                          │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩                                          │
│ │ encoder               │     27.33M │      67.2% │                                          │
│ │ source_proj           │      4.20M │      10.3% │                                          │
│ │ volume_proj           │      3.85M │       9.5% │                                          │
│ │ time_embed            │      2.10M │       5.2% │                                          │
│ │ output_proj           │      2.10M │       5.2% │                                          │
│ │ cond_embed            │    574.36K │       1.4% │                                          │
│ │ pre_transformer_proj  │    262.66K │       0.6% │                                          │
│ │ post_transformer_proj │    262.66K │       0.6% │                                          │
│ │ pos_embed             │      1.02K │       0.0% │                                          │
│ │ Total                 │     40.67M │     100.0% │                                          │
│ └───────────────────────┴────────────┴────────────┘                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯

Trainable Parameters: 40.67M (100.0%)

                      cond_embed Detailed Breakdown                      
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer                           ┃ Shape      ┃ Parameters ┃ Trainable ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ cond_embed.audio_proj.0.weight  │ (512, 768) │    393.22K │ ✓         │
│ cond_embed.audio_proj.0.bias    │ (512,)     │        512 │ ✓         │
│ cond_embed.audio_proj.2.weight  │ (256, 512) │    131.07K │ ✓         │
│ cond_embed.audio_proj.2.bias    │ (256,)     │        256 │ ✓         │
│ cond_embed.motion_proj.0.weight │ (128, 384) │     49.15K │ ✓         │
│ cond_embed.motion_proj.0.bias   │ (128,)     │        128 │ ✓         │
│ cond_embed.gaze_norm.weight     │ (2,)       │          2 │ ✓         │
│ cond_embed.gaze_norm.bias       │ (2,)       │          2 │ ✓         │
│ cond_embed.distance_norm.weight │ (1,)       │          1 │ ✓         │
│ cond_embed.distance_norm.bias   │ (1,)       │          1 │ ✓         │
│ cond_embed.emotion_norm.weight  │ (2,)       │          2 │ ✓         │
│ cond_embed.emotion_norm.bias    │ (2,)       │          2 │ ✓         │
│ cond_embed.speed_proj.weight    │ (9, 1)     │          9 │ ✓         │
└─────────────────────────────────┴────────────┴────────────┴───────────┘

                     pos_embed Detailed Breakdown                      
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer                        ┃ Shape       ┃ Parameters ┃ Trainable ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ pos_embed.context_embedding  │ (1, 1, 512) │        512 │ ✓         │
│ pos_embed.sequence_embedding │ (1, 1, 512) │        512 │ ✓         │
└──────────────────────────────┴─────────────┴────────────┴───────────┘

                    volume_proj Detailed Breakdown                     
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer                ┃ Shape               ┃ Parameters ┃ Trainable ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ volume_proj.0.weight │ (128, 96, 1, 1, 1)  │     12.29K │ ✓         │
│ volume_proj.0.bias   │ (128,)              │        128 │ ✓         │
│ volume_proj.1.weight │ (128,)              │        128 │ ✓         │
│ volume_proj.1.bias   │ (128,)              │        128 │ ✓         │
│ volume_proj.3.weight │ (256, 128, 1, 3, 3) │    294.91K │ ✓         │
│ volume_proj.3.bias   │ (256,)              │        256 │ ✓         │
│ volume_proj.4.weight │ (256,)              │        256 │ ✓         │
│ volume_proj.4.bias   │ (256,)              │        256 │ ✓         │
│ volume_proj.6.weight │ (512, 256, 3, 3, 3) │      3.54M │ ✓         │
│ volume_proj.6.bias   │ (512,)              │        512 │ ✓         │
│ volume_proj.7.weight │ (512,)              │        512 │ ✓         │
│ volume_proj.7.bias   │ (512,)              │        512 │ ✓         │
└──────────────────────┴─────────────────────┴────────────┴───────────┘

                source_proj Detailed Breakdown                 
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer                ┃ Shape       ┃ Parameters ┃ Trainable ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ source_proj.1.weight │ (512, 8192) │      4.19M │ ✓         │
│ source_proj.1.bias   │ (512,)      │        512 │ ✓         │
│ source_proj.2.weight │ (512,)      │        512 │ ✓         │
│ source_proj.2.bias   │ (512,)      │        512 │ ✓         │
└──────────────────────┴─────────────┴────────────┴───────────┘

                time_embed Detailed Breakdown                 
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer               ┃ Shape       ┃ Parameters ┃ Trainable ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ time_embed.0.weight │ (2048, 512) │      1.05M │ ✓         │
│ time_embed.0.bias   │ (2048,)     │      2.05K │ ✓         │
│ time_embed.2.weight │ (512, 2048) │      1.05M │ ✓         │
│ time_embed.2.bias   │ (512,)      │        512 │ ✓         │
└─────────────────────┴─────────────┴────────────┴───────────┘

               pre_transformer_proj Detailed Breakdown               
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer                       ┃ Shape      ┃ Parameters ┃ Trainable ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ pre_transformer_proj.weight │ (512, 512) │    262.14K │ ✓         │
│ pre_transformer_proj.bias   │ (512,)     │        512 │ ✓         │
└─────────────────────────────┴────────────┴────────────┴───────────┘

                       encoder Detailed Breakdown                        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer                          ┃ Shape       ┃ Parameters ┃ Trainable ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ encoder.0.norm1.weight         │ (512,)      │        512 │ ✓         │
│ encoder.0.norm1.bias           │ (512,)      │        512 │ ✓         │
│ encoder.0.attn.in_proj_weight  │ (1536, 512) │    786.43K │ ✓         │
│ encoder.0.attn.in_proj_bias    │ (1536,)     │      1.54K │ ✓         │
│ encoder.0.attn.out_proj.weight │ (512, 512)  │    262.14K │ ✓         │
│ encoder.0.attn.out_proj.bias   │ (512,)      │        512 │ ✓         │
│ encoder.0.norm2.weight         │ (512,)      │        512 │ ✓         │
│ encoder.0.norm2.bias           │ (512,)      │        512 │ ✓         │
│ encoder.0.mlp.0.weight         │ (2048, 512) │      1.05M │ ✓         │
│ encoder.0.mlp.0.bias           │ (2048,)     │      2.05K │ ✓         │
│ encoder.0.mlp.3.weight         │ (512, 2048) │      1.05M │ ✓         │
│ encoder.0.mlp.3.bias           │ (512,)      │        512 │ ✓         │
│ encoder.0.cond_norm.weight     │ (512,)      │        512 │ ✓         │
│ encoder.0.cond_norm.bias       │ (512,)      │        512 │ ✓         │
│ encoder.0.cond_proj.weight     │ (512, 512)  │    262.14K │ ✓         │
│ encoder.0.cond_proj.bias       │ (512,)      │        512 │ ✓         │
│ encoder.1.norm1.weight         │ (512,)      │        512 │ ✓         │
│ encoder.1.norm1.bias           │ (512,)      │        512 │ ✓         │
│ encoder.1.attn.in_proj_weight  │ (1536, 512) │    786.43K │ ✓         │
│ encoder.1.attn.in_proj_bias    │ (1536,)     │      1.54K │ ✓         │
│ encoder.1.attn.out_proj.weight │ (512, 512)  │    262.14K │ ✓         │
│ encoder.1.attn.out_proj.bias   │ (512,)      │        512 │ ✓         │
│ encoder.1.norm2.weight         │ (512,)      │        512 │ ✓         │
│ encoder.1.norm2.bias           │ (512,)      │        512 │ ✓         │
│ encoder.1.mlp.0.weight         │ (2048, 512) │      1.05M │ ✓         │
│ encoder.1.mlp.0.bias           │ (2048,)     │      2.05K │ ✓         │
│ encoder.1.mlp.3.weight         │ (512, 2048) │      1.05M │ ✓         │
│ encoder.1.mlp.3.bias           │ (512,)      │        512 │ ✓         │
│ encoder.1.cond_norm.weight     │ (512,)      │        512 │ ✓         │
│ encoder.1.cond_norm.bias       │ (512,)      │        512 │ ✓         │
│ encoder.1.cond_proj.weight     │ (512, 512)  │    262.14K │ ✓         │
│ encoder.1.cond_proj.bias       │ (512,)      │        512 │ ✓         │
│ encoder.2.norm1.weight         │ (512,)      │        512 │ ✓         │
│ encoder.2.norm1.bias           │ (512,)      │        512 │ ✓         │
│ encoder.2.attn.in_proj_weight  │ (1536, 512) │    786.43K │ ✓         │
│ encoder.2.attn.in_proj_bias    │ (1536,)     │      1.54K │ ✓         │
│ encoder.2.attn.out_proj.weight │ (512, 512)  │    262.14K │ ✓         │
│ encoder.2.attn.out_proj.bias   │ (512,)      │        512 │ ✓         │
│ encoder.2.norm2.weight         │ (512,)      │        512 │ ✓         │
│ encoder.2.norm2.bias           │ (512,)      │        512 │ ✓         │
│ encoder.2.mlp.0.weight         │ (2048, 512) │      1.05M │ ✓         │
│ encoder.2.mlp.0.bias           │ (2048,)     │      2.05K │ ✓         │
│ encoder.2.mlp.3.weight         │ (512, 2048) │      1.05M │ ✓         │
│ encoder.2.mlp.3.bias           │ (512,)      │        512 │ ✓         │
│ encoder.2.cond_norm.weight     │ (512,)      │        512 │ ✓         │
│ encoder.2.cond_norm.bias       │ (512,)      │        512 │ ✓         │
│ encoder.2.cond_proj.weight     │ (512, 512)  │    262.14K │ ✓         │
│ encoder.2.cond_proj.bias       │ (512,)      │        512 │ ✓         │
│ encoder.3.norm1.weight         │ (512,)      │        512 │ ✓         │
│ encoder.3.norm1.bias           │ (512,)      │        512 │ ✓         │
│ encoder.3.attn.in_proj_weight  │ (1536, 512) │    786.43K │ ✓         │
│ encoder.3.attn.in_proj_bias    │ (1536,)     │      1.54K │ ✓         │
│ encoder.3.attn.out_proj.weight │ (512, 512)  │    262.14K │ ✓         │
│ encoder.3.attn.out_proj.bias   │ (512,)      │        512 │ ✓         │
│ encoder.3.norm2.weight         │ (512,)      │        512 │ ✓         │
│ encoder.3.norm2.bias           │ (512,)      │        512 │ ✓         │
│ encoder.3.mlp.0.weight         │ (2048, 512) │      1.05M │ ✓         │
│ encoder.3.mlp.0.bias           │ (2048,)     │      2.05K │ ✓         │
│ encoder.3.mlp.3.weight         │ (512, 2048) │      1.05M │ ✓         │
│ encoder.3.mlp.3.bias           │ (512,)      │        512 │ ✓         │
│ encoder.3.cond_norm.weight     │ (512,)      │        512 │ ✓         │
│ encoder.3.cond_norm.bias       │ (512,)      │        512 │ ✓         │
│ encoder.3.cond_proj.weight     │ (512, 512)  │    262.14K │ ✓         │
│ encoder.3.cond_proj.bias       │ (512,)      │        512 │ ✓         │
│ encoder.4.norm1.weight         │ (512,)      │        512 │ ✓         │
│ encoder.4.norm1.bias           │ (512,)      │        512 │ ✓         │
│ encoder.4.attn.in_proj_weight  │ (1536, 512) │    786.43K │ ✓         │
│ encoder.4.attn.in_proj_bias    │ (1536,)     │      1.54K │ ✓         │
│ encoder.4.attn.out_proj.weight │ (512, 512)  │    262.14K │ ✓         │
│ encoder.4.attn.out_proj.bias   │ (512,)      │        512 │ ✓         │
│ encoder.4.norm2.weight         │ (512,)      │        512 │ ✓         │
│ encoder.4.norm2.bias           │ (512,)      │        512 │ ✓         │
│ encoder.4.mlp.0.weight         │ (2048, 512) │      1.05M │ ✓         │
│ encoder.4.mlp.0.bias           │ (2048,)     │      2.05K │ ✓         │
│ encoder.4.mlp.3.weight         │ (512, 2048) │      1.05M │ ✓         │
│ encoder.4.mlp.3.bias           │ (512,)      │        512 │ ✓         │
│ encoder.4.cond_norm.weight     │ (512,)      │        512 │ ✓         │
│ encoder.4.cond_norm.bias       │ (512,)      │        512 │ ✓         │
│ encoder.4.cond_proj.weight     │ (512, 512)  │    262.14K │ ✓         │
│ encoder.4.cond_proj.bias       │ (512,)      │        512 │ ✓         │
│ encoder.5.norm1.weight         │ (512,)      │        512 │ ✓         │
│ encoder.5.norm1.bias           │ (512,)      │        512 │ ✓         │
│ encoder.5.attn.in_proj_weight  │ (1536, 512) │    786.43K │ ✓         │
│ encoder.5.attn.in_proj_bias    │ (1536,)     │      1.54K │ ✓         │
│ encoder.5.attn.out_proj.weight │ (512, 512)  │    262.14K │ ✓         │
│ encoder.5.attn.out_proj.bias   │ (512,)      │        512 │ ✓         │
│ encoder.5.norm2.weight         │ (512,)      │        512 │ ✓         │
│ encoder.5.norm2.bias           │ (512,)      │        512 │ ✓         │
│ encoder.5.mlp.0.weight         │ (2048, 512) │      1.05M │ ✓         │
│ encoder.5.mlp.0.bias           │ (2048,)     │      2.05K │ ✓         │
│ encoder.5.mlp.3.weight         │ (512, 2048) │      1.05M │ ✓         │
│ encoder.5.mlp.3.bias           │ (512,)      │        512 │ ✓         │
│ encoder.5.cond_norm.weight     │ (512,)      │        512 │ ✓         │
│ encoder.5.cond_norm.bias       │ (512,)      │        512 │ ✓         │
│ encoder.5.cond_proj.weight     │ (512, 512)  │    262.14K │ ✓         │
│ encoder.5.cond_proj.bias       │ (512,)      │        512 │ ✓         │
│ encoder.6.norm1.weight         │ (512,)      │        512 │ ✓         │
│ encoder.6.norm1.bias           │ (512,)      │        512 │ ✓         │
│ encoder.6.attn.in_proj_weight  │ (1536, 512) │    786.43K │ ✓         │
│ encoder.6.attn.in_proj_bias    │ (1536,)     │      1.54K │ ✓         │
│ encoder.6.attn.out_proj.weight │ (512, 512)  │    262.14K │ ✓         │
│ encoder.6.attn.out_proj.bias   │ (512,)      │        512 │ ✓         │
│ encoder.6.norm2.weight         │ (512,)      │        512 │ ✓         │
│ encoder.6.norm2.bias           │ (512,)      │        512 │ ✓         │
│ encoder.6.mlp.0.weight         │ (2048, 512) │      1.05M │ ✓         │
│ encoder.6.mlp.0.bias           │ (2048,)     │      2.05K │ ✓         │
│ encoder.6.mlp.3.weight         │ (512, 2048) │      1.05M │ ✓         │
│ encoder.6.mlp.3.bias           │ (512,)      │        512 │ ✓         │
│ encoder.6.cond_norm.weight     │ (512,)      │        512 │ ✓         │
│ encoder.6.cond_norm.bias       │ (512,)      │        512 │ ✓         │
│ encoder.6.cond_proj.weight     │ (512, 512)  │    262.14K │ ✓         │
│ encoder.6.cond_proj.bias       │ (512,)      │        512 │ ✓         │
│ encoder.7.norm1.weight         │ (512,)      │        512 │ ✓         │
│ encoder.7.norm1.bias           │ (512,)      │        512 │ ✓         │
│ encoder.7.attn.in_proj_weight  │ (1536, 512) │    786.43K │ ✓         │
│ encoder.7.attn.in_proj_bias    │ (1536,)     │      1.54K │ ✓         │
│ encoder.7.attn.out_proj.weight │ (512, 512)  │    262.14K │ ✓         │
│ encoder.7.attn.out_proj.bias   │ (512,)      │        512 │ ✓         │
│ encoder.7.norm2.weight         │ (512,)      │        512 │ ✓         │
│ encoder.7.norm2.bias           │ (512,)      │        512 │ ✓         │
│ encoder.7.mlp.0.weight         │ (2048, 512) │      1.05M │ ✓         │
│ encoder.7.mlp.0.bias           │ (2048,)     │      2.05K │ ✓         │
│ encoder.7.mlp.3.weight         │ (512, 2048) │      1.05M │ ✓         │
│ encoder.7.mlp.3.bias           │ (512,)      │        512 │ ✓         │
│ encoder.7.cond_norm.weight     │ (512,)      │        512 │ ✓         │
│ encoder.7.cond_norm.bias       │ (512,)      │        512 │ ✓         │
│ encoder.7.cond_proj.weight     │ (512, 512)  │    262.14K │ ✓         │
│ encoder.7.cond_proj.bias       │ (512,)      │        512 │ ✓         │
└────────────────────────────────┴─────────────┴────────────┴───────────┘

               post_transformer_proj Detailed Breakdown               
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer                        ┃ Shape      ┃ Parameters ┃ Trainable ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ post_transformer_proj.weight │ (512, 512) │    262.14K │ ✓         │
│ post_transformer_proj.bias   │ (512,)     │        512 │ ✓         │
└──────────────────────────────┴────────────┴────────────┴───────────┘

                 output_proj Detailed Breakdown                 
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer                ┃ Shape        ┃ Parameters ┃ Trainable ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ output_proj.0.weight │ (1024, 512)  │    524.29K │ ✓         │
│ output_proj.0.bias   │ (1024,)      │      1.02K │ ✓         │
│ output_proj.2.weight │ (1536, 1024) │      1.57M │ ✓         │
│ output_proj.2.bias   │ (1536,)      │      1.54K │ ✓         │
└──────────────────────┴──────────────┴────────────┴───────────┘

Parameter Count Validation:
Target: 29.00M
Actual: 40.67M
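As a cross-check on the tables above, the sketch below recomputes the per-component totals from the listed shapes. The shapes are copied from the breakdown; the eight encoder blocks are assumed identical, which is what the `encoder.0` to `encoder.7` rows show.

```python
from math import prod

D, FF = 512, 2048  # transformer width and MLP width from the tables

# One encoder block, shapes copied from the encoder.N.* rows.
ENCODER_BLOCK = [
    (D,), (D,),              # norm1 weight, bias
    (3 * D, D), (3 * D,),    # attn.in_proj weight, bias
    (D, D), (D,),            # attn.out_proj
    (D,), (D,),              # norm2
    (FF, D), (FF,),          # mlp.0
    (D, FF), (D,),           # mlp.3
    (D,), (D,),              # cond_norm
    (D, D), (D,),            # cond_proj
]

COMPONENTS = {
    "encoder": ENCODER_BLOCK * 8,
    "source_proj": [(512, 8192), (512,), (512,), (512,)],
    "volume_proj": [(128, 96, 1, 1, 1), (128,), (128,), (128,),
                    (256, 128, 1, 3, 3), (256,), (256,), (256,),
                    (512, 256, 3, 3, 3), (512,), (512,), (512,)],
    "time_embed": [(2048, 512), (2048,), (512, 2048), (512,)],
    "output_proj": [(1024, 512), (1024,), (1536, 1024), (1536,)],
    "cond_embed": [(512, 768), (512,), (256, 512), (256,), (128, 384), (128,),
                   (2,), (2,), (1,), (1,), (2,), (2,), (9, 1)],
    "pre_transformer_proj": [(512, 512), (512,)],
    "post_transformer_proj": [(512, 512), (512,)],
    "pos_embed": [(1, 1, 512), (1, 1, 512)],
}

counts = {name: sum(prod(shape) for shape in shapes)
          for name, shapes in COMPONENTS.items()}
total = sum(counts.values())

for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:<22} {n:>12,} {100 * n / total:5.1f}%")
print(f"{'total':<22} {total:>12,}")
```

The recomputed totals agree with the summary table. For reference, the gap to the 29.00M target is about 11.67M parameters, and one encoder block is about 3.42M.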

sultanchamberlain commented 5 days ago

Yes, you’re correct. Claude incorrectly calculated the bytes, and it should be 1.2 GB @ float32.

“ running python - the calculation is 10x on that..... (again we're talking 50 frames / 1.6 seconds)” What does this mean? I thought 50 frames in (50 x 96 x 16 x 64 x 64) implied 1 minute video @ 25 fps

Update: I meant I thought 50 frames in (50 x 96 x 16 x 64 x 64) implied a 2 second video @ 25 fps.

The bigger question is: are you passing the whole 1.6 second video into the encoder?

johndpope commented 5 days ago

50 frames + 10 previous frames = window.
Then overlapping by 25 for the next window.

The first image above, x50, is what gets ingested into the transformer.
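The windowing described here can be sketched as below. This is one reading of it, not the project's actual code: "overlapping by 25" is taken to mean a stride of 25 frames, the first window gets no context (clamped at frame 0), and a trailing partial window is dropped. The function name is hypothetical.

```python
def window_spans(num_frames, window=50, context=10, overlap=25):
    """Yield (context_start, window_start, window_end) frame indices, end exclusive."""
    stride = window - overlap
    start = 0
    while start + window <= num_frames:
        # Context is the `context` frames just before the window, clamped at 0.
        yield max(0, start - context), start, start + window
        start += stride


spans = list(window_spans(150))
for ctx_start, win_start, win_end in spans:
    print(f"context {ctx_start}-{win_start}, window {win_start}-{win_end}")

# Duration of a 50-frame window at two common frame rates.
for fps in (25, 30):
    print(f"50 frames at {fps} fps = {50 / fps:.2f} s")
```

With a stride of 25, every frame past the first 25 lands in two windows, so saving each window out whole roughly doubles the storage. The duration lines also show that 50 frames is 2.00 s at 25 fps and 1.67 s at 30 fps, so the "1.6 seconds" figure would fit a 30 fps source, though the frame rate is not confirmed in this thread.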

sultanchamberlain commented 5 days ago

Hmm, I could be wrong, but I thought the window is only 60 consecutive frames @ 30 fps. I don’t think they pass in more than that. This is what I understand from the window in Wav2Lip and other papers, where the window was usually 5 frames @ 25 fps.

I will reread VASA with this in mind

johndpope commented 9 hours ago

UPDATE - I'm a tiny bit blocked on stage 2 training with 24 GB of GPU memory. I can spin up a 48 GB video card on Vertex, but it's a bit of a hassle turning it on / off and moving data - https://github.com/johndpope/vertex-jumpstart

This is the 1 MB video, inflated to 18 GB when saved out (not helpful), but these are the stage 2 dataset windows ready to be ingested: https://drive.google.com/drive/u/0/folders/1pw4mFmrIhjpaySNEkGkZoOvdpiUK1TjZ

I took 10 videos (~10 MB) from VASA https://www.microsoft.com/en-us/research/project/vasa-1/ and ran them through feature extraction, getting all the volumetric embeddings (huge), head pose, eye gaze and identity embedding (tiny) - and it's 344 GB. I did this to ease memory pressure on the GPU, as it keeps blowing up with OOM....

If I deck out my workstation with another GPU, it could maybe accommodate concurrently getting the volumetric data into the diffusion transformer on the fly - a 1.2 GB window (batch size 1).

I have a couple of ideas for compression that I'm currently exploring in code I'm yet to push:

https://huggingface.co/microsoft/Reducio-VAE - has crazy compression for short time frames. It wasn't built for 16-dim x 64x64 dimensions, but I drafted some code up and it's training, though not converging quickly.... I'll leave it training overnight: https://wandb.ai/snoozie/temporal-reducio

It's possible I could use IMF and tweak it so that instead of producing 2D images it produces a 3D tensor. It's another Microsoft algorithm, and frankly offers the best compression for motion in the world that I'm aware of. https://github.com/johndpope/IMF/blob/main/vit.py

It could compress the 1.2 GB by a factor of at least 1,000. The decoder would then need to inflate it and pass it into the transformer... maybe just suitable for stage 2 training.
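For a sense of scale, this is what a 1,000x ratio would leave per window, taking the figure at face value and assuming the latent is also stored as float32:

```python
WINDOW_ELEMENTS = 50 * 96 * 16 * 64 * 64  # float32 values in one 50-frame window
RATIO = 1000                               # "at least 1,000", taken at face value

latent_values = WINDOW_ELEMENTS // RATIO
print(f"latent float32 values per window: {latent_values:,}")
print(f"per frame: {latent_values // 50:,}")
print(f"window size after compression: {latent_values * 4 / 1e6:.2f} MB")
```

So the budget is on the order of six thousand values per frame. Whether IMF-style latents of that size keep enough of the volume to be useful is the open question.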

I don't clearly understand what's so important in the 16 slices, or whether they could be collapsed with a VQ encoder.

UPDATE - somewhat working with Reducio... (this is compressing the 16x64x64 volumetric block, of which there are 96 per frame, across 50 / 60 frames)

https://wandb.ai/snoozie/temporal-reducio/runs/ei03adr8?nw=nwusersnoozie

johndpope / VASA-1-hack

this project needs a complete rewrite - in progress.... #28