hugofloresgarcia / vampnet

music generation with masked transformers!
https://hugo-does-things.notion.site/VampNet-Music-Generation-via-Masked-Acoustic-Token-Modeling-e37aabd0d5f1493aa42c5711d0764b33?pvs=4
MIT License

Questions on training convergence #19

Closed sukun1045 closed 7 months ago

sukun1045 commented 10 months ago

Hi, thanks for open-sourcing the code. I am trying to use a similar training approach for the codes from EnCodec. May I ask how quickly the model converges to generating reasonable music, in terms of training steps, and what the final loss value would be? Because training involves heavy masking, the loss seems much higher than with an autoregressive approach. Also, have you tried training smaller models?

Thank you,

Kun

hugofloresgarcia commented 10 months ago

Hi!! Apologies for the late reply, have been out the past week due to ISMIR.

May I ask how fast the model converges to generate reasonable music regarding the training steps and what would be the final loss value?

Per the paper, we trained it for 1M steps to reach the released checkpoints, but anecdotally I'd say you'd start getting reasonable sounds after 600k steps or a bit earlier. I believe the final loss value might depend on the dataset, but in my experience it was getting close to 5.0.

Because the training contains lots of masks, the loss seems much higher than the autoregressive approach during training.

Yes, due to the random masking, you'll likely get a noisy loss unless you can afford a huge batch size!
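For intuition on why the masked objective reads high and noisy: the loss is averaged only over the masked positions, and with a random mask ratio the set of positions contributing to each step changes every batch. A minimal NumPy sketch (the 1024-entry vocabulary and 0.8 mask ratio are illustrative assumptions, not VampNet's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 1024  # illustrative codebook size, not VampNet's actual config

def masked_ce_loss(logits, targets, mask):
    """Cross-entropy averaged only over the masked positions."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return nll[mask].mean()

# A uniform (untrained) model scores ln(vocab) on every token, so
# ln(1024) ≈ 6.93 is the natural starting point for this loss scale.
T = 100
logits = np.zeros((T, vocab))
targets = rng.integers(0, vocab, size=T)
mask = rng.random(T) < 0.8  # random mask; in training the ratio itself varies
print(round(float(masked_ce_loss(logits, targets, mask)), 2))  # 6.93
```

Since each step averages over a different, randomly sized subset of positions, the per-step loss is a higher-variance estimate than a dense autoregressive loss over every token, which is why a large batch size smooths it out.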

Besides, have you tried training models with smaller sizes?

Working on this right now, but I'm not quite there yet! You can track my progress here: https://github.com/hugofloresgarcia/vampnet/tree/hf/x-tfm

sukun1045 commented 10 months ago

Thanks for your reply! As you mentioned in the paper, the 1M steps are for the coarse model, and 500k steps are for the coarse-to-fine model. Does this mean the coarse-to-fine model converges faster? Do you also get a similar final loss for the coarse-to-fine model? Have you ever tried training a single-stage model for all codebooks? Sorry for so many questions; I am looking forward to your responses.

tig3rmast3r commented 9 months ago

I can share results after 300 epochs. (I calculated an epoch as number of samples / batch size = number of iterations per epoch; not sure if that's the right calculation. In any case, my fine-tune dataset contained 612 samples, so 612/4 = 153 iterations per epoch.)
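That epoch arithmetic works out consistently with the iteration count in the logs below:

```python
num_samples = 612   # size of the fine-tune dataset
batch_size = 4

iters_per_epoch = num_samples // batch_size  # 612 / 4 = 153
total_iters = 45900                          # final iteration shown in the logs
epochs = total_iters / iters_per_epoch

print(iters_per_epoch, epochs)  # 153 300.0
```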

coarse

      +---------------------------------------------------------------------+ decorators.py:220
       |                           Iteration 45899                           |                  
       +---------------------------------------------------------------------+                  
                                        train                                                   
       +---------------------------------------------------------------------+                  
       | key                                   | value        | mean         |                  
       |---------------------------------------+--------------+--------------|                  
       | accuracy-0-0.5/top1/masked            |   0.086351   |   0.104087   |                  
       | accuracy-0-0.5/top1/unmasked          |   0.099315   |   0.155171   |                  
       | accuracy-0-0.5/top25/masked           |   0.376973   |   0.405329   |                  
       | accuracy-0-0.5/top25/unmasked         |   0.482877   |   0.604361   |                  
       | accuracy-0.5-1.0/top1/masked          |   0.226596   |   0.284225   |                  
       | accuracy-0.5-1.0/top1/unmasked        |   0.142279   |   0.167416   |                  
       | accuracy-0.5-1.0/top25/masked         |   0.647872   |   0.732513   |                  
       | accuracy-0.5-1.0/top25/unmasked       |   0.624632   |   0.666140   |                  
       | loss                                  |   5.299645   |   5.089274   |                  
       | other/batch_size                      |   4.000000   |   4.000000   |                  
       | other/grad_norm                       |   0.157020   |   0.119691   |                  
       | other/learning_rate                   |   0.000261   |   0.000263   |                  
       | time/train_loop                       |   0.263031   |   0.260053   |                  
       +---------------------------------------------------------------------+                  
                                         val                                                    
       +---------------------------------------------------------------------+                  
       | key                                   | value        | mean         |                  
       |---------------------------------------+--------------+--------------|                  
       | loss                                  |   4.796672   |   5.030900   |                  
       | accuracy-0-0.5/top1/unmasked          |   0.216292   |   0.169721   |                  
       | accuracy-0-0.5/top1/masked            |   0.102733   |   0.114289   |                  
       | accuracy-0-0.5/top25/unmasked         |   0.713483   |   0.619580   |                  
       | accuracy-0-0.5/top25/masked           |   0.400801   |   0.412465   |                  
       | accuracy-0.5-1.0/top1/unmasked        |   0.219848   |   0.172446   |                  
       | accuracy-0.5-1.0/top1/masked          |   0.448613   |   0.304225   |                  
       | accuracy-0.5-1.0/top25/unmasked       |   0.717494   |   0.673347   |                  
       | accuracy-0.5-1.0/top25/masked         |   0.882545   |   0.747402   |                  
       | time/val_loop                         |   0.198203   |   0.208812   |                  
       +---------------------------------------------------------------------+                  
         Iteration (train) 45900/45900 --------------------- 5:08:20 / 0:00:00                 

c2f

      +---------------------------------------------------------------------+ decorators.py:220
       |                           Iteration 45899                           |                  
       +---------------------------------------------------------------------+                  
                                        train                                                   
       +---------------------------------------------------------------------+                  
       | key                                   | value        | mean         |                  
       |---------------------------------------+--------------+--------------|                  
       | accuracy-0-0.5/top1/masked            |   0.044610   |   0.060044   |                  
       | accuracy-0-0.5/top1/unmasked          |   0.012931   |   0.030694   |                  
       | accuracy-0-0.5/top25/masked           |   0.306382   |   0.343827   |                  
       | accuracy-0-0.5/top25/unmasked         |   0.163793   |   0.197396   |                  
       | accuracy-0.5-1.0/top1/masked          |   0.056550   |   0.107436   |                  
       | accuracy-0.5-1.0/top1/unmasked        |   0.032962   |   0.037206   |                  
       | accuracy-0.5-1.0/top25/masked         |   0.400143   |   0.477961   |                  
       | accuracy-0.5-1.0/top25/unmasked       |   0.268056   |   0.236933   |                  
       | loss                                  |   5.892028   |   5.651122   |                  
       | other/batch_size                      |   4.000000   |   4.000000   |                  
       | other/grad_norm                       |   0.055419   |   0.066452   |                  
       | other/learning_rate                   |   0.000261   |   0.000263   |                  
       | time/train_loop                       |   0.119633   |   0.112622   |                  
       +---------------------------------------------------------------------+                  
                                         val                                                    
       +---------------------------------------------------------------------+                  
       | key                                   | value        | mean         |                  
       |---------------------------------------+--------------+--------------|                  
       | loss                                  |   5.388220   |   5.576190   |                  
       | accuracy-0-0.5/top1/unmasked          |   0.025926   |   0.031996   |                  
       | accuracy-0-0.5/top1/masked            |   0.063009   |   0.065796   |                  
       | accuracy-0-0.5/top25/unmasked         |   0.225926   |   0.203248   |                  
       | accuracy-0-0.5/top25/masked           |   0.380251   |   0.360492   |                  
       | accuracy-0.5-1.0/top1/unmasked        |   0.038771   |   0.037591   |                  
       | accuracy-0.5-1.0/top1/masked          |   0.184387   |   0.121025   |                  
       | accuracy-0.5-1.0/top25/unmasked       |   0.209929   |   0.239489   |                  
       | accuracy-0.5-1.0/top25/masked         |   0.603717   |   0.500541   |                  
       | time/val_loop                         |   0.064827   |   0.065907   |                  
       +---------------------------------------------------------------------+                  
         Iteration (train) 45900/45900 --------------------- 1:46:49 / 0:00:00     

As you can see, the loss for c2f is much higher than for coarse.

sukun1045 commented 9 months ago

Hi, thanks for sharing the info. I also noticed that the c2f model only trains on 3 s chunks, while the coarse model uses 10 s. Will this chunk size affect performance?

hugofloresgarcia commented 9 months ago

Yeah, we train (and run inference with) the c2f model on 3 s chunks, rather than the 10 s chunks used by the coarse model. We train it on shorter chunks because coarse-to-fine generation is a better-conditioned generation problem, so the model doesn't need to understand long contexts the way the coarse model does.
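For a rough sense of what those chunk lengths mean in sequence length per codebook, assuming a codec frame rate of about 86 frames per second (typical for a 44.1 kHz neural codec; the exact rate here is an assumption, so check your codec's actual rate):

```python
frame_rate = 86  # assumed codec frame rate in frames/s; verify for your codec

coarse_tokens_per_codebook = 10 * frame_rate  # 10 s context for the coarse model
c2f_tokens_per_codebook = 3 * frame_rate      # 3 s context for the c2f model

print(coarse_tokens_per_codebook, c2f_tokens_per_codebook)  # 860 258
```

So the coarse model attends over a sequence more than three times longer per codebook, while the c2f model, already conditioned on the coarse codes, gets away with far less context.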