hugofloresgarcia / vampnet

music generation with masked transformers!
https://hugo-does-things.notion.site/VampNet-Music-Generation-via-Masked-Acoustic-Token-Modeling-e37aabd0d5f1493aa42c5711d0764b33?pvs=4
MIT License

codec training required? #39

Open tig3rmast3r opened 3 weeks ago

tig3rmast3r commented 3 weeks ago

Hello, I'm having a hard time training a good (coarse) .pth from scratch. Despite the large number of chunks (I'm now at around 100k), after many tries with many combinations, what I get is a model that is unable to go below 5.75 loss, while the default fine-tuned one, with a very small subset (2k chunks), easily reaches 5.4 on the same validation set. The result is a model that can make good drums but is rarely able to produce decent sounds; even with a lot of "hints" it is basically unable to follow the sounds I'm giving it. Compared with the default fine-tuned model, it's night and day. One reason could be my dataset, which is very genre-specific and may be a bit redundant and not generalized enough, but I'm wondering whether I need to train codec.pth on my specific dataset to get better results.

Since you never mentioned this in the project, I'm not sure whether it has any relevance for the end result. If I do have to train the codec, should I follow the settings from the codec.pth you provided, to avoid issues/extra coding in vampnet? Below is the metadata from the latest codec from descriptinc:

```yaml
encoder_dim: 64
encoder_rates: [2, 4, 8, 8]
latent_dim: 128
decoder_dim: 1536
decoder_rates: [8, 8, 4, 2]
n_codebooks: 18
codebook_size: 1024
codebook_dim: 8
quantizer_dropout: False
sample_rate: 44100
```

and here's yours from ismir2023:

```yaml
encoder_dim: 64
encoder_rates: [2, 6, 8, 8]
decoder_dim: 1536
decoder_rates: [6, 4, 4, 2, 2, 2]
n_codebooks: 14
codebook_size: 1024
codebook_dim: 8
quantizer_dropout: False
sample_rate: 44100
```
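For what it's worth, the two configs are not interchangeable at the token level: the encoder's total hop size is the product of `encoder_rates`, which sets how many token frames per second the transformer sees, and the codebook counts differ too. A minimal sketch of that arithmetic (assuming strides simply multiply along the time axis, as in DAC-style codecs):

```python
from math import prod

def token_rate(sample_rate: int, encoder_rates: list[int]) -> float:
    """Token frames per second produced by the codec encoder.

    Each encoder stage downsamples the time axis by its stride,
    so the total hop size is the product of the strides.
    """
    hop = prod(encoder_rates)
    return sample_rate / hop

# Descript's latest config (metadata above): hop 512 -> ~86.1 frames/s
descript_rate = token_rate(44100, [2, 4, 8, 8])

# The ismir2023 codec shipped with vampnet: hop 768 -> ~57.4 frames/s
vampnet_rate = token_rate(44100, [2, 6, 8, 8])

print(f"descript: {descript_rate:.2f} Hz, ismir2023: {vampnet_rate:.2f} Hz")
```

So a vampnet model trained against one codec's tokens won't line up with the other's frame rate or codebook layout without retraining.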

thank you for your time

hugofloresgarcia commented 2 weeks ago

Hi!

How many hours of audio are in your dataset? Training from scratch will only work with lots of data (~800k tracks for me, between 20-30k hours of audio IIRC)

tig3rmast3r commented 2 weeks ago

I'm providing around 100k 10-second chunks, so roughly 280 hours. Sources: some professional sample/loop packs (around 10%), and ~20k tracks from which I extracted at most 3 chunks each. About 30% of those chunks were made by running demucs on existing chunks to remove the drums; since I had problems generating sounds, I thought doubling them without drums would give the melodic content more weight.
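Just to sanity-check the shortfall, here is the arithmetic as a quick sketch (taking 20k hours as the low end of the range mentioned above):

```python
chunks = 100_000          # number of 10-second chunks in the dataset
chunk_seconds = 10
hours = chunks * chunk_seconds / 3600   # total audio in hours
target_hours = 20_000                   # low end of the suggested range

print(f"{hours:.0f} hours, about {hours / target_hours:.1%} of the target")
```

By hours of audio, 100k ten-second chunks come to roughly 278 hours, which is only a small fraction of 20-30k hours.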

So I'm basically at 10% of the required data, if not less... :(