tsurumeso / vocal-remover

Vocal Remover using Deep Neural Networks
MIT License

Training my own model - Questions about understanding the process better #118

Open KatieBelli opened 1 year ago

KatieBelli commented 1 year ago

My goal is to train my own models with this code. For testing, I started by creating 5 tracks, each divided into a mixture file and an instrumental file (= 10 files, each about 30 seconds long). Both versions are perfectly synced.

Here are my questions to learn more and understand the process of training a model better:

  1. In my test with 5 files, training starts with the second file, not the first, and none of the other 4 files are mentioned during training. Why doesn't it start with the first file? All files are named correctly. [screenshot: 01 question]

  2. The current training started around 12 am; it's now 5 pm and it's at epoch 117. How many epochs does it need until it's finished? Or can I stop at a certain epoch? If so, which epoch number is recommended, and how do I stop training correctly to avoid issues when using my trained model?

  3. Can I continue training a model? If yes, which command do I need to type into cmd?

  4. Are 30 seconds per file enough, or is that too long or too short? What is the ideal length of the audio files for training?

I appreciate any support because I want to learn more about this. I did some research but couldn't find fitting answers and hope to find them here.

Edit:

  1. If I want to use my own trained model, do I need to open the inference.py file and change the default in line 111? If so, what file name do I need to replace 'models/baseline.pth' with?

[screenshot: 02 question]

This is how it looks now in my folders:

[screenshot: 02_02 question]

[screenshot: 02_03 question]

tsurumeso commented 1 year ago

In my test with 5 files, training starts with the second file, not the first, and none of the other 4 files are mentioned during training. Why doesn't it start with the first file? All files are named correctly.

80% of the dataset is used for training and the remaining 20% for validation. The file that is displayed belongs to the validation set; the rest are used for training.
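
To make the split concrete, here is a minimal sketch of how such an 80/20 split of a file list can work (the file names, pairing scheme, and seed are assumptions for the example, not the repository's exact code). With 5 pairs, 20% is exactly one pair, which is why only a single file shows up as validation data:

```python
import random

# Hypothetical (mixture, instrumental) pairs -- stand-ins for your own dataset.
pairs = [(f"mixtures/{i:02d}.wav", f"instruments/{i:02d}.wav") for i in range(1, 6)]

random.seed(0)         # a fixed seed makes the split reproducible
random.shuffle(pairs)  # the split is random, so file 01 may land in either set

split = int(len(pairs) * 0.8)   # 80% of the pairs ...
train_pairs = pairs[:split]     # ... update the model weights
val_pairs = pairs[split:]       # the remaining 20% are only used to measure validation loss

print("training:", [m for m, _ in train_pairs])
print("validation:", [m for m, _ in val_pairs])
```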

The current training started around 12 am; it's now 5 pm and it's at epoch 117. How many epochs does it need until it's finished? Or can I stop at a certain epoch? If so, which epoch number is recommended, and how do I stop training correctly to avoid issues when using my trained model?

200 epochs (the default setting) are recommended.

Can I continue training a model? If yes, which command do I need to type into cmd?

No.

Are 30 seconds per file enough, or is that too long or too short? What is the ideal length of the audio files for training?

You should use audio files of typical song length (about 3-6 minutes).

If I want to use my own trained model, do I need to open the inference.py file and change the default in line 111? If so, what file name do I need to replace 'models/baseline.pth' with?

The following command is what you need.

python inference.py --input path/to/an/audio/file --pretrained_model path/to/a/latest/model/file --gpu 0
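
(Passing --pretrained_model on the command line should override the default path set in inference.py, so editing line 111 is not necessary; if you prefer to edit it anyway, point it at the model file (.pth) produced by your training run.)
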
KatieBelli commented 1 year ago

Thank you, @tsurumeso!

200 epochs are a lot. My PC needs about 20-35 minutes per epoch. I wish there were an option to continue training a model, because I can't let my PC run for several days without a break. But maybe that will be possible someday.
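
As a rough arithmetic check: at 20-35 minutes per epoch, 200 epochs comes to roughly 2.8-4.9 days of continuous training. The repository itself does not offer a resume option, but as a general illustration of what resuming involves, a training script can periodically save a checkpoint (model weights plus optimizer state and the epoch counter) and reload it later. The sketch below is plain PyTorch with made-up names, not part of vocal-remover:

```python
import torch

# Hypothetical model and optimizer -- stand-ins for whatever the training script builds.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

CKPT_PATH = "checkpoint_last.pth"  # made-up file name for this example

def save_checkpoint(epoch):
    # Saving the optimizer state (not just the weights) is what makes a clean resume possible.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, CKPT_PATH)

def load_checkpoint():
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1  # epoch to continue from
```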

I'd be happy if you could help me with these two questions as well:

  1. How many files are recommended? I trained my model with 28 files and got decent results (after about 13 epochs), but it still needed more training. However, I have the impression that the more files I add, the worse the vocal quality, or the separation quality in general, becomes. (I always use lossless files, assembled from synced multitracks.)

  2. I noticed that "best validation loss" is displayed less and less often as the epoch number increases. In the beginning it appears frequently, but after about 10-15 epochs it stops appearing. I trained for almost 40 epochs and wondered: is it normal that "best validation loss" appears less often over time?
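
On the second point: "best validation loss" is presumably printed only when the current epoch's validation loss beats every previous epoch, so early on almost every epoch sets a new best, and as the model converges improvements become smaller and rarer. A minimal sketch of that bookkeeping (generic Python with a toy loss curve, not the repository's exact code):

```python
# Toy validation-loss curve: fast improvement early, slow and noisy later.
val_losses = [0.50, 0.31, 0.24, 0.21, 0.20, 0.21, 0.195, 0.20, 0.194, 0.199]

best_val_loss = float("inf")
for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        print(f"epoch {epoch}: best validation loss {val_loss:.3f}")  # checkpoint would be saved here
    # nothing is printed otherwise -- later epochs beat the record less and less often
```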

KatieBelli commented 1 year ago

I'm sorry, I accidentally closed this thread with my last comment; that wasn't intended. Please see my last comment.

KatieBelli commented 1 year ago

I did several training runs and I'd love to dive deeper into training a good model, but I didn't want to open a new issue:

  1. Is splitting the audio files into chunks (like 30 or 60 seconds) better for training? I noticed that when I split them into chunks, I get a more consistent validation loss in my training runs.

(I know you said they should be "normal"-length songs, but when I train with "normal"-length songs my training doesn't go as well.)

  2. Should I leave the default parameters as they are, or is it helpful to change them?

I wonder whether increasing the patches / accumulation steps / mixup rate parameters can help, or whether it's best to keep them at their default values.

  3. Is it normal that the first epochs go well (always "best validation loss"), but after that it only happens every 2-3 epochs, and eventually only rarely?

  4. Is it possible that training doesn't improve for 6-9 epochs and then the loss suddenly drops sharply over the next epochs? (Maybe I'm just not patient enough, if that scenario is possible.)

  5. What's the recommended total length for the training and validation dataset?

(My training dataset has a length of about 3 hours and 49 minutes.)

  6. How do I know when the model is well trained enough that I can stop training altogether?

Should the validation loss come very close to 0?

Here is an example of one of my training runs with 60-second chunks:

[screenshot: IMG_20230917_171158_edit_1970372453544650.jpg]

Looking forward to your answers.
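
On the question of when to stop: the validation loss will not reach zero on real data, so a common rule of thumb is to stop once it has not improved for some number of epochs (a "patience" window) and keep the checkpoint with the best validation loss. A minimal sketch of such a rule, shown as a general technique rather than something built into this repository (the patience value and the loss curve are made up):

```python
# Toy validation-loss curve; in a real run these values come from the validation step each epoch.
val_losses = [0.40, 0.30, 0.26, 0.25, 0.251, 0.249, 0.250, 0.252, 0.251, 0.250, 0.253, 0.254]

PATIENCE = 5                   # assumed value: epochs to wait without a new best
best_val_loss = float("inf")
epochs_since_best = 0

for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_since_best = 0  # new best: this is the checkpoint worth keeping
    else:
        epochs_since_best += 1
    if epochs_since_best >= PATIENCE:
        print(f"stopping after epoch {epoch}; best validation loss was {best_val_loss:.3f}")
        break
```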

mumu233desu commented 5 months ago

Sorry to disturb you, but could you please share your hardware and the video memory usage during training? I'm considering slicing the audio into shorter pieces to prevent problems like running out of memory.

tsurumeso commented 5 months ago

There is no need to split the audio in advance, because it is automatically split to the appropriate size during training. If you don't have enough VRAM, it is better to reduce the batch size. By the way, I use an RTX 3090 for training.
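
To illustrate why pre-splitting is unnecessary (a generic sketch of the idea, not the repository's exact implementation): during training, fixed-size patches are cropped at random positions from each spectrogram, so a full-length song already yields many training examples, and memory use is governed by the crop size and batch size rather than the song length.

```python
import numpy as np

def random_crop(spec, cropsize):
    """Take one fixed-width patch from a (bins, frames) spectrogram at a random offset."""
    start = np.random.randint(0, spec.shape[1] - cropsize + 1)
    return spec[:, start:start + cropsize]

# Toy example: a "3-minute song" as a 1024-bin x 15000-frame spectrogram.
song = np.zeros((1024, 15000), dtype=np.float32)
patch = random_crop(song, cropsize=256)  # the cropsize value is an assumption for the example
print(patch.shape)                       # (1024, 256)
```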

mumu233desu commented 5 months ago

There is no need to split the audio in advance, because it is automatically split to the appropriate size during training. If you don't have enough VRAM, it is better to reduce the batch size. By the way, I use an RTX 3090 for training.

Oh, I see. By the way, I'm adding reverb and some other effects to the audio, so I could have split it at the same time. In that case I'll apply the effects to the whole songs instead. Thanks for the reply!

Sorry for another question: if I need to rent a GPU, which kind of compute capability does training require, binary32 (FP32) or binary16 (FP16)? (Sorry for my limited coding knowledge.)
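
For context on the precision question: PyTorch training runs in FP32 ("binary32") by default, so any card with enough VRAM for FP32 works; FP16 ("binary16", usually as mixed precision) is an optional optimization that mainly saves memory and speeds things up on cards with tensor cores. A minimal sketch of mixed-precision training in plain PyTorch, shown only to illustrate the two terms, not as something this repository necessarily uses:

```python
import torch

# Requires a CUDA GPU; the model and data here are toy stand-ins.
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales the loss so FP16 gradients don't underflow

x = torch.randn(8, 512, device="cuda")
target = torch.randn(8, 512, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():       # the forward pass runs in FP16 where it is safe
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()         # backward pass on the scaled loss
scaler.step(optimizer)                # unscales gradients, then takes the optimizer step
scaler.update()
```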