mimbres / YourMT3

multi-task and multi-track music transcription for everyone
GNU General Public License v3.0

Question about transcribe only singing voice data #10

Closed · Joanna1212 closed this 1 month ago

Joanna1212 commented 1 month ago

Hello,

I am trying to train a model to transcribe only vocal data. I set the parameters as follows: `-tk singing_v1 -d all_singing_v1`, which select the task and training data. However, I encountered an error in the model code at `./amt/src/model/t5mod.py`, line 633, `b, k, t, d = inputs_embeds.size()`: the tensor has only three dimensions, `torch.Size([6, 1024, 512])`.

How should I modify this to train successfully? Do I need to set any other parameters? Thanks!
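
For context, a minimal sketch of the shape mismatch (the sizes are taken from the error above; the channel count 13 is the default for the multi-channel decoder, per the config change discussed below):

```python
import torch

# The multi-channel decoder unpacks (batch, channels, time, dim),
# e.g. (6, 13, 1024, 512) with the default "num_channels": 13.
inputs_embeds = torch.randn(6, 13, 1024, 512)
b, k, t, d = inputs_embeds.size()  # works

# A single-channel encoder output lacks the channel axis,
# so the same unpacking fails:
inputs_embeds = torch.randn(6, 1024, 512)
b, k, t, d = inputs_embeds.size()  # ValueError: not enough values to unpack
```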

mimbres commented 1 month ago

Hi @Joanna1212

Joanna1212 commented 1 month ago

```bash
args=('yourmt3_only_sing_voice_3' '-tk' 'singing_v1' '-d' 'all_singing_v1'
      '-dec' 'multi-t5' '-nl' '26' '-enc' 'perceiver-tf' '-sqr' '1'
      '-ff' 'moe' '-wf' '4' '-nmoe' '8' '-kmoe' '2' '-act' 'silu'
      '-epe' 'rope' '-rp' '1' '-ac' 'spec' '-hop' '300' '-atc' '1'
      '-pr' '16-mixed' '-bsz' '12' '12' '-st' 'ddp' '-se' '1000000')
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py "${args[@]}"
```

This way 👆!

I only want to transcribe the singing voice track (single-track prediction).

thanks!

Joanna1212 commented 1 month ago

I set `config.py`'s `"num_channels"` from 13 to 1, and it seems to work. Let's try the training.
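
For anyone following along, a minimal sketch of the change meant here (the surrounding structure of `config.py` is assumed, not copied from the repo):

```python
# config.py (sketch; only the relevant key shown)
model_cfg = {
    # ...
    "num_channels": 1,  # was 13 for the multi-channel decoder; 1 for a single vocal track
    # ...
}
```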

mimbres commented 1 month ago

@Joanna1212 Sorry for the confusion about the task prefix. I looked further into the code, and in the current version the 'singing_v1' task is no longer supported. We deprecated the use of prefix tokens for exclusive transcription of specific instruments, as it showed no performance benefit.

Joanna1212 commented 1 month ago

Thank you for your detailed response. I'll try training with your final model. Extracting the singing track (program 100) through post-processing is very easy; I have already done it.
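
For reference, a minimal post-processing sketch with `pretty_midi` (file names are placeholders; program 100 for the singing track is taken from the comment above):

```python
import pretty_midi

SINGING_PROGRAM = 100  # program number used for the singing-voice track

pm = pretty_midi.PrettyMIDI("yourmt3_multitrack_output.mid")

# Keep only the non-drum instruments whose program matches the singing voice.
vocals = pretty_midi.PrettyMIDI()
for inst in pm.instruments:
    if not inst.is_drum and inst.program == SINGING_PROGRAM:
        vocals.instruments.append(inst)

vocals.write("vocals_only.mid")
```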

However, I noticed some minor errors in the singing voice on some pop music (as you mentioned in your paper). Therefore, I hope to add some vocal transcription data to improve the accuracy of vocal transcription.

The dataset I want to add consists of complete songs (vocals mixed with accompaniment, plus the split vocal track) and the corresponding vocal MIDI; just this one track. I noticed you only use the vocal tracks of MIR-ST500 and CMedia. Do you think using plenty of converted_Mixture.wav files would be better than just online augmentation? 🫡

Joanna1212 commented 1 month ago

Perhaps I should add vocal datasets to the current ones in "all_cross_final", continuously adding split vocal datasets like mir_st500_voc, and keep the "mc13_full_plus_256" task with the multi-channel decoder,

or complete the "p_include_singing" part (the probability of including singing for cross-augmented examples).

Maybe this would enhance vocal performance based on multi-track transcription?

Joanna1212 commented 1 month ago

I noticed that you used temperature-based sampling in your paper to determine the proportions of each dataset.

For my scenario, where I am only interested in vocals, do you think I should set the proportion of the singing voice datasets (MIR-ST500, CMedia) higher?

Additionally, you mentioned, 'We identified the dataset most prone to over-fitting, as shown by its validation loss curve.' Did you train each dataset separately to observe this, or did you observe the validation results of individual datasets during the overall training? Thanks!

mimbres commented 1 month ago

@Joanna1212

> Do you think using plenty of converted_Mixture.wav files would be better than just online augmentation?

This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.

"all_cross_final"

Yes, I recommend modifying `all_cross_final` in `data_preset.py`. For example:

 "all_cross_final": {
        "presets": [
            ...
           `YOUR_DATASET_NAME`
        ],
       "weights": [..., `YOUR_SAMPLING_WEIGHT`],
       "eval_vocab": [..., SINGING_SOLO_CLASS],
       ...

> I noticed that you used temperature-based sampling...

The main point of our paper is that exact temperature-based sampling (as in the original MT3) significantly degrades performance. See more details in Appendix G (not F; 😬 found a typo). However, if the datasets are of similar quality, you can weight them proportionally. For example, if your custom singing data is similar in size to MIR-ST500, assign them similar weights. It's okay if the total sum of the added weights exceeds 1.
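
To make the contrast concrete, a small sketch of the two weighting schemes (the dataset sizes and the temperature value are made-up numbers):

```python
# Hypothetical dataset sizes (number of training examples).
sizes = {"mir_st500": 330, "cmedia": 100, "my_vocal_set": 400}

# Exact temperature-based sampling (MT3-style): w_i ∝ n_i^(1/T).
# T > 1 flattens the distribution toward uniform.
T = 3.0
raw = {name: n ** (1.0 / T) for name, n in sizes.items()}
total = sum(raw.values())
temperature_weights = {name: w / total for name, w in raw.items()}

# Proportional weighting for similar-quality datasets: similar sizes get
# similar weights, and the sum does not need to be 1.
ref = sizes["mir_st500"]
proportional_weights = {name: n / ref for name, n in sizes.items()}

print(temperature_weights)
print(proportional_weights)
```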

> did you observe the validation results of individual datasets...

Yes. In the wandb logger, `dataloader_idx` follows the same order as the datasets defined in the data preset.
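
In other words, something like this (the preset names here are hypothetical; PyTorch Lightning suffixes each validation metric with its dataloader index):

```python
# Order must match the "presets" list in the data preset definition.
presets = ["mir_st500", "cmedia", "my_vocal_set"]  # hypothetical

for idx, name in enumerate(presets):
    # e.g. the wandb metric "val_loss/dataloader_idx_0" belongs to mir_st500
    print(f"dataloader_idx_{idx} -> {name}")
```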

*(screenshot of the wandb logger)*

Joanna1212 commented 1 month ago

Thanks, I'll try this with more vocal data. I understand your explanation about the wandb logger. Thank you for your response and advice.

Joanna1212 commented 1 month ago

> This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.

I tried adding some vocal data. Initially, the metrics showed a slight improvement, but soon there was a gradient explosion. The metrics were slightly better on cmedia and mir_st500. 👍

BTW, please notify me when there is an update 😄. Thanks!