mimbres / YourMT3

multi-task and multi-track music transcription for everyone
GNU General Public License v3.0

Question about transcribe only singing voice data #10

Closed · Joanna1212 closed this 1 month ago

Joanna1212 commented 1 month ago

Hello,

I am trying to train a model to transcribe only vocal data. I set the parameters as follows: `-tk singing_v1 -d all_singing_v1`, which select the task and training data. However, I encountered an error in the model code at `./amt/src/model/t5mod.py`, line 633, `b, k, t, d = inputs_embeds.size()`: the tensor has only three dimensions, `torch.Size([6, 1024, 512])`.

How should I modify this to train successfully? Do I need to set any other parameters? Thanks!
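
For context, a minimal sketch of the shape mismatch (the sizes are taken from the error above; the channel count 13 is the default for the multi-channel decoder, per the config change discussed below):

```python
import torch

# The multi-channel decoder unpacks (batch, channels, time, dim),
# e.g. (6, 13, 1024, 512) with the default "num_channels": 13.
inputs_embeds = torch.randn(6, 13, 1024, 512)
b, k, t, d = inputs_embeds.size()  # works

# A single-channel encoder output lacks the channel axis,
# so the same unpacking fails:
inputs_embeds = torch.randn(6, 1024, 512)
b, k, t, d = inputs_embeds.size()  # ValueError: not enough values to unpack
```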

mimbres commented 1 month ago

Hi @Joanna1212

Joanna1212 commented 1 month ago

```bash
args=('yourmt3_only_sing_voice_3' '-tk' 'singing_v1' '-d' 'all_singing_v1'
      '-dec' 'multi-t5' '-nl' '26' '-enc' 'perceiver-tf' '-sqr' '1'
      '-ff' 'moe' '-wf' '4' '-nmoe' '8' '-kmoe' '2' '-act' 'silu'
      '-epe' 'rope' '-rp' '1' '-ac' 'spec' '-hop' '300' '-atc' '1'
      '-pr' '16-mixed' '-bsz' '12' '12' '-st' 'ddp' '-se' '1000000')
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py "${args[@]}"
```

This way 👆!

I only want to transcribe the singing voice track (single-track prediction).

thanks!

Joanna1212 commented 1 month ago

I set `config.py`'s `"num_channels"` from 13 to 1, and it seems to work. Let's try the training.
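
For anyone following along, a minimal sketch of the change meant here (the surrounding structure of `config.py` is assumed, not copied from the repo):

```python
# config.py (sketch; only the relevant key shown)
model_cfg = {
    # ...
    "num_channels": 1,  # was 13 for the multi-channel decoder; 1 for a single vocal track
    # ...
}
```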

mimbres commented 1 month ago

@Joanna1212 Sorry for the confusion about the task prefix. I looked further into the code, and in the current version the 'singing_v1' task is no longer supported. We deprecated the use of prefix tokens for exclusive transcription of specific instruments, as it showed no performance benefit.

Joanna1212 commented 1 month ago

Thank you for your detailed response. I'll try training with your final model. Extracting the singing track (program 100) through post-processing is very easy; I have already done it.
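
For reference, a minimal post-processing sketch with `pretty_midi` (file names are placeholders; program 100 for the singing track is taken from the comment above):

```python
import pretty_midi

SINGING_PROGRAM = 100  # program number used for the singing-voice track

pm = pretty_midi.PrettyMIDI("yourmt3_multitrack_output.mid")

# Keep only the non-drum instruments whose program matches the singing voice.
vocals = pretty_midi.PrettyMIDI()
for inst in pm.instruments:
    if not inst.is_drum and inst.program == SINGING_PROGRAM:
        vocals.instruments.append(inst)

vocals.write("vocals_only.mid")
```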

However, I noticed some minor errors in the singing voice on some pop music (as you mentioned in your paper). Therefore, I hope to add some vocal transcription data to improve the accuracy of vocal transcription.

The dataset I want to add consists of complete songs (vocals mixed with accompaniment, plus the split vocal track) and the corresponding vocal MIDI; just this one track. I noticed you only use the vocal tracks of MIR-ST500 and CMedia. Do you think using plenty of converted_Mixture.wav files would be better than just online augmentation? 🫡

Joanna1212 commented 1 month ago

Perhaps I should add vocal datasets to the current ones in "all_cross_final", continuously adding split vocal datasets like mir_st500_voc, and keep the "mc13_full_plus_256" task with the multi-channel decoder,

or complete the "p_include_singing" part (the probability of including singing for cross-augmented examples).

Maybe this would enhance vocal performance based on multi-track transcription?

Joanna1212 commented 1 month ago

I noticed that you used temperature-based sampling in your paper to determine the proportions of each dataset.

For my scenario, where I am only interested in vocals, do you think I should set the proportion of the singing voice datasets (MIR-ST500, CMedia) higher?

Additionally, you mentioned, 'We identified the dataset most prone to over-fitting, as shown by its validation loss curve.' Did you train each dataset separately to observe this, or did you observe the validation results of individual datasets during the overall training? Thanks!

mimbres commented 1 month ago

@Joanna1212

> Do you think using plenty of converted_Mixture.wav files would be better than just online augmentation?

This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.

"all_cross_final"

Yes, I recommend modifying `all_cross_final` in `data_preset.py`. For example:

 "all_cross_final": {
        "presets": [
            ...
           `YOUR_DATASET_NAME`
        ],
       "weights": [..., `YOUR_SAMPLING_WEIGHT`],
       "eval_vocab": [..., SINGING_SOLO_CLASS],
       ...

> I noticed that you used temperature-based sampling...

The main point of our paper is that exact temperature-based sampling (as in the original MT3) significantly degrades performance. See more details in Appendix G (not F; 😬 found a typo). However, if the datasets are of similar quality, you can weight them proportionally. For example, if your custom singing data is similar in size to MIR-ST500, assign them similar weights. It's okay if the total sum of the added weights exceeds 1.
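
To make the contrast concrete, a small sketch of the two weighting schemes (the dataset sizes and the temperature value are made-up numbers):

```python
# Hypothetical dataset sizes (number of training examples).
sizes = {"mir_st500": 330, "cmedia": 100, "my_vocal_set": 400}

# Exact temperature-based sampling (MT3-style): w_i ∝ n_i^(1/T).
# T > 1 flattens the distribution toward uniform.
T = 3.0
raw = {name: n ** (1.0 / T) for name, n in sizes.items()}
total = sum(raw.values())
temperature_weights = {name: w / total for name, w in raw.items()}

# Proportional weighting for similar-quality datasets: similar sizes get
# similar weights, and the sum does not need to be 1.
ref = sizes["mir_st500"]
proportional_weights = {name: n / ref for name, n in sizes.items()}

print(temperature_weights)
print(proportional_weights)
```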

> did you observe the validation results of individual datasets...

Yes. In the wandb logger, `dataloader_idx` follows the same order as the datasets defined in the data preset.
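
In other words, something like this (the preset names here are hypothetical; PyTorch Lightning suffixes each validation metric with its dataloader index):

```python
# Order must match the "presets" list in the data preset definition.
presets = ["mir_st500", "cmedia", "my_vocal_set"]  # hypothetical

for idx, name in enumerate(presets):
    # e.g. the wandb metric "val_loss/dataloader_idx_0" belongs to mir_st500
    print(f"dataloader_idx_{idx} -> {name}")
```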

*(screenshot of the wandb logger)*

Joanna1212 commented 1 month ago

Thanks, I'll try this with more vocal data. I understand your explanation about the wandb logger. Thank you for your response and advice.

Joanna1212 commented 1 month ago

> This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.

I tried adding some vocal data. Initially, the metrics showed a slight improvement, but soon there was a gradient explosion. The metrics were slightly better on cmedia and mir_st500. 👍

BTW, please notify me when there is an update 😄. Thanks!