Hello,
I am trying to train a model to transcribe only vocal data. I set the parameters as follows: '-tk' 'singing_v1' '-d' 'all_singing_v1', which are the task and the training data. However, I encountered an error in the model code, `./amt/src/model/t5mod.py`, line 633, at `b, k, t, d = inputs_embeds.size()`: the tensor has only three dimensions, `torch.Size([6, 1024, 512])`.
How should I modify this to train successfully? Should I set any other parameters? Thanks!
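For context, the failing line unpacks four dimensions from a 3-D tensor. A minimal reproduction of the mismatch (shapes taken from the error message; the 4-D layout `[batch, channels, time, dim]` is inferred from the variable names `b, k, t, d`, not confirmed against the source):

```python
import torch

# 3-D tensor as reported in the error: [batch, time, dim]
inputs_embeds = torch.randn(6, 1024, 512)

# The multi-channel decoder path expects 4-D: [batch, channels, time, dim]
b, k, t, d = inputs_embeds.size()  # ValueError: not enough values to unpack (expected 4, got 3)
```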
Hi @Joanna1212
Can you show me all of your train.py options? That error seems to be related to the encoder/decoder type.
The `singing_v1` task is an experimental option. It uses a singing prefix token, which is not covered in the paper. `all_singing_v1` is also just for quick experimentation, with the sampling probability of the singing dataset increased.
```bash
args=('yourmt3_only_sing_voice_3' '-tk' 'singing_v1' '-d' 'all_singing_v1' '-dec' 'multi-t5'
      '-nl' '26' '-enc' 'perceiver-tf' '-sqr' '1' '-ff' 'moe' '-wf' '4' '-nmoe' '8' '-kmoe' '2'
      '-act' 'silu' '-epe' 'rope' '-rp' '1' '-ac' 'spec' '-hop' '300' '-atc' '1' '-pr' '16-mixed'
      '-bsz' '12' '12' '-st' 'ddp' '-se' '1000000')
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py "${args[@]}"
```

This way 👆!
I only want to transcribe the singing voice track (single-track prediction).
Thanks!
I set `config.py`'s `"num_channels"` from 13 to 1, and it seems to work. Let's try the training.
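For reference, the edit amounts to a one-value change along these lines (a sketch only; the exact dict that holds `"num_channels"` inside `config.py` is an assumption and may differ in your version):

```python
# config.py -- sketch; the enclosing dict and its other keys are assumed.
model_cfg = {
    # ...other model settings...
    "num_channels": 1,  # was 13: decode a single channel for vocals-only output
}
```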
@Joanna1212 Sorry for the confusion about the task prefix. I looked further into the code, and in the current version the `singing_v1` task is no longer supported. We deprecated the use of prefix tokens for exclusive transcription of specific instruments because they showed no performance benefit.
With `num_channels=1` and a multi-channel T5 decoder, it will behave the same as a single-channel decoder. As mentioned earlier, it will not use any prefix tokens for singing-only transcription. Currently it is recommended to choose decoder type `'t5'` and task `'mt3_full_plus'` for single-channel decoding, or decoder type `'multi-t5'` and task `'mc13_full_plus_256'` for multi-channel decoding (see `exc_v1` in `config/task.py`). I also recommend `-it` over using `-se` or epoch-based counting, for better managing the cosine scheduler. See https://github.com/mimbres/YourMT3/issues/2#issuecomment-2342031869
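Put together, a single-channel invocation might look like the sketch below. Only `-tk 'mt3_full_plus'` and `-dec 't5'` come from the recommendation above; the remaining flags are carried over from the earlier command for illustration, and the `-it` value is a placeholder:

```bash
# Sketch: single-channel vocal transcription run (flag values illustrative).
args=('vocal_single_channel' '-tk' 'mt3_full_plus' '-d' 'all_cross_final'
      '-dec' 't5' '-enc' 'perceiver-tf' '-pr' '16-mixed' '-bsz' '12' '12'
      '-st' 'ddp' '-it' '600000')
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py "${args[@]}"
```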
Thank you for your detailed response. I'll try training your final model. Extracting the singing track (100) through post-processing is very easy; I have already completed it.
However, I noticed some minor singing-voice errors on some pop music (as you mentioned in your paper). Therefore, I hope to add some vocal transcription data to improve the accuracy of vocal transcription.
The dataset I want to add consists of complete songs (vocals mixed with accompaniment, plus a separated vocal track) and the corresponding vocal MIDI, just this one track. I notice you only use the vocal tracks of MIR-ST500 and CMedia. Do you think using plenty of converted_Mixture.wav files could be better than just online augmentation? 🫡
Perhaps I should add vocal datasets to the current "all_cross_final" preset, continuously adding separated vocal datasets like `mir_st500_voc`, and keep the "mc13_full_plus_256" task with the multi-channel decoder; or should I instead work on the "p_include_singing" part (the probability of including singing in cross-augmented examples)? Maybe this would enhance vocal performance based on multi-track transcription?
I noticed that you used temperature-based sampling in your paper to determine the proportions of each dataset.
For my scenario, where I am only interested in vocals, do you think I should adjust the proportion of the singing-voice datasets (MIR-ST500, CMedia) to be higher?
Additionally, you mentioned, 'We identified the dataset most prone to over-fitting, as shown by its validation loss curve.' Did you train each dataset separately to observe this, or did you observe the validation results of individual datasets during the overall training? Thanks!
@Joanna1212
> Do you think using plenty of converted_Mixture.wav files could be better than just online augmentation?
This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.
"all_cross_final"
Yes, I recommend to modify all_cross_final
in data_preset.py
. For example:
"all_cross_final": {
"presets": [
...
`YOUR_DATASET_NAME`
],
"weights": [..., `YOUR_SAMPLING_WEIGHT`],
"eval_vocab": [..., SINGING_SOLO_CLASS],
...
> I noticed that you used temperature-based sampling...
The main point of our paper is that exact temperature-based sampling (as in the original MT3) significantly degrades performance. See more details in Appendix G (not F; 😬 found a typo). However, if the datasets are of similar quality, you can weight them proportionally. For example, if your custom singing data is similar in size to MIR-ST500, assign them similar weights. It's okay if the total sum of the added weights exceeds 1.
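The weights therefore seem to behave as relative sampling proportions rather than strict probabilities, which would explain why a sum above 1 is fine. A small sketch with hypothetical values (the normalization step is my assumption, not confirmed from the code):

```python
# Hypothetical weights; only the ratios matter if the sampler normalizes internally.
weights = {"mir_st500_voc": 0.6, "cmedia_voc": 0.3, "my_vocal_set": 0.6}  # sums to 1.5
total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}
print(probs)  # {'mir_st500_voc': 0.4, 'cmedia_voc': 0.2, 'my_vocal_set': 0.4}
```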
> did you observe the validation results of individual datasets...
Yes. In the wandb logger, `dataloader_idx` is in the same order as the datasets defined in the data preset.
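So the index-to-dataset mapping can be read off directly (a sketch; the preset names are hypothetical, and the `/dataloader_idx_N` metric suffix follows the usual PyTorch Lightning convention for multiple validation dataloaders):

```python
# Map each validation dataloader index back to its dataset name.
presets = ["slakh", "musicnet", "mir_st500_voc"]  # hypothetical order from the data preset
for idx, name in enumerate(presets):
    print(f"metrics under .../dataloader_idx_{idx} belong to {name}")
```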
Thanks, I'll try this with more vocal data. I understand your explanation about the wandb logger. Thank you for your response and advice.
> This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.
I tried adding some vocal data. Initially, the metrics showed a slight improvement, but soon there was a gradient explosion. The metrics were slightly better on cmedia and mir_st500. 👍
BTW, please notify me when there is an update 😄. Thanks!