lucidrains / BS-RoFormer

Implementation of Band Split Roformer, SOTA Attention network for music source separation out of ByteDance AI Labs

Potential bug with `num_stems > 1` #20

Closed: dorpxam closed this issue 10 months ago

dorpxam commented 10 months ago

Hi @lucidrains

I'm working on training code, which I'll share here, that attempts to reproduce the same workflow as the SAMI-ByteDance paper. In order to make the training code generic, and because I'm pretty new to this kind of task, I want to be sure I clearly understand your code.

`num_stems` parameter

EDIT: I have removed this very stupid assumption.

Divergence of default parameters

This is less important, but noted to make it easier to follow the code against the papers:

  1. In the "Music Source Separation with Band-Split RoPE Transformer" paper:

We use D = 384 for feature dimension, L = 12 for the number of Transformer blocks, 8 heads for each Transformer, and a dropout rate of 0.1.

I don't know whether the choice of attn_dropout = 0. and ff_dropout = 0. as default parameters for the Transformers is motivated by technical considerations. If not, maybe it would be better to change the values to follow the original paper?

  2. In the "Mel-Band RoFormer for Music Source Separation" paper:

    We use 60 Mel-bands, as it is similar to the number of subbands, i.e., 62, adopted by BS-RoFormer.

Same question for the default num_bands = 62 of the MelBandRoformer? (A sketch with paper-matching values follows below.)
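
For reference, a minimal sketch of what paper-matching values would look like when constructing the models. The dim / depth / heads / dropout / num_bands values come from the quoted papers; the keyword argument names are assumed to match the constructors in this repo (attn_dropout, ff_dropout and num_bands are mentioned above, the rest follow the README example):

```python
import torch
from bs_roformer import BSRoformer, MelBandRoformer

# BS-RoFormer with the hyperparameters quoted from the paper:
# D = 384, L = 12 transformer blocks, 8 heads, dropout 0.1
bs_model = BSRoformer(
    dim = 384,
    depth = 12,
    heads = 8,
    attn_dropout = 0.1,          # paper uses 0.1; repo default is 0.
    ff_dropout = 0.1,            # paper uses 0.1; repo default is 0.
    time_transformer_depth = 1,
    freq_transformer_depth = 1
)

# Mel-Band RoFormer with 60 mel bands as in the paper (repo default is 62)
mel_model = MelBandRoformer(
    dim = 384,
    depth = 12,
    heads = 8,
    num_bands = 60,
    attn_dropout = 0.1,
    ff_dropout = 0.1,
    time_transformer_depth = 1,
    freq_transformer_depth = 1
)
```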

lucidrains commented 10 months ago

@dorpxam hey, so `target` and `raw_audio` are actually two different inputs

`target` is the expected output for each of the stems

and yea, we can refine the hyperparameters a bit more, let me get that done
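
to illustrate the distinction, here is a rough sketch of multi-stem training; the target tensor shape (batch, num_stems, time) for mono audio is an assumption about how the stems are laid out:

```python
import torch
from bs_roformer import BSRoformer

model = BSRoformer(
    dim = 384,
    depth = 12,
    num_stems = 4,               # e.g. vocals, drums, bass, other
    time_transformer_depth = 1,
    freq_transformer_depth = 1
)

raw_audio = torch.randn(2, 352800)   # the mixture: (batch, time)
target = torch.randn(2, 4, 352800)   # one expected waveform per stem: (batch, num_stems, time) - assumed layout

# passing `target` makes the forward pass return the training loss
loss = model(raw_audio, target = target)
loss.backward()
```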

lucidrains commented 10 months ago

@ZFTurbo is the stem feature working for your kaggle stuff?

dorpxam commented 10 months ago

@lucidrains Oh, I look so dumb now. I hope you understand that my issues are not meant to criticize or call your work into question. I'm an autodidact and learn step by step, so I often say things that are beyond my depth.

That the model can generate the mask of a target from a source, guided by a loss function, makes sense. But generating multiple masks based only on a loss function, without the targets ever being fed into the model, is black magic to me :)
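
As far as I understand it, the mask-estimation head simply emits num_stems masks, and the loss compares each masked estimate to its corresponding target, so the targets only ever enter through the loss. At inference no target is needed; a rough sketch, where the output shape is my assumption:

```python
import torch
from bs_roformer import BSRoformer

# a freshly initialized model, just to illustrate shapes (not trained)
model = BSRoformer(
    dim = 384,
    depth = 12,
    num_stems = 4,
    time_transformer_depth = 1,
    freq_transformer_depth = 1
)

mixture = torch.randn(2, 352800)     # (batch, time)

with torch.no_grad():
    stems = model(mixture)           # no target passed: returns the separated audio

print(stems.shape)                   # assumed: (batch, num_stems, time), e.g. (2, 4, 352800)
```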