ZFTurbo opened 11 months ago
there are a bunch of techniques
gradient accumulation is what you can use immediately, but also look into gradient checkpointing etc
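A minimal sketch of gradient accumulation in plain PyTorch (the model, optimizer, and data below are placeholders for illustration, not the repo's actual training loop):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(16, 1)            # placeholder model, not the real network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
w_before = model.weight.detach().clone()

accum_steps = 4                     # effective batch = micro_batch * accum_steps
micro_batches = [(torch.randn(2, 16), torch.randn(2, 1)) for _ in range(accum_steps)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y) / accum_steps   # scale so gradients average
    loss.backward()                             # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:           # update once per effective batch
        optimizer.step()
        optimizer.zero_grad()
```

Gradient checkpointing is the complementary trick: wrapping blocks in `torch.utils.checkpoint.checkpoint` trades recomputation time for activation memory.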
I use gradient accumulation now. But in the paper the authors report the model fits in 32 GB with batch size = 6.
how many GPUs do you have?
ah, well I can't help you with that, but this is a common issue and you ought to be able to figure it out with the resources online
the biggest gun I can bring out would be reversible networks, which could work for this architecture. maybe at a later date. for now, maybe accumulate gradients and wait it out?
are you using mixed precision with flash attention turned on?
I use mixed precision, but I'm not sure about Flash Attention.
I usually train on 1 GPU )
def turn on flash attention if you have the right hardware
it is some flag on init
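The exact flag name depends on the implementation (check the repo's init signature). Under the hood such flags usually route to PyTorch's fused `scaled_dot_product_attention` (available since torch 2.0), which you can sanity-check directly; the shapes here are arbitrary:

```python
import torch
import torch.nn.functional as F

# Shapes are (batch, heads, seq_len, head_dim); numbers are arbitrary.
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# On supported GPUs/dtypes the flash backend of this call avoids
# materializing the full (seq_len x seq_len) score matrix, which is
# where most of the attention memory goes.
out = F.scaled_dot_product_attention(q, k, v)
```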
Wow! Reversible networks would be cool
@ZFTurbo Are we confident that Mel-RoFormer was trained on stereo stems? (If not, that would halve their effective batch size, if they trained on mono stems.) Perhaps they didn't because: a) Bandsplit-RNN didn't either, as far as I know, and b) training individual models per stem suggests that, like Bandsplit-RNN, they didn't add the multistem/multichannel functionality that @lucidrains did.
HTDemucs and variants, on the other hand, were trained on all stems simultaneously and then fine-tuned on individual stems. This leads to models 4x the size, but I am a little surprised the authors didn't include this seemingly easy win, given how many GPUs they used for training.
@ZFTurbo how do you collect the songs?
@ZFTurbo Do you intend to release a trained model? Looking forward to test this approach against Demucs, but I don't have the means or knowledge to train a model.
I posted some weights here: https://github.com/ZFTurbo/Music-Source-Separation-Training/tree/main?tab=readme-ov-file#vocal-models
But the SDR is not really great
Thanks a lot! I really appreciate your work. So, the crown still goes to "MDX23C for vocals + HTDemucs4 FT" currently, right? How do you think they achieved such a high score in MDX'23?
Only BS-Roformer was evaluated during MDX23 by ByteDance-SAMI, Mel-Roformer came after the contest.
Oh, I see. I got caught up with the names. Are you aware of any trained BS-RoFormer model?
I'm trying to reproduce the paper's model, but with no luck.
My current settings, which give me a batch size of only 2 with 48 GB of memory:
As input I give 8 seconds at 44100 Hz, so the length is 352800 samples.
I run my model through torchinfo:
Report is:
From the report I expect to fit a batch size of more than 48, but in the end I can only use a batch size of 2.
GPU memory usage for batch = 2:
To follow the paper I must increase dim to 384, increase depth to 12, and decrease stft_hop_length to 441 (10 ms). In that case the batch size will be only 1, or the model won't fit in memory at all )
Any ideas how to deal with such big memory usage?
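For intuition on why the 10 ms hop is so costly: vanilla attention memory grows quadratically with the number of STFT frames, and halving the hop doubles the frame count. A quick back-of-the-envelope check (352800 samples and hop 441 come from this thread; the 882 hop is an assumed baseline for comparison):

```python
# Frame counts for an 8 s chunk at 44100 Hz.
n_samples = 44100 * 8                     # 352800, as in the thread

for hop in (882, 441):                    # assumed ~20 ms baseline vs paper's 10 ms
    frames = n_samples // hop + 1         # approximate STFT frame count
    print(f"hop={hop}: ~{frames} frames")
```

Doubling the frames roughly quadruples the score-matrix memory of vanilla attention, which is why flash attention and gradient checkpointing matter even more at the paper's hop length.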