lucidrains / BS-RoFormer

Implementation of Band-Split RoFormer, the SOTA attention network for music source separation from ByteDance AI Labs
MIT License

Clarifications Needed for Reproducing BS-RoFormer SDR Performance #34

Open EuiYeonKim opened 1 month ago

EuiYeonKim commented 1 month ago

Hello,

Firstly, thank you for making the code publicly available. I am attempting to reproduce the results of the BS-RoFormer paper to achieve similar SDR performance and am referencing your code to do so. However, I am encountering significantly lower performance during reproduction and have a few questions. I would appreciate it if you could answer the following queries:

The hyperparameters I used are as follows:

- dim: 384
- depth: 6
- stereo: True
- num_stems: 1
- time_transformer_depth: 2
- freq_transformer_depth: 2
- dim_head: 64
- heads: 8
- ff_dropout: 0.1
- attn_dropout: 0.1
- flash_attn: True
- mask_estimator_depth: 2

Here are my questions:
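For reference, these settings correspond to a constructor call along the following lines (a sketch assuming the `BSRoformer` class exported by this repo, with keyword names matching the list above; this is my configuration, not a confirmed training recipe from the authors):

```python
from bs_roformer import BSRoformer

# hyperparameters as listed above (my reproduction attempt, not the paper's confirmed config)
model = BSRoformer(
    dim = 384,
    depth = 6,
    stereo = True,
    num_stems = 1,
    time_transformer_depth = 2,
    freq_transformer_depth = 2,
    dim_head = 64,
    heads = 8,
    ff_dropout = 0.1,
    attn_dropout = 0.1,
    flash_attn = True,
    mask_estimator_depth = 2
)
```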

1. Could you please share the model and training hyperparameters you used during training?
2. The paper mentions using a complex spectrogram as input, but I noticed the code uses `torch.view_as_real` to handle the input in a CaC (complex-as-channels) manner. I believe this differs from the paper. Could you explain the reason for this difference?
3. I am running training on an H100 80GB GPU with the hyperparameters above. Despite only slight differences from the paper's hyperparameters, a batch size of 4 fills up the 80GB of memory. Could you let me know what batch size you used, and whether you took any additional steps to improve memory efficiency?

For reference, I used 44.1kHz, 8-second audio for both the target and mixture inputs, as per the paper's setup.

Your answers would be greatly helpful. Thank you very much!
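To make question 2 concrete, here is a minimal sketch of what the CaC handling looks like: `torch.view_as_real` reinterprets a complex spectrogram as a real tensor with a trailing (real, imag) axis, rather than feeding complex values directly. The shapes below are illustrative placeholders, not the repo's actual STFT settings:

```python
import torch

# dummy complex spectrogram of shape (batch, freq, time)
stft = torch.randn(1, 1025, 345, dtype=torch.complex64)

# complex-as-channels view: shape becomes (batch, freq, time, 2),
# where the last axis holds the real and imaginary parts
cac = torch.view_as_real(stft)
print(cac.shape)  # torch.Size([1, 1025, 345, 2])

# view_as_real is a zero-copy view; view_as_complex inverts it exactly
assert torch.equal(torch.view_as_complex(cac), stft)
```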