First, thank you for making the code publicly available. I am using it as a reference while trying to reproduce the SDR results of the BS-RoFormer paper. However, my reproduction performs significantly worse than the paper reports, so I have a few questions and would appreciate your answers.
The hyperparameters I used are as follows:
dim: 384
depth: 6
stereo: True
num_stems: 1
time_transformer_depth: 2
freq_transformer_depth: 2
dim_head: 64
heads: 8
ff_dropout: 0.1
attn_dropout: 0.1
flash_attn: True
mask_estimator_depth: 2
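For convenience, here are the same settings as a Python dict, with two derived quantities that may help when comparing against the paper. This is only a sketch: the layer count assumes each of the `depth` blocks stacks `time_transformer_depth` time layers and `freq_transformer_depth` frequency layers, which is my reading of the code and may be wrong.

```python
# Hyperparameters copied from the list above.
model_config = {
    "dim": 384,
    "depth": 6,
    "stereo": True,
    "num_stems": 1,
    "time_transformer_depth": 2,
    "freq_transformer_depth": 2,
    "dim_head": 64,
    "heads": 8,
    "ff_dropout": 0.1,
    "attn_dropout": 0.1,
    "flash_attn": True,
    "mask_estimator_depth": 2,
}

# Width of the concatenated attention heads (heads * dim_head).
attn_inner_dim = model_config["heads"] * model_config["dim_head"]

# Assumption: every block holds time + frequency transformer layers,
# so the total number of attention layers is depth * (time + freq).
total_attn_layers = model_config["depth"] * (
    model_config["time_transformer_depth"]
    + model_config["freq_transformer_depth"]
)

print(f"attention inner dim: {attn_inner_dim}")
print(f"total attention layers (assumed layout): {total_attn_layers}")
```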
Here are my questions:
Could you please share the model and training hyperparameters you used during training?
The paper describes the input as a complex spectrogram, but the code applies torch.view_as_real and handles the real and imaginary parts in a CaC (complex-as-channels) manner. I believe this differs from the paper. Could you explain the reason for the difference?
I am running the training on an H100 80GB GPU with the above hyperparameters. Even though they differ slightly from the paper's, a batch_size of 4 fills up the 80GB of memory. Could you let me know what batch_size you used and whether you took any additional steps to reduce memory usage? For reference, I used 8-second 44.1kHz audio for both the target and mixture inputs, as per the paper's setup.
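To make the input size concrete, here is a rough shape calculation for one batch. The values of `N_FFT` and `HOP` are placeholders I picked for illustration, not settings from the paper or the repository. It also assumes a centered, one-sided STFT and that the CaC treatment stores each complex bin as two real values (which is what torch.view_as_real does by adding a trailing dimension of size 2).

```python
# Back-of-the-envelope shape arithmetic for one training batch.
SAMPLE_RATE = 44_100   # 44.1kHz, from the post
SECONDS = 8            # 8-second clips, from the post
CHANNELS = 2           # stereo: True
BATCH_SIZE = 4         # from the post
N_FFT = 2048           # assumption, for illustration only
HOP = 512              # assumption, for illustration only


def stft_shape(num_samples: int, n_fft: int, hop: int) -> tuple[int, int]:
    """Return (freq_bins, frames) for a centered, one-sided STFT."""
    freq_bins = n_fft // 2 + 1
    frames = 1 + num_samples // hop
    return freq_bins, frames


num_samples = SAMPLE_RATE * SECONDS
freq_bins, frames = stft_shape(num_samples, N_FFT, HOP)

# CaC: each complex bin becomes a (real, imag) pair, per audio channel.
real_values_per_bin = CHANNELS * 2

input_values = BATCH_SIZE * freq_bins * frames * real_values_per_bin

print(f"samples per channel: {num_samples}")
print(f"STFT shape per channel: {freq_bins} bins x {frames} frames")
print(f"real values in one input batch: {input_values}")
```

The input itself is small under these assumptions, so I assume most of the 80GB goes to intermediate activations rather than to the input tensors.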
Your answers would be greatly helpful. Thank you very much!