sungwon23 / BSRNN


Questions about reproducing the paper's results #2

Closed KarmaYan closed 1 year ago

KarmaYan commented 1 year ago

First of all, thank you for your work. I would like to ask whether this code can reproduce the results described in the paper, and whether it can beat MTFAA and FRCRN in terms of objective scores.

sungwon23 commented 1 year ago

Thank you for your appreciation.

Unfortunately, I implemented this model with parameter settings and a dataset that let me get results within a few days on my single GPU (GTX 1660 SUPER). So this is more of a test of the model structure in a different environment, and it is hard to say that this implementation achieves higher objective scores than those models.

KarmaYan commented 1 year ago

In other words, the current project is consistent with the paper in terms of network structure and so on, but because the training environment and configuration differ, there is no guarantee that the final results will exactly match the paper, right?

sungwon23 commented 1 year ago

Yes, you are right.

KarmaYan commented 1 year ago

Thank you for your patience in answering! I'll close the issue and come back to it later if I have any further questions.

KarmaYan commented 1 year ago

Sorry to bother you again. Do you know the number of parameters and the computational cost of the model? I don't see them mentioned in the papers of this series. And finally, is this algorithm strictly real-time?

sungwon23 commented 1 year ago

The number of parameters and the amount of computation can be checked with torchinfo, though you need to convert the complex input tensor to a real one beforehand. While testing the model I mainly just checked the training time.
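
For reference, here is a minimal sketch of that kind of check with torchinfo. The `BSRNN()` constructor and the input shape (real/imaginary parts stacked as channels) are assumptions for illustration; adjust them to the actual model and STFT settings in this repository.

```python
# Hypothetical sketch: counting parameters and mult-adds with torchinfo.
# "BSRNN" and the dummy input shape below are assumptions, not the exact
# interface of this repository.
import torch
from torchinfo import summary

model = BSRNN()  # assumed model class

# torchinfo works with real tensors, so feed the spectrogram with the real
# and imaginary parts stacked in a channel dimension instead of a complex dtype.
batch, freq_bins, frames = 1, 257, 100
dummy_input = torch.randn(batch, 2, freq_bins, frames)

summary(model, input_data=dummy_input)  # prints parameter count and mult-adds
```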

Regarding real-time operation: this model processes a chunk several seconds long to produce its output, so the current algorithm cannot run in real time. For a real-time implementation, I think the time-axis LSTM computation would have to be changed to a frame-wise (streaming) computation.
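
As an illustration of what frame-wise computation could look like, here is a small sketch: process one frame at a time and carry the LSTM hidden state forward instead of feeding the whole chunk at once. The shapes and the single `nn.LSTM` are stand-ins, not the model's actual time-axis layers.

```python
# Sketch of frame-wise (streaming) LSTM processing with a carried hidden state.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
chunk = torch.randn(1, 100, 64)           # (batch, frames, features)

# Offline: one call over the whole chunk.
offline_out, _ = lstm(chunk)

# Streaming: one frame per call, carrying (h, c) between calls.
state = None
outputs = []
for t in range(chunk.shape[1]):
    frame = chunk[:, t:t + 1, :]          # (batch, 1, features)
    out, state = lstm(frame, state)
    outputs.append(out)
streaming_out = torch.cat(outputs, dim=1)

# For a unidirectional LSTM both paths give (numerically) the same result.
print(torch.allclose(offline_out, streaming_out, atol=1e-5))
```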

KarmaYan commented 1 year ago

Is “checkpoint” the best model you've trained?

sungwon23 commented 1 year ago

Yes.

KarmaYan commented 1 year ago

I experimented with your pre-trained model on my own test dataset. It is excellent at removing noise, but it also removes some normal speech, which leads to rather poor PESQ. Have you run into this situation?

sungwon23 commented 1 year ago

In my personal experience, training on VCTK alone is not sufficient for a general speech enhancement task, especially on poorly recorded (voice) test samples or samples with a different sampling rate. If you want more robust performance, I recommend training the model on a dataset larger than VCTK.

KarmaYan commented 1 year ago

Thank you for your answer. If I have time, I'll try training with the DNS data mentioned in the paper.

KarmaYan commented 1 year ago

Hi, while studying the code I noticed that the subband-splitting module only splits into 31 subbands. I remember the best subband split described in the paper is: "We split the frequency band below 1 kHz by a 100 Hz bandwidth, split the frequency band between 1 kHz and 4 kHz by a 250 Hz bandwidth, split the frequency band between 4 kHz and 8 kHz by a 500 Hz bandwidth, split the frequency band between 8 kHz and 16 kHz by a 1 kHz bandwidth, split the frequency band between 16 kHz and 20 kHz by a 2 kHz bandwidth, and treat the rest as one subband. This results in 41 subbands."

sungwon23 commented 1 year ago

The sample rate used in the paper (https://arxiv.org/abs/2212.00406) is 16 kHz.
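
For reference, a quick count under the splitting rule quoted above shows where the two numbers come from: at 16 kHz only the bands below the 8 kHz Nyquist frequency exist, which leaves 31 subbands. Treating the single leftover band at 16 kHz the same way as the "rest" band in the paper is my assumption.

```python
# Band-count check for the splitting scheme quoted above.
full_band = (
    1000 // 100     # 0-1 kHz in 100 Hz steps   -> 10 bands
    + 3000 // 250   # 1-4 kHz in 250 Hz steps   -> 12 bands
    + 4000 // 500   # 4-8 kHz in 500 Hz steps   ->  8 bands
    + 8000 // 1000  # 8-16 kHz in 1 kHz steps   ->  8 bands
    + 4000 // 2000  # 16-20 kHz in 2 kHz steps  ->  2 bands
    + 1             # the rest as one subband   ->  1 band
)
at_16khz = (
    1000 // 100     # 0-1 kHz                   -> 10 bands
    + 3000 // 250   # 1-4 kHz                   -> 12 bands
    + 4000 // 500   # 4-8 kHz (up to Nyquist)   ->  8 bands
    + 1             # leftover bin(s) as one band (assumption)
)
print(full_band)  # 41
print(at_16khz)   # 31
```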

KarmaYan commented 1 year ago

Hi, I'm very happy now: after training with the DNS data, the model gets excellent test results. However, the loss sometimes becomes NaN during training; have you come across this? If it is convenient, could you add me on WeChat for easier contact?

sungwon23 commented 1 year ago

I'm also glad you got good test results, and sorry, I have never used WeChat.

The cause of a NaN value is hard to identify; it could be the code, the data, or something else, so I can't give you specific advice. My first guess would be that zero-valued data is causing the problem.
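
If it helps, here is a hedged sketch of how one might check that hypothesis: scan the training clips for all-zero (silent) or non-finite samples before training. A dataset yielding `(noisy, clean)` waveform tensors is an assumption about your setup.

```python
# Hypothetical sanity check for the "zero-valued data" hypothesis.
import torch

def find_bad_clips(dataset):
    """Return (index, which, reason) for silent or non-finite clips."""
    bad = []
    for idx in range(len(dataset)):
        noisy, clean = dataset[idx]          # assumed (noisy, clean) waveforms
        for name, wav in (("noisy", noisy), ("clean", clean)):
            if not torch.isfinite(wav).all():
                bad.append((idx, name, "non-finite samples"))
            elif wav.abs().max() == 0:
                bad.append((idx, name, "all-zero (silent) clip"))
    return bad

# Example usage (train_dataset is a placeholder name):
# for idx, which, reason in find_bad_clips(train_dataset):
#     print(idx, which, reason)
```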

KarmaYan commented 1 year ago

After some initial analysis: since I am using a large training set and the model learns a lot in each epoch, decaying the learning rate by only a factor of 0.98 every ten epochs may be causing the network gradients to explode.
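
For reference, the schedule being described would look roughly like the sketch below in PyTorch; whether the repository uses `StepLR` or another scheduler is an assumption. The point is simply that this decays very slowly.

```python
# Sketch of the decay schedule discussed above: multiply the learning rate by
# 0.98 every 10 epochs. Model/optimizer here are placeholders.
import torch

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.98)

for epoch in range(100):
    # ... one epoch of training would go here ...
    optimizer.step()       # placeholder for the per-batch updates
    scheduler.step()

print(scheduler.get_last_lr())  # ~8.2e-4 after 100 epochs, i.e. a very slow decay
```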

KarmaYan commented 1 year ago

Hi, I found that as the discriminator trains, the model does improve PESQ on the test data, but at the same time its denoising ability decreases. Is this normal? Can such a problem be solved by changing the loss function?

KarmaYan commented 1 year ago

Changing the loss function does improve the noise reduction ability of the algorithm to some extent, but this usually comes at the cost of speech retention. Is there any way to improve both NMOS and SMOS? I'm currently stuck at a bottleneck and look forward to your ideas.

KarmaYan commented 1 year ago

Sorry to bother you again. While comparing the code with the paper, I found that the paper computes the Lg loss on the complex-valued spectrogram of the signal, whereas the code uses the compressed spectrogram. May I ask what the reasoning behind this is? Will it affect the performance of the algorithm?

sungwon23 commented 1 year ago

1. As you already know, PESQ and other objective metrics are not identical to MOS, and bridging the gap between these two kinds of metrics is a very hard problem. You can use another metric as the target loss to improve your target score, or try general approaches to enhancing model performance, such as using more layers.
2. The compressed spectrogram comes from CMGAN. I have used this method several times and it gave slightly improved results most of the time. I have not had a case where this method was the cause of a problem (see the sketch below).
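
For reference, a minimal sketch of power-law magnitude compression in the CMGAN style mentioned above: compress the magnitude and keep the phase. The exponent 0.3 and the STFT settings are assumptions, not necessarily the exact values used in this repository.

```python
# Sketch of CMGAN-style power-law spectrogram compression (exponent assumed 0.3).
import torch

def compressed_spectrogram(wave, n_fft=512, hop=128, power=0.3):
    window = torch.hann_window(n_fft, device=wave.device)
    spec = torch.stft(wave, n_fft, hop_length=hop, window=window,
                      return_complex=True)        # complex STFT
    mag, phase = spec.abs(), torch.angle(spec)
    return torch.polar(mag ** power, phase)       # compressed magnitude, same phase

wave = torch.randn(1, 16000)                      # 1 s of dummy 16 kHz audio
spec_c = compressed_spectrogram(wave)
print(spec_c.shape, spec_c.dtype)                 # (1, 257, 126), complex64
```
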
unemployed-denizen commented 1 year ago

I think you could simply use gradient clipping. As the authors mention, they clip the gradients to 5 in this paper:

https://arxiv.org/pdf/2209.15174.pdf
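
A minimal sketch of that suggestion inside a training step, using PyTorch's `clip_grad_norm_` with a maximum norm of 5 as mentioned above; the model, loss, and optimizer names are placeholders.

```python
# Gradient clipping inside a training step, with max_norm=5 as suggested above.
import torch

def training_step(model, optimizer, loss_fn, noisy, clean):
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), clean)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()
```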