ruizhecao96 / CMGAN

Conformer-based Metric GAN for speech enhancement
MIT License
309 stars · 60 forks

Can not reproduce the results #33

Closed · hbwu-ntu closed this issue 1 year ago

hbwu-ntu commented 1 year ago

Hi! Your paper and code are excellent! I have learned a lot about speech enhancement from the paper, and I find your code to be very well-structured and clear. Thank you so much!

However, I cannot reproduce the results in your paper, and I would like to confirm a few settings for running the experiments:

  1. For the loss_weights, do you use the setting from your paper or the setting in your GitHub repo?
  2. For the number of epochs, do you use 50 as in the paper or 120 as in the GitHub repo?
  3. How do you select the final model for inference?
  4. Why do you set the utterance length to 16 * 16000 during testing?
  5. How do you downsample the audio? Could you share the script?
wen0320 commented 1 year ago

Hi, I can't reproduce the results either. What PESQ have you reached so far?

hbwu-ntu commented 1 year ago

Very low, only 3.2, which is far below the paper's number. By the way, what PESQ did you get?

wen0320 commented 1 year ago

I trained with the parameters published in the code, and the data processing follows the TSTNN code. At epoch 50, the PESQ is 3.24; after that, gen_loss starts to rise and the model does not converge.
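
For context, a PESQ score like the 3.24 above can be computed with the `pesq` package from PyPI. A minimal sketch, assuming 16 kHz wideband scoring (as reported in the paper) and placeholder file paths:

```python
import librosa
from pesq import pesq

# Load the clean reference and the enhanced output at 16 kHz.
# The file paths here are placeholders.
clean, sr = librosa.load("clean/p232_001.wav", sr=16000)
enhanced, _ = librosa.load("enhanced/p232_001.wav", sr=16000)

# Wideband PESQ (ITU-T P.862.2); "nb" would give narrowband scores.
score = pesq(sr, clean, enhanced, "wb")
print(f"PESQ: {score:.2f}")
```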

hbwu-ntu commented 1 year ago

Thanks for clarifying the implementation details. It seems our results are similar. Do you downsample the data to 16 kHz the same way as the CMGAN GitHub repo?

wen0320 commented 1 year ago

No, I downsampled the original VoiceBank+DEMAND data to 16 kHz myself. I suspect the results in the paper cannot be reproduced because of the hyperparameters and the learning rate.

hbwu-ntu commented 1 year ago

Apart from the learning rate, which other hyperparameters do you consider crucial for reproducing the results? I recall that the authors specify the learning rates in the paper, but in my experience I cannot reproduce the results with that learning rate.

wen0320 commented 1 year ago

I think the learning rate in this code is more suitable for speech separation models such as the classic Conv-TasNet. The loss weighting is also very important, but I don't know what weight to assign to each loss term to achieve the optimal value.
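
To make the weighting question concrete, here is a minimal sketch of how such a weighted multi-term loss is typically combined. The term names and weight values below are illustrative placeholders, not the authors' settings:

```python
def combined_loss(loss_mag, loss_ri, loss_time, loss_gan,
                  w_mag=0.9, w_ri=0.1, w_time=0.2, w_gan=1.0):
    # Weighted sum of the individual loss terms; these weights are the
    # hyperparameters in question, and the values here are placeholders.
    return (w_mag * loss_mag + w_ri * loss_ri
            + w_time * loss_time + w_gan * loss_gan)
```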

SherifAbdulatif commented 1 year ago

Sorry about that, but this was never an issue for us, and we haven't received this complaint about PESQ before. Did you try the checkpoint in src/best_ckpt?

SherifAbdulatif commented 1 year ago

> Hi! Your paper and code are excellent! I have learned a lot about speech enhancement from the paper, and I find your code to be very well-structured and clear. Thank you so much!
>
> However, I cannot reproduce the results in your paper, and I would like to confirm a few settings for running the experiments:
>
>   1. For the loss_weights, do you use the setting from your paper or the setting in your GitHub repo?
>   2. For the number of epochs, do you use 50 as in the paper or 120 as in the GitHub repo?
>   3. How do you select the final model for inference?
>   4. Why do you set the utterance length to 16 * 16000 during testing?
>   5. How do you downsample the audio? Could you share the script?
  1. For reproducing the results, you can use the checkpoint in src/best_ckpt.
  2. The weights are the same as in the paper; you can find more details here.
  3. We select the model when the loss saturates, which can happen between 50 and 75 epochs.
  4. In testing, the length is variable, but 10 is the maximum that can run on our GPU; otherwise we need to split the track.
  5. We already saved the data downsampled; you can download it from https://drive.google.com/file/d/1pGV79T3k030f6uc2SbUpuNhfovtmLJxN/view?usp=sharing. Alternatively, we used librosa, which follows the same downsampler as torch:

```python
import librosa

audio_down, sr = librosa.load(audio_path, sr=16000)
```
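
If the downsampled files also need to be written back to disk, a minimal sketch extending that snippet would look like the following. The `soundfile` dependency and the file paths are assumptions, not part of the repo:

```python
import librosa
import soundfile as sf

# Resample one file to 16 kHz and save it; the paths are placeholders.
audio_down, sr = librosa.load("wav48/p232_001.wav", sr=16000)
sf.write("wav16/p232_001.wav", audio_down, sr)
```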
hbwu-ntu commented 1 year ago

Thank you very much for your warm and detailed response. I will follow the instructions provided and make an effort to reproduce the results.

By the way, I have some follow-up questions:

  1. During testing, why do you set the length as variable? With the batch size fixed at 1 for testing, there shouldn't be any GPU OOM issues.
  2. I downloaded your checkpoint from src/best_ckpt and ran the test. The numbers are slightly worse than, but close to, the results in your paper, and I am still attempting to reproduce them from scratch. Have you run multiple trials and computed the mean and variance of the performance? This would help ensure the positive results are not solely due to a good initialization.
SherifAbdulatif commented 1 year ago

Variable length avoids any normalization issues from splitting the tracks, and it is much more convenient than padding tracks to a predefined maximum length or splitting tracks that exceed it. No, actually not: the results in the paper are from the best checkpoint, not multiple trials. However, your point is a very interesting insight and should be considered in our future studies. Thanks!
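
For illustration, a minimal sketch of the chunked alternative that variable-length evaluation avoids. The model interface and chunk size here are assumptions, not the repo's actual code:

```python
import torch

def enhance_in_chunks(model, audio, chunk_len=10 * 16000):
    # Enhance a long track in fixed-size chunks and concatenate the
    # outputs. Each chunk is processed independently, which is the
    # boundary/normalization mismatch that variable-length evaluation
    # sidesteps by passing the whole track at once.
    outputs = []
    for start in range(0, audio.shape[-1], chunk_len):
        chunk = audio[..., start:start + chunk_len]
        with torch.no_grad():
            outputs.append(model(chunk))
    return torch.cat(outputs, dim=-1)
```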

SherifAbdulatif commented 1 year ago

However, it is worth mentioning that, based on several training trials, the results are fairly consistent.