Closed: Russell-Izadi-Bose closed this issue 5 months ago.
Hi, thank you for your question! This work focuses on generative SE methods and highlights the condition collapse problem in conditional diffusion models, so most experiments (baselines, ablations) compare against generative models... and yes, DR-DiffuSE should be better than Base.
The pre-trained model can be downloaded here: https://github.com/judiebig/DR-DiffuSE/releases/tag/v1.0.0 (see the "Releases" section on the right side of https://github.com/judiebig/DR-DiffuSE).
We use joint_finetune.py (fine-tuning for two additional epochs) and find that performance improves slightly.
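For context, the joint fine-tuning step amounts to updating the two models together for a couple of extra epochs. Below is a minimal control-flow sketch, not the repo's actual code; all names here are hypothetical stand-ins (joint_finetune.py in the repository is the real implementation):

```python
# Hypothetical sketch of joint fine-tuning (not the repo's actual code):
# the Base model supplies the condition signal, and the diffusion model
# is trained to refine its estimate using that signal.
def joint_finetune(base_step, diffusion_step, batches, epochs=2):
    losses = []
    for _ in range(epochs):
        for batch in batches:
            cond = base_step(batch)             # Base provides the condition
            loss = diffusion_step(batch, cond)  # diffusion refines with it
            losses.append(loss)
    return losses

# Toy stand-ins just to show the control flow
base = lambda b: b * 2.0
diffusion = lambda b, c: abs(b - c)
history = joint_finetune(base, diffusion, [1.0, 2.0], epochs=2)
```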
The results are:

| Dataset | Model | CSIG | CBAK | COVL | PESQ | SSNR | STOI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VoiceBank | Base | 4.390 | 3.587 | 3.764 | 3.088 | 9.795 | 0.948 |
| VoiceBank | DR-DiffuSE | 4.376 | 3.549 | 3.742 | 3.064 | 9.382 | 0.949 |
| CHiME-4 | Base | 3.084 | 2.628 | 2.443 | 1.854 | 5.325 | 0.920 |
| CHiME-4 | DR-DiffuSE | 3.109 | 2.653 | 2.473 | 1.882 | 5.393 | 0.923 |
You can retest our released model, and feel free to ask me if you have any further questions!
The main contributions of this work are two-fold: identifying the "condition collapse" problem in conditional diffusion models, and proposing several strategies to address it (a more elaborate architecture, explicit diffusion guidance, and refining the output with a deterministic model).
However, in hindsight, I am not fully satisfied with how this work addresses the condition collapse problem. The core component is Base (it provides good condition signals and refines the diffusion model's output), so the capacity of the framework is mainly constrained by Base. Moreover, some comparisons in Table 2 are not fair: we cannot claim that generative methods are better when the methods being compared use different architectures. All methods should share the same architecture, as we did in DOSE, our NeurIPS '23 paper.
BTW, this work can also be viewed as using a generative model to produce augmented data, which is then used to train a more robust and generalizable deterministic model (Base). This procedure is very sensitive to the quality and diversity of the generated data, so it is not stable. In addition, the ratio of the metric gains achieved to the cost of generating the data is quite low, which makes the approach impractical. I would instead recommend reading DOSE.
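The augmented-data view above can be sketched as follows. This is a toy illustration under my own naming, not code from the repo: a generative model synthesizes extra noisy/clean pairs, and the deterministic model is then trained on a mix of real and synthetic pairs.

```python
import random

# Toy sketch of the augmented-data view (hypothetical names, not repo code).
def generate_augmented_pairs(generator, clean_samples, n, seed=0):
    """Synthesize n extra (noisy, clean) training pairs with a generative model."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        clean = rng.choice(clean_samples)
        noisy = generator(clean, rng)  # stand-in for sampling from a diffusion model
        pairs.append((noisy, clean))
    return pairs

def build_training_set(real_pairs, synth_pairs, synth_ratio=0.5):
    """Mix real and synthetic pairs. Stability hinges on the quality and
    diversity of synth_pairs, which is the fragility noted above."""
    n_synth = int(len(real_pairs) * synth_ratio)
    return real_pairs + synth_pairs[:n_synth]

# Toy stand-in "generator": just perturbs the clean signal.
toy_generator = lambda clean, rng: clean + rng.gauss(0.0, 0.1)
real = [(1.1, 1.0), (2.2, 2.0)]
synth = generate_augmented_pairs(toy_generator, [1.0, 2.0], n=4)
training_set = build_training_set(real, synth, synth_ratio=1.0)
```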
I have trained the Base model, and it matches the results reported in the paper. However, it does not use any diffusion process, yet it is reported as DR-DiffuSE in the paper. Should I expect the model to perform better than the results in the paper? What is the source of the confusion here?
Also, the README says pre-trained models are uploaded. Where can I find them?
Thanks!