iamycy / diffwave-sr

https://iamycy.github.io/diffwave-sr/
MIT License
79 stars 8 forks

About LSD metric #8

Open QA-MDT opened 2 months ago

QA-MDT commented 2 months ago

Hi, I am currently evaluating audio super-resolution metrics, and I am wondering whether the LSD metric can lead to sub-optimal conclusions. For example, the STFT image below shows three systems (the first is the ground truth; the other two are the outputs of two super-resolution systems). It seems obvious that the second one is better than the third, yet it receives a worse LSD. As mentioned in AudioSR, LSD scores can also disagree with subjective MOS scores, so I am wondering about a replacement for, or an analysis of, this metric. Thanks a lot again for your excellent work.

[image: STFT spectrograms of the three systems]
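For context, LSD (log-spectral distance) is usually computed as the RMS difference of log power spectra per STFT frame, averaged over frames. Below is a minimal sketch of one common variant; the FFT size, hop size, and window are my own choices for illustration, not values taken from this repo:

```python
import numpy as np

def lsd(ref, est, n_fft=2048, hop=512, eps=1e-10):
    """Log-spectral distance: per STFT frame, the RMS (over frequency
    bins) of the difference of log10 power spectra, averaged over frames."""
    def power_spec(x):
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
        spec = np.fft.rfft(frames * np.hanning(n_fft), axis=-1)
        return np.abs(spec) ** 2

    ref_s, est_s = power_spec(ref), power_spec(est)
    diff = np.log10(ref_s + eps) - np.log10(est_s + eps)
    return np.sqrt((diff ** 2).mean(axis=-1)).mean()

x = np.random.randn(16000)
print(lsd(x, x))  # -> 0.0 for identical signals
```

Because the log is taken before averaging, even low-energy errors in the ultra-high-frequency bins contribute fully, which is exactly the sensitivity being discussed in this thread.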

yoyolicoris commented 2 months ago

Unless the methods are all very close to each other, LSD is quite enough to indicate relative performance differences. Personally, I don't think the second one looks better than the third, so I'm not surprised it has a bad LSD.

QA-MDT commented 2 months ago

Thanks for your response. I have a few other small questions. In my understanding, MCG is used for unconditional models, where $\nabla \log p(x|y) = \nabla \log p(x) + \nabla \log p(y|x)$ and MCG estimates the $\nabla \log p(y|x)$ term. So I am wondering why it can also be applied to conditional models such as "nu-wave2"; this may be a misunderstanding of the constraint on my part. Thanks again, and I look forward to your reply!

QA-MDT commented 2 months ago

> Unless the methods are all very close to each other, LSD is quite enough to indicate relative performance differences. Personally, I don't think the second one looks better than the third, so I'm not surprised it has a bad LSD.

Yes; however, in subjective experiments and on other objective metrics such as SI-SNR, PSNR, and SSIM, system 2 outperforms system 3. So I can accept that LSD is a useful and accurate metric and that system 2 has not reached good enough performance, but I still don't think such point-to-point metrics are a good measure for super-resolution tasks. (For example, noise often appears in the ultra-high-frequency part of the spectrogram, which can significantly affect the judgment of model performance.)
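For readers comparing the metrics named here, this is a minimal sketch of one common SI-SNR definition (scale-invariant signal-to-noise ratio): the estimate is projected onto the reference, and the energy of the projected part is compared with the residual. The centering and `eps` choices are my own:

```python
import numpy as np

def si_snr(ref, est, eps=1e-8):
    """Scale-invariant SNR in dB: project est onto ref, then compare
    the projected (signal) part against the residual (noise) part."""
    ref = ref - ref.mean()
    est = est - est.mean()
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return 10 * np.log10((s_target ** 2).sum() / ((e_noise ** 2).sum() + eps))

x = np.sin(np.linspace(0, 100, 16000))
print(si_snr(x, 2 * x))                                # very high: gain is ignored
print(si_snr(x, x + 0.1 * np.random.randn(16000)))     # drops as noise is added
```

Like LSD, this is a point-to-point metric, so it inherits the same limitation raised above: it penalizes any deviation from the reference, even perceptually plausible high-frequency content.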

yoyolicoris commented 2 months ago

> Thanks for your response. I have a few other small questions. In my understanding, MCG is used for unconditional models, where $\nabla \log p(x|y) = \nabla \log p(x) + \nabla \log p(y|x)$ and MCG estimates the $\nabla \log p(y|x)$ term. So I am wondering why it can also be applied to conditional models such as "nu-wave2"; this may be a misunderstanding of the constraint on my part. Thanks again, and I look forward to your reply!

Assuming the approximation of $p(x|y)$ is accurate enough, the conditional score after applying MCG becomes $\nabla \log p(x) + 2 \nabla \log p(y|x)$, with the emphasis shifted toward fitting the likelihood function $p(y|x)$. Prior work empirically shows that this gets better results (see the classifier guidance section of https://arxiv.org/abs/2207.12598).
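Spelled out, the identity behind this exchange is just Bayes' rule applied to the score; the final line is my reading of why the likelihood term ends up doubled when an MCG-style correction is added on top of a conditional model (a sketch, not a claim about the paper's exact derivation):

```latex
% Bayes' rule, taking \nabla_x \log of p(x \mid y) \propto p(x)\, p(y \mid x):
\nabla_x \log p(x \mid y) = \nabla_x \log p(x) + \nabla_x \log p(y \mid x)
% A conditional model such as nu-wave2 already estimates the whole
% left-hand side; adding an extra measurement-gradient (MCG) term
% contributes one more likelihood gradient, giving approximately
\nabla_x \log p(x) + 2\,\nabla_x \log p(y \mid x)
```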

QA-MDT commented 2 months ago

> Thanks for your response. I have a few other small questions. In my understanding, MCG is used for unconditional models, where $\nabla \log p(x|y) = \nabla \log p(x) + \nabla \log p(y|x)$ and MCG estimates the $\nabla \log p(y|x)$ term. So I am wondering why it can also be applied to conditional models such as "nu-wave2"; this may be a misunderstanding of the constraint on my part. Thanks again, and I look forward to your reply!

> Assuming the approximation of $p(x|y)$ is accurate enough, the conditional score after applying MCG becomes $\nabla \log p(x) + 2 \nabla \log p(y|x)$, with the emphasis shifted toward fitting the likelihood function $p(y|x)$. Prior work empirically shows that this gets better results (see the classifier guidance section of https://arxiv.org/abs/2207.12598).

Thank you for your thorough explanation!

QA-MDT commented 2 months ago

Sorry to bother you again. I am also confused about this section of code in `reverse_manifold`. Specifically, from your paper I would write `mu -= lr * (g - F_h/2(g))` (1), but in your code I found `mu -= lr * g` (2), while `mu -= F_h/2(mu); mu += F_h/2(z_t) * var_st[s] / alpha_st[s] + alpha[s] * c[s] * y_hat` is doing the inpainting, right? So I am confused about the mismatch between (1) and (2). Thanks!

QA-MDT commented 2 months ago

I ran into another question: why do you segment and overlap the original audio waveform when calculating the gradient? I found that your segment size is 144000 // 2 = 72000, while your training window size is 32768; do they have some relationship?

yoyolicoris commented 2 months ago

> Sorry to bother you again. I am also confused about this section of code in `reverse_manifold`. Specifically, from your paper I would write `mu -= lr * (g - F_h/2(g))` (1), but in your code I found `mu -= lr * g` (2), while `mu -= F_h/2(mu); mu += F_h/2(z_t) * var_st[s] / alpha_st[s] + alpha[s] * c[s] * y_hat` is doing the inpainting, right? So I am confused about the mismatch between (1) and (2). Thanks!

The code mixes the inpainting and MCG steps for efficiency. Basically, the steps `mu -= lr * g` and `mu -= F_h/2(mu)` combined are (1) plus the first step of inpainting.
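This equivalence follows from linearity: a quick numerical check, assuming `F_h/2` is a linear operator (as a filter or projection would be), with a random matrix `F` standing in for it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lr = 8, 0.1

# Random linear map standing in for the (linear) operator F_h/2.
A = rng.standard_normal((n, n))
F = lambda v: A @ v

mu = rng.standard_normal(n)
g = rng.standard_normal(n)

# Code path: mu -= lr * g, then mu -= F(mu)
code = mu - lr * g
code = code - F(code)

# Paper path: update (1), then subtract F(mu) (first inpainting step)
paper = mu - lr * (g - F(g))
paper = paper - F(mu)

print(np.allclose(code, paper))  # -> True
```

Expanding both: `(mu - lr*g) - F(mu - lr*g) = mu - lr*g + lr*F(g) - F(mu)`, which is exactly (1) followed by removing the `F_h/2` component of `mu`, i.e. the part that the subsequent inpainting line then replaces with the observed content.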

yoyolicoris commented 2 months ago

> I ran into another question: why do you segment and overlap the original audio waveform when calculating the gradient? I found that your segment size is 144000 // 2 = 72000, while your training window size is 32768; do they have some relationship?

The numbers were set empirically, and the main concern is the available GPU memory. If you have more VRAM, I think you can safely increase the segment size.
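For anyone reproducing this, the pattern being described is chunked processing with overlap-add recombination to bound peak memory. A minimal sketch follows; only the 72000 segment size comes from the thread, while the 50% overlap, the triangular crossfade, and the function names are my own assumptions for illustration (and the sketch assumes the signal length divides evenly into the hop):

```python
import numpy as np

def process_in_segments(x, fn, seg=72000, overlap=0.5):
    """Apply fn to overlapping segments of x and recombine with a
    triangular crossfade, so peak memory is bounded by the segment size."""
    hop = int(seg * (1 - overlap))
    out = np.zeros_like(x)
    weight = np.zeros_like(x)
    win = np.bartlett(seg)  # triangular crossfade window
    for start in range(0, max(len(x) - seg, 0) + 1, hop):
        out[start:start + seg] += win * fn(x[start:start + seg])
        weight[start:start + seg] += win
    return out / np.maximum(weight, 1e-8)

x = np.random.randn(144000)
y = process_in_segments(x, lambda chunk: 2 * chunk)  # toy per-segment op
```

For a pointwise `fn` like the toy one above, the recombined output matches processing the whole signal at once (up to the zero-weighted endpoints); for a model with receptive-field context, the overlap is what hides the segment boundaries.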