Two questions about DiffVC

Hello, and thank you for sharing this excellent work. After briefly browsing the code, I have two questions:

(1) What is x_ref used for? During training it seems to be a different fragment of the same mel-spectrogram as x. Which part of the paper does it correspond to?

(2) Why do we need to take a weighted sum of mean and x? Does this mean that reverse diffusion at inference starts from the weighted mean_x?

I'm new to diffusion models and don't fully understand the theory in the paper, so apologies if these questions are basic.

huawei-noah / Speech-Backbones

Two questions about DiffVC #31