@rishikksh20 they may be using a slightly different formulation than ddpm. the big idea is that they use the predict-x0 objective, same as stable diffusion. i will just stick with the ddpm formulation since, afaik, there's not a big difference
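for concreteness, a minimal sketch of the predict-x0 (data) objective, assuming a forward process z_t = α_t z_0 + σ_t ε (names are illustrative, not this repo's exact api):

```python
import torch
import torch.nn.functional as F

def data_loss(model, z0, t, alpha_t, sigma_t):
    noise = torch.randn_like(z0)
    z_t = alpha_t * z0 + sigma_t * noise  # diffuse the clean latent
    pred_z0 = model(z_t, t)               # network predicts z0 directly
    return F.mse_loss(pred_z0, z0)        # "data loss" on the clean target
```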
the only novelty of the paper is this new rvq cross entropy (CE-RVQ) loss, which is taken care of in a separate repository
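a rough sketch of the CE-RVQ idea, purely for illustration (shapes and interface are my assumptions): for each residual quantizer, turn negative distances from the current predicted residual to the codebook vectors into logits, and apply cross entropy against the ground-truth code index

```python
import torch
import torch.nn.functional as F

def ce_rvq_loss(pred_z0, codebooks, true_indices):
    # pred_z0:      (batch, frames, dim)            predicted clean latent
    # codebooks:    (num_quantizers, codes, dim)    frozen codec codebooks
    # true_indices: (num_quantizers, batch, frames) ground-truth code ids
    batch = pred_z0.shape[0]
    residual, loss = pred_z0, 0.
    for codebook, indices in zip(codebooks, true_indices):
        # distance of the current residual to every codebook vector,
        # negated so that closer codes get higher logits
        logits = -torch.cdist(residual, codebook.unsqueeze(0).expand(batch, -1, -1))
        loss = loss + F.cross_entropy(logits.transpose(1, 2), indices)
        # peel off the ground-truth quantized vector before the next stage
        residual = residual - codebook[indices]
    return loss / len(codebooks)
```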
the rest of the paper can be summed up as stable diffusion for audio, with residual quantized vectors from a soundstream (may be interesting to backport this idea to images, actually)
@rishikksh20 they are probably using the formulation from this paper, which was concurrent with Jonathan Ho's ddpm. i am using the continuous-time formulation of ddpm that is used predominantly at Google Brain. if you read Yang Song's latest paper (he is at OpenAI now), it seems he has moved on to the elucidated ddpm formulation (Karras et al.)
Yeah, NaturalSpeech 2 is based on score-based SDE and is heavily inspired by Grad-TTS. The authors haven't explained why they prefer to use both the data loss and the score loss; usually a diffusion model is trained on either the data loss (if it predicts x0) or the score loss (if it predicts eps).
@rishikksh20 i prefer the predict-velocity objective from this paper these days; i have heard anecdotally that it works well. it is included in this repository as `objective = 'v'`
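for reference, a minimal sketch of the v objective, assuming z_t = α_t x_0 + σ_t ε and v = α_t ε − σ_t x_0 (illustrative names, not this repo's exact api):

```python
import torch
import torch.nn.functional as F

def v_objective_loss(model, x0, t, alpha_t, sigma_t):
    noise = torch.randn_like(x0)
    z_t = alpha_t * x0 + sigma_t * noise
    v_target = alpha_t * noise - sigma_t * x0  # the "velocity" target
    return F.mse_loss(model(z_t, t), v_target)
```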
I try to implement a research paper as-is in my first iteration and only then start changing things, since the training cost of this model is very high. I have implemented the whole NaturalSpeech 2 architecture, which differs a little from this repo; I even matched the parameter counts between the paper and my implementation, and apart from the pitch predictor all modules matched. This loss question is bothering me; I think I should use only the data loss and the cross-entropy loss and ignore the score loss for now.
@rishikksh20 nice! yea, the more open sourced approaches we have out there the better! in the end, once people start training, we can see whether the score-based SDE formulation is critical. i highly doubt it, but we'll let the experiments do the talking
For the second loss, there is a typo in the original version; you can now find the correct version in the updated arXiv paper: https://arxiv.org/pdf/2304.09116.pdf
Basically, after the formula derivation the data loss and the score loss differ only in some scale coefficients, but we found that adding the score loss improves the generation quality to some extent.
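Concretely, a sketch of that relation under the assumption $q(z_t \mid z_0) = \mathcal{N}(\alpha_t z_0, \lambda_t I)$: plugging the predicted $\hat{z}_0$ into the conditional score gives

$$\hat{s} = -\frac{z_t - \alpha_t \hat{z}_0}{\lambda_t}, \qquad \nabla_{z_t} \log q(z_t \mid z_0) = -\frac{z_t - \alpha_t z_0}{\lambda_t},$$

so the score-matching term reduces to

$$\lVert \hat{s} - \nabla_{z_t} \log q(z_t \mid z_0) \rVert^2 = \frac{\alpha_t^2}{\lambda_t^2} \lVert \hat{z}_0 - z_0 \rVert^2,$$

i.e. the data loss up to the time-dependent scale $\alpha_t^2 / \lambda_t^2$.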
NaturalSpeech 2 can now be extended to support voice conversion and speech enhancement. See Section 5.7 in the new paper version for the details and the demo page for the results.
Hi Katsuya, the RVQ's weights are trained in the codec and are kept fixed in the diffusion model for calculating the CE-RVQ loss.
@tan-xu For the CE-RVQ loss, I find it very memory-intensive, since you need to broadcast residual vectors to the codebook size (1024) for each of the residual quantizers (16). Did you do anything special, like random sampling of quantizers, to mitigate this issue? Otherwise it seems infeasible to run 6K frames on a 32GB GPU.
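Something like the following sketch is what I have in mind for the random-sampling mitigation (shapes and names are illustrative, not from the paper or this repo):

```python
import random
import torch
import torch.nn.functional as F

def sampled_ce_rvq_loss(pred_z0, codebooks, true_indices, num_sampled=4):
    # pred_z0:      (batch, frames, dim)
    # codebooks:    (num_quantizers, codes, dim)
    # true_indices: (num_quantizers, batch, frames)
    batch, num_q = pred_z0.shape[0], codebooks.shape[0]
    # ground-truth quantized vectors are cheap (pure lookups), so the
    # residual entering each quantizer can be precomputed without logits
    quantized = torch.stack([cb[idx] for cb, idx in zip(codebooks, true_indices)])
    prefix = quantized.cumsum(dim=0)  # (num_quantizers, batch, frames, dim)
    loss = 0.
    for q in random.sample(range(num_q), k=num_sampled):
        residual = pred_z0 if q == 0 else pred_z0 - prefix[q - 1]
        # only here is the big (batch, frames, codes) logit tensor built
        logits = -torch.cdist(residual, codebooks[q].unsqueeze(0).expand(batch, -1, -1))
        loss = loss + F.cross_entropy(logits.transpose(1, 2), true_indices[q])
    return loss / num_sampled
```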
@tan-xu Hello, I found that the paper (Section 4.2) says:
The diffusion model contains 40 WaveNet layers [23], which consist of 1D dilated convolution layers with 3 kernel size, 1024 filter size, and 2 dilation size. Specifically, we use a FiLM layer [39] at every 3 WaveNet layers to fuse the condition information processed by the second Q-K-V attention in the prompting mechanism in the diffusion model.
But in Appendix B of the paper, I saw:
the WaveNet consists of 40 blocks. Each block consists of 1) a dilated CNN with kernel size 3 and dilation 2, 2) a Q-K-V attention, and 3) a FiLM layer
Can you clarify for me which is correct?
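For context, this is the kind of block I am picturing from the Appendix B wording (dilated conv, then Q-K-V attention, then FiLM); the dimensions and conditioning interface here are my own guesses, not the paper's exact spec:

```python
import torch
from torch import nn

class WaveNetBlock(nn.Module):
    # one block per the Appendix B wording: dilated CNN (kernel 3,
    # dilation 2), then Q-K-V attention over the prompt, then FiLM
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.dilated_conv = nn.Conv1d(dim, dim, kernel_size=3, dilation=2, padding=2)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_film = nn.Linear(dim, dim * 2)  # produces FiLM scale and shift

    def forward(self, x, prompt):
        # x: (batch, time, dim), prompt: (batch, prompt_len, dim)
        x = x + self.dilated_conv(x.transpose(1, 2)).transpose(1, 2)
        cond, _ = self.attn(x, prompt, prompt)  # queries from x, keys/values from prompt
        scale, shift = self.to_film(cond).chunk(2, dim=-1)
        return x * (scale + 1) + shift          # FiLM modulation
```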
Hi @lucidrains,
Needed a little clarification on the 2nd loss term. As per the authors, the denoiser model predicts $z_0$ rather than the score, so we need to compute the score ourselves for the 2nd loss term. As per the paper, the formula is: $\text{pred score} = \lambda^{-1} (\hat{z}_0 - z_t)$
$\hat{z}_0$ is the output of the denoiser model and $z_t$ is the noisy input, but we also need $\lambda$, the variance of the $p(z_t \mid z_0)$ distribution. So, as per your code, $\lambda$ = `sigma` from https://github.com/lucidrains/naturalspeech2-pytorch/blob/900581e52534cb3451b4f2715bf8ffa6466c84be/naturalspeech2_pytorch/naturalspeech2_pytorch.py#L1140 and the second loss term would be:
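(sketching it in code, assuming $z_t \sim \mathcal{N}(\alpha_t z_0, \lambda I)$ and dropping the $\alpha_t$ scaling on $z_0$ as in the quoted formula; names are illustrative)

```python
import torch
import torch.nn.functional as F

def score_loss(pred_z0, z0, z_t, lam):
    # lam plays the role of λ = Var(p(z_t | z_0)) above
    pred_score = (pred_z0 - z_t) / lam  # λ^{-1} (ẑ0 - z_t), as quoted
    true_score = (z0 - z_t) / lam       # score of N(z0, λ I) at z_t (α_t dropped)
    return F.mse_loss(pred_score, true_score)
```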
So, am I calculating $\lambda$ and the score loss correctly, or am I missing something?
Thanks