lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Apache License 2.0

Is it possible to do unconditional generation on LJSpeech? #40

Closed by cantabile-kwok 1 year ago

cantabile-kwok commented 1 year ago

Just wondering if this is possible. If so, how large should the model be?

sharvil commented 1 year ago

You could do it but I'm not sure you'd get any meaningful output. What are you trying to achieve?

cantabile-kwok commented 1 year ago

@sharvil Actually, I am not trying to achieve anything in particular. I am just curious whether this model has enough capacity to generate samples from such a complex data distribution (human speech audio like LJSpeech) without any conditioning information. I believe this is feasible in theory, but does the model have to be very large to achieve it? Glad to hear your opinion!

sharvil commented 1 year ago

My guess is that you'll be able to generate samples that sound like a human voice similar to LJSpeech but you probably won't be able to make out any words.

You can get speech-like output with relatively small models if you've got the right representation. VQ-VAE produces a reasonable representation because the discretized latents map reasonably well to linguistic units. See the "Sampling from Prior" section here for examples.
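For readers unfamiliar with the "discretized latents" mentioned above, here is a minimal sketch of the quantization step used in VQ-VAE-style models: each continuous latent vector is replaced by the index of its nearest codebook entry. This is only an illustration in plain Python, not code from DiffWave or from any VQ-VAE implementation; the function names (`quantize`, `lookup`) and the toy codebook values are assumptions made up for the example.

```python
# Illustrative sketch of vector quantization (nearest-codebook lookup).
# Names and values are hypothetical; a real model learns the codebook
# and operates on tensors rather than Python lists.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(latents, codebook):
    """Map each continuous latent vector to the index of its nearest codebook entry."""
    indices = []
    for z in latents:
        distances = [squared_distance(z, entry) for entry in codebook]
        indices.append(distances.index(min(distances)))
    return indices

def lookup(indices, codebook):
    """Turn discrete indices back into codebook vectors (the decoder's input)."""
    return [codebook[i] for i in indices]

# Toy codebook with three 2-dimensional entries (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Toy encoder outputs: one latent vector per time step (assumed values).
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]

codes = quantize(latents, codebook)
print(codes)  # prints [0, 1, 2]
print(lookup(codes, codebook))
```

In this setup, sampling from a prior amounts to generating a sequence of such indices with a separate model over the codes and then decoding them, which is why a compact discrete representation can make unconditional speech-like output easier to reach than modeling raw waveform samples directly.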

cantabile-kwok commented 1 year ago

This is very helpful, appreciate it 👍 @sharvil