Closed by cantabile-kwok 1 year ago
You could do it but I'm not sure you'd get any meaningful output. What are you trying to achieve?
@sharvil Actually, I'm not trying to achieve anything in particular. I'm just curious whether this model has enough capacity to generate samples from such a complex data distribution (human speech audio like LJSpeech) without any conditioning information. I believe this is feasible in theory, but does the model have to be very large to achieve it? Glad to hear your opinion!
My guess is that you'll be able to generate samples that sound like a human voice similar to LJSpeech but you probably won't be able to make out any words.
You can get speech-like output with relatively small models if you've got the right representation. VQ-VAE produces a reasonable representation because the discretized latents map reasonably well to linguistic units. See the "Sampling from Prior" section here for examples.
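To make "discretized latents" a bit more concrete, here is a minimal sketch of the nearest-codebook lookup that VQ-style models use to turn continuous encoder outputs into discrete codes. This is not code from any particular VQ-VAE implementation; the function names, the 3-entry codebook, and the latent values are all made up for illustration.

```python
# Toy vector quantization: map each continuous latent vector to the index of
# its nearest codebook entry. All names and values here are hypothetical.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(latents, codebook):
    """Return, for each latent vector, the index of the closest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]

codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]  # hypothetical learned codebook
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 0.7)]  # hypothetical encoder outputs
codes = quantize(latents, codebook)               # one discrete code per latent frame
print(codes)
```

In a real system the codebook is learned and much larger. Roughly speaking, a separate prior model is trained over the resulting code sequences, and "sampling from the prior" means drawing a new code sequence from it and decoding that back to audio.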
This is very helpful, appreciate it 👍 @sharvil
Just wondering if this is possible. If so, how large should the model be?