lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Apache License 2.0

Long sentences #8

Closed by alexdemartos 2 years ago

alexdemartos commented 3 years ago

Hi,

The model seems to be working fairly well. I tested it after just 100K steps on a 100-speaker, 24 kHz dataset and it already sounds reasonably good, though I guess it needs more training to reach higher quality.

I just tested it on some random sentences and noticed that the GPU runs out of memory on long ones. What would be the best approach to synthesizing long sentences? The baseline would be to split the mel spectrogram into parts and synthesize them separately, but I am not sure whether that is the only way to go.

Thank you for your help!

PS: I'll report some results after 1M steps.

sharvil commented 3 years ago

Splitting and synthesizing will work, though you may experience boundary effects near the split. You may want to split near silence.

The seamless way to do this is to modify the predict code to send in the audio and the spectrogram in chunks, where each chunk overlaps its neighbours by the receptive field of the WaveNet decoder.
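The index arithmetic for overlapped chunks could look something like the sketch below. It covers only the bookkeeping, not the actual change to the predict code: `overlapped_chunks`, `stitch`, `overlap_frames`, and `hop_samples` are hypothetical names. It assumes the receptive field has already been converted from samples to spectrogram frames (rounding up), and that each frame maps to `hop_samples` audio samples. How the per-step audio is shared between chunks is left out.

```python
def overlapped_chunks(n_frames, chunk_frames, overlap_frames):
    """Return (start, end, keep_start, keep_end) frame ranges (hypothetical helper).

    [start, end) is the padded window to feed to the model;
    [keep_start, keep_end) is the part of its output worth keeping.
    """
    if chunk_frames < 1 or overlap_frames < 0:
        raise ValueError("chunk_frames must be >= 1 and overlap_frames >= 0")
    chunks = []
    for keep_start in range(0, n_frames, chunk_frames):
        keep_end = min(keep_start + chunk_frames, n_frames)
        # Extend the window on both sides by the (assumed) receptive field.
        start = max(keep_start - overlap_frames, 0)
        end = min(keep_end + overlap_frames, n_frames)
        chunks.append((start, end, keep_start, keep_end))
    return chunks


def stitch(outputs, chunks, hop_samples):
    """Concatenate per-chunk sample lists, dropping the overlapped context."""
    audio = []
    for samples, (start, end, keep_start, keep_end) in zip(outputs, chunks):
        lo = (keep_start - start) * hop_samples
        hi = (keep_end - start) * hop_samples
        audio.extend(samples[lo:hi])
    return audio


# Toy check with a fake "model" whose output is just the sample indices.
hop = 3
chunks = overlapped_chunks(n_frames=10, chunk_frames=4, overlap_frames=2)
outputs = [list(range(s * hop, e * hop)) for s, e, _, _ in chunks]
audio = stitch(outputs, chunks, hop)
```

Because every kept region is surrounded by at least `overlap_frames` of real context (except at the start and end of the utterance), the stitched output has exactly `n_frames * hop_samples` samples with no duplicated or missing region.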

Out of curiosity, how long are your utterances?