rhasspy / larynx

End-to-end text-to-speech system using gruut and onnx
MIT License

hifi_gan-vctk_small vs hifi_gan-vctk_medium (release 2021-03-28) #10

Closed (svenha closed 2 years ago)

svenha commented 3 years ago

The naming confuses me a little bit. hifi_gan-vctk_small is larger (and slower) than hifi_gan-vctk_medium.

synesthesiam commented 3 years ago

I wondered about this as well, but the labeling of the pre-trained models in the original repo has the "medium" one as vctk_v2 and the "small" one as vctk_v3. Based on my understanding of the config files, v2 should be larger/slower than v3.

To make it extra confusing, the small/v3 model uses a different "resblock" but more upscale channels than medium/v2, which uses a similar configuration to the universal_large/v1 model.

I may just flip the medium/small labels, though, if there is an obvious performance difference between the two. I've focused all my testing on large vs. small to date.

fquirin commented 3 years ago

So I've tested medium and small across a larger number of voices, with short and long sentences, and small was either equal in speed or even slower (within the error bars, I guess).
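A comparison like the one described above could be sketched as a small timing harness. This is only a minimal sketch, not the procedure anyone in this thread actually used: `time_synthesis`, `summarize`, and `differs_beyond_noise` are hypothetical helper names, and the `synthesize` callable is a stand-in for whatever function runs text through a given vocoder (no Larynx API is assumed here). The "error bars" check is a crude rule of thumb, not a statistical test.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def time_synthesis(
    synthesize: Callable[[str], object],
    sentences: Sequence[str],
    repeats: int = 5,
) -> List[float]:
    """Return one wall-clock total (in seconds) per repeat over all sentences."""
    totals = []
    for _ in range(repeats):
        start = time.perf_counter()
        for sentence in sentences:
            synthesize(sentence)  # hypothetical: runs one vocoder configuration
        totals.append(time.perf_counter() - start)
    return totals


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of the timing samples."""
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {"mean": statistics.mean(samples), "stdev": stdev}


def differs_beyond_noise(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """Crude check: are the means further apart than the combined spread?"""
    return abs(a["mean"] - b["mean"]) > (a["stdev"] + b["stdev"])


if __name__ == "__main__":
    # Stand-in workloads; replace with real calls for the two vocoders.
    def fake_medium(text: str) -> None:
        time.sleep(0.001 * len(text))

    def fake_small(text: str) -> None:
        time.sleep(0.001 * len(text))

    sentences = ["Short one.", "A somewhat longer sentence to synthesize."]
    medium = summarize(time_synthesis(fake_medium, sentences))
    small = summarize(time_synthesis(fake_small, sentences))
    print("medium:", medium)
    print("small: ", small)
    print("clear difference:", differs_beyond_noise(medium, small))
```

Running each model over the same mix of short and long sentences several times, and only calling one faster when the gap exceeds the run-to-run spread, avoids reading too much into a single measurement.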

synesthesiam commented 2 years ago

I ended up swapping the medium/low vocoder labels in v0.5.