Open jishengpeng opened 1 month ago
Oh!!! By the way, what's the difference between speech and music-audio? Does music-audio support speech? Also, how do the models listed at WavTokenizer available models correspond to this?
We train the WavTokenizer-Medium variants on data from different domains. For example, the music-audio version is trained solely on AudioSet (~1500 hours) and music data, so it does not support speech. In contrast, WavTokenizer-Large will use a single unified model to support speech, music, and audio simultaneously.
Thanks for your work! Could you also update the paper with the Medium results? Compared to SpeechTokenizer, the out-of-domain results of the Small version are not that good.
In out-of-domain scenarios, the WavTokenizer-Medium-Speech version improves on the WavTokenizer-Small version (LJSpeech), with a 0.6 increase in UTMOS, a 0.8 increase in PESQ, and a 0.06 increase in STOI. Experiments with WavTokenizer-Medium on various languages have also shown promising generalization, which suggests it may transfer well across languages. Let's look forward to WavTokenizer-Large.
https://huggingface.co/collections/novateur/wavtokenizer-medium-large-66de94b6fd7d68a2933e4fc0