jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License

WavTokenizer-medium was released on 2024.09.09 #23

Open jishengpeng opened 1 month ago

jishengpeng commented 1 month ago

https://huggingface.co/collections/novateur/wavtokenizer-medium-large-66de94b6fd7d68a2933e4fc0
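For anyone who wants to try the new checkpoints, here is a minimal loading sketch in the spirit of the repository README (using the `WavTokenizer.from_pretrained0802` / `encode_infer` / `decode` entry points it documents). The repo id, config path, and checkpoint filename below are placeholders; substitute the actual files you pick from the collection above.

```python
import torch
import torchaudio
from huggingface_hub import hf_hub_download

from encoder.utils import convert_audio       # from this repository
from decoder.pretrained import WavTokenizer   # from this repository

# Placeholder config/checkpoint names -- replace with the files you download
# from the WavTokenizer-medium/large collection.
config_path = "./configs/wavtokenizer_medium_speech.yaml"
model_path = hf_hub_download(
    repo_id="novateur/WavTokenizer-medium-speech",   # hypothetical repo id
    filename="wavtokenizer_medium_speech.ckpt",      # hypothetical filename
)

device = torch.device("cpu")
wavtokenizer = WavTokenizer.from_pretrained0802(config_path, model_path).to(device)

# Encode a 24 kHz mono waveform into discrete tokens, then reconstruct it.
wav, sr = torchaudio.load("input.wav")
wav = convert_audio(wav, sr, 24000, 1).to(device)
bandwidth_id = torch.tensor([0], device=device)
features, discrete_codes = wavtokenizer.encode_infer(wav, bandwidth_id=bandwidth_id)
audio_out = wavtokenizer.decode(features, bandwidth_id=bandwidth_id)
torchaudio.save("reconstructed.wav", audio_out.cpu(), sample_rate=24000,
                encoding="PCM_S", bits_per_sample=16)
```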

zsLin177 commented 1 month ago

Oh!!! By the way, what's the difference between the speech and music-audio versions? Does the music-audio version support speech? Also, how do these correspond to the models listed under WavTokenizer's available models?

jishengpeng commented 1 month ago

Oh!!! By the way, what's the difference between the speech and music-audio versions? Does the music-audio version support speech? Also, how do these correspond to the models listed under WavTokenizer's available models?

We train WavTokenizer-Medium using training data from different domains. For example, the music-audio version is trained solely on AudioSet (~1,500 hours) and music data, which precludes support for speech. Conversely, WavTokenizer-Large will leverage a unified model to support speech, music, and audio simultaneously.

didadida-r commented 1 month ago

!! Thanks for your work, and could you also update the medium results in the paper? Because compared to SpeechTokenizer, the out-of-domain results of the small version are not that good.

jishengpeng commented 1 month ago

!! Thanks for your work, and could you also update the medium results in the paper? Because compared to SpeechTokenizer, the out-of-domain results of the small version are not that good.

In out-of-domain scenarios, the WavTokenizer-Medium-Speech version demonstrates improvements over the WavTokenizer-Small version (LJSpeech), with a 0.6 increase in UTMOS, a 0.8 increase in PESQ, and a 0.06 increase in STOI. Furthermore, experiments using WavTokenizer-Medium on various languages have shown promising generalization, suggesting its potential for effective deployment across diverse linguistic contexts. Let's look forward to WavTokenizer-Large.