Open saveriyo opened 2 months ago
I've attempted to use WavTokenizer as the encoder in a speech separation pipeline and am seeing worse initial results relative to DAC. Is this expected, given WavTokenizer's higher compression and its optimization for speech-based tasks?
Could you kindly provide more details regarding the experimental setup? Specifically, which version of the WavTokenizer did you use—small or medium? Additionally, how many quantizers were employed in DAC, and could you share more specifics about the implementation details of the speech separation, particularly in relation to the dataset used? These details will help us better evaluate your approach.
Furthermore, I have noticed that you have submitted several pull requests. Please be patient, as we will review and incorporate them in our upcoming updates.
Thank you for your cooperation.
Best regards, Shengpeng.
Hi Shengpeng, thank you for the quick response and for considering my PR. As a quick experiment, I've been trying out WavTokenizer in Codecformer (https://arxiv.org/abs/2406.12434) as a replacement for DAC, acknowledging that speech separation isn't necessarily an intended use case. After running a few trials, I keep seeing the SI-SNR of separated speech plateau around -12 dB with WavTokenizer, whereas DAC reaches 6+ dB.
Please see my rough draft PR here if you're interested in the Speechbrain training implementation (credit to @Yip-Jia-Qi for codecformer): https://github.com/Yip-Jia-Qi/codecformer/pull/2/files
While it's likely there is a flaw or bug in the implementation above, I'm wondering if inherently WavTokenizer would be expected to perform worse than DAC on a task like speech separation.
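For readers comparing against the SI-SNR numbers above, here is a minimal sketch of how scale-invariant SNR is typically computed (the function name and array conventions are mine, not from Codecformer or Speechbrain; `estimate` and `target` are single-channel waveforms):

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; any gain applied to the
    # estimate cancels out, which is what makes the metric scale-invariant.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))
```

A plateau around -12 dB means the projected target component is far weaker than the residual, i.e. the separated output is mostly noise relative to the reference.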
Hello, in my opinion, the 75-token version of WavTokenizer should be at least comparable to the 900-token DAC version in the speech separation task. I am unsure whether you are using the WavTokenizer-small or WavTokenizer-medium version. If you are using the WavTokenizer-small version, the result is reasonable, as WavTokenizer exhibits minimal generalization ability. Please also stay tuned for the release of our WavTokenizer-large version, which we plan to open-source. ❤
Ah ok thanks, I was using WavTokenizer-small. I will try again using medium and see if there are better results!
Hi @saveriyo, did you see any improvements with the large model? I need to implement a pipeline for radio program separation and was curious about your speech separation results.
I managed to run WavTokenizer with Codecformer (I can't remember whether small or medium), and I got slightly lower performance than DAC on PESQ and STOI, while WavTokenizer outperforms DAC on DNSMOS-OVRL. Definitely worth trying out, although I would train on an embedding loss (i.e., during training, skip the decoder, since it is frozen anyway, and just train on MSE loss over the embeddings); it's computationally more efficient.
Thanks!