jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License

Questions about more detailed experimental results #2

Open hbwu-ntu opened 3 months ago

hbwu-ntu commented 3 months ago

Hi @jishengpeng, thank you for the amazing work. May I ask several questions:

  1. What are the results for the large and medium models? Currently the paper only reports small-model results.
  2. Do you have an ablation study showing the performance gain from incorporating the attention block?
  3. Do you have an ablation study showing the performance gain from changing the decoder to a Vocos-like one?
  4. Will you compare your codec model with Single-Codec or Ti-Codec? Comparing with Single-Codec is hard since it is not open-source, but Ti-Codec is open-source; will you include it in the comparison?
  5. Will you consider human evaluation, given that the current trends between UTMOS and PESQ (STOI) are not consistent? UTMOS is a proxy for human listening, just like DNSMOS, but such proxies are not accurate enough. PESQ and STOI are likewise good proxies for human listening.
jishengpeng commented 3 months ago

Thank you very much for your interest!

  1. The experiments for the medium and large versions have not been completed due to resource constraints; we are still training them. However, based on the current results, the medium and large versions appear to generalize significantly better in codec reconstruction.
  2. We conducted numerous ablation studies on the attention blocks, adding them at various positions in the encoder and decoder; at certain positions they had negative effects. The placement presented in the paper has proven beneficial across hundreds of test samples, but we have yet to rigorously validate it on thousands, so we will include additional experiments in future versions. (See the illustrative attention-block sketch after this list.)
  3. Regarding the Vocos decoder, we attempted to replace it with an inverted upsampling structure, but the results were poor. As with the previous point, we have established the effectiveness of Vocos on hundreds of samples and plan to supplement our findings with stricter experiments on thousands of test samples. (A sketch of a Vocos-style iSTFT head also follows below.)
  4. According to the results presented in the paper, the UTMOS score for Single-Codec is only 3.0, which is why we did not perform a comparison. The two-encoder approach of Ti-Codec is not particularly elegant, and its performance appears inferior to DAC; we may supplement the Ti-Codec results in a later version.
  5. After listening to a large number of samples, we found that UTMOS results are closer to human auditory perception. PESQ and STOI are good metrics, but they are not sensitive to certain noise artifacts, so it would be ideal to include a subjective metric as well. Although we did not conduct a crowdsourced evaluation, based on my personal listening the WavTokenizer results are satisfactory. (A minimal scoring sketch for these objective metrics follows the code sketches below.)
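For readers wondering what such an attention block can look like, below is a minimal PyTorch sketch of a residual self-attention layer dropped between convolutional stages. The module name, head count, and residual wiring are illustrative assumptions, not WavTokenizer's actual implementation.

```python
import torch
import torch.nn as nn

class FrameSelfAttention(nn.Module):
    """Hypothetical residual self-attention block over codec frame features.

    Operates on (batch, channels, frames) tensors, the usual layout between
    convolutional encoder/decoder stages. Not WavTokenizer's exact module.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # channels must be divisible by num_heads
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, T) -> (B, T, C): attend over the frame (time) axis
        seq = x.transpose(1, 2)
        normed = self.norm(seq)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        # Residual connection keeps the block safe to insert into a conv stack
        return (seq + attn_out).transpose(1, 2)
```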
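On point 3, a Vocos-style decoder keeps the network at frame rate and recovers the waveform with an inverse-STFT head instead of transposed-convolution upsampling. The sketch below illustrates that idea with placeholder `n_fft`/`hop` values; it is not WavTokenizer's configuration.

```python
import torch
import torch.nn as nn

class ISTFTHead(nn.Module):
    """Sketch of a Vocos-style head: predict an STFT, invert it with iSTFT.

    Hyperparameters are illustrative placeholders, not WavTokenizer's.
    """

    def __init__(self, dim: int, n_fft: int = 1024, hop: int = 256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # Project backbone features to magnitude + phase bins (n_fft//2 + 1 each)
        self.proj = nn.Linear(dim, n_fft + 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) features at frame rate
        mag, phase = self.proj(x).chunk(2, dim=-1)
        mag = torch.exp(mag).clamp(max=1e2)   # positive, bounded magnitudes
        spec = mag * torch.exp(1j * phase)    # complex spectrogram
        spec = spec.transpose(1, 2)           # (B, n_fft//2 + 1, T)
        window = torch.hann_window(self.n_fft, device=x.device)
        return torch.istft(spec, self.n_fft, hop_length=self.hop,
                           win_length=self.n_fft, window=window)
```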
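Finally, on point 5, PESQ and STOI can be computed with the open `pesq` and `pystoi` packages; a minimal helper is sketched below. UTMOS is a learned MOS predictor typically run from its released checkpoint, so it is omitted here.

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def objective_scores(ref: np.ndarray, deg: np.ndarray, sr: int = 16000) -> dict:
    """Score a reference/degraded waveform pair with intrusive proxy metrics.

    Both metrics approximate human listening; as discussed above, neither is
    fully sensitive to every codec artifact.
    """
    return {
        "pesq_wb": pesq(sr, ref, deg, "wb"),         # wideband PESQ needs 16 kHz
        "stoi": stoi(ref, deg, sr, extended=False),  # intelligibility in [0, 1]
    }
```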

Best regards.

hbwu-ntu commented 3 months ago

Thank you for the answers! Looking forward to seeing more numbers in the upcoming arXiv version.