OnceJune opened this issue 2 months ago
Hi, this is an amazing work that combines understanding and generation into one single tokenizer. Have you guys tried lower bandwidth, e.g. less than 20 or even 15 tokens per second?
Using 40 tokens per second to represent speech at a 24 kHz sampling rate is roughly equivalent to using 26 tokens per second for audio at a 16 kHz sampling rate. We have experimented with reconstructing 24 kHz audio using 25 tokens per second. While reconstruction is possible, the perceived quality may not be entirely satisfactory to the human ear (though it might be acceptable to some). Our goal is to achieve high-quality reconstruction with the minimal number of tokens.
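The equivalence stated above can be checked with a quick back-of-the-envelope calculation. This is just a sketch of the proportional-scaling assumption (tokens per second scale linearly with the sampling rate for a fixed downsampling factor); the function name is illustrative, not from the repository:

```python
# Sketch: rescale a token rate from one audio sampling rate to another,
# assuming tokens/sec scale linearly with the sampling rate.
def equivalent_token_rate(tokens_per_sec: float, src_sr: int, dst_sr: int) -> float:
    """Token rate at dst_sr that matches the compression of tokens_per_sec at src_sr."""
    return tokens_per_sec * dst_sr / src_sr

# 40 tokens/s on 24 kHz audio corresponds to ~26.7 tokens/s on 16 kHz audio.
print(round(equivalent_token_rate(40, 24_000, 16_000), 1))  # 26.7
```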
Thank you for the answer.
@jishengpeng "We have experimented with reconstructing 24 kHz audio using 25 tokens per second. While reconstruction is possible, the perceived quality may not be entirely satisfactory to the human ear (though it might be acceptable to some)." Is there any possibility that you could share the model checkpoint for this configuration, or some reconstructed audio samples?