OnceJune opened this issue 2 months ago
Hi, this is an amazing work that combines understanding and generation into one single tokenizer. Have you guys tried lower bandwidth, e.g. less than 20 or even 15 tokens per second?
Using 40 tokens per second to represent speech at a 24 kHz sampling rate is roughly equivalent to using 26 tokens per second for audio at a 16 kHz sampling rate. We have experimented with reconstructing 24 kHz audio using 25 tokens per second. While reconstruction is possible, the perceived quality may not be entirely satisfactory to the human ear (though it might be acceptable to some). Our goal is to achieve high-quality reconstruction with the minimal number of tokens.
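The equivalence stated above can be checked with a quick back-of-the-envelope calculation. This is just a sketch of the proportional-scaling assumption (tokens per second scale linearly with the sampling rate for a fixed downsampling factor); the function name is illustrative, not from the repository:

```python
# Sketch: rescale a token rate from one audio sampling rate to another,
# assuming tokens/sec scale linearly with the sampling rate.
def equivalent_token_rate(tokens_per_sec: float, src_sr: int, dst_sr: int) -> float:
    """Token rate at dst_sr that matches the compression of tokens_per_sec at src_sr."""
    return tokens_per_sec * dst_sr / src_sr

# 40 tokens/s on 24 kHz audio corresponds to ~26.7 tokens/s on 16 kHz audio.
print(round(equivalent_token_rate(40, 24_000, 16_000), 1))  # 26.7
```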
Thank you for the answer.
@jishengpeng "We have experimented with reconstructing 24 kHz audio using 25 tokens per second. While reconstruction is possible, the perceived quality may not be entirely satisfactory to the human ear (though it might be acceptable to some)." Is there any possibility that you could share the model checkpoint for this configuration, or some reconstructed audio samples?