jishengpeng / Languagecodec

Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models
MIT License

Another bandwidth #4

Open · KVandray opened this issue 1 month ago

KVandray commented 1 month ago

Hi!

First of all, thank you for your incredible work! I'm now testing your checkpoint for encoding Russian speech (24 kHz), and it sounds very good after reconstruction. However, I've found that your codec uses only one bandwidth from the EnCodec.feature_extractor class (6.6 kbps, i.e. 8 codebooks in the resulting vector), whereas in your paper you compared not only 6.0 kbps but also 3.0 kbps (4 codebooks). Will you release a checkpoint/code for inference at this bandwidth? Thanks for your answer.

jishengpeng commented 1 month ago


You can directly use our open-source checkpoint for inference with 4 codebooks. Specifically, set n_q=4 at line 134 of the vq.py file located in Languagecodec/languagecodec_encoder/quantization.
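For reference, here is a minimal sketch of what that edit amounts to at inference time. It assumes an EnCodec-style residual VQ exposed as `model.quantizer` with an `n_q` attribute; the attribute name matches vq.py, but `model.encode` and the rest of the interface are illustrative assumptions, not necessarily the exact Languagecodec API:

```python
import torch

def encode_with_4_codebooks(model, wav: torch.Tensor):
    # Equivalent to setting n_q=4 at line 134 of vq.py: only the first
    # four residual quantizer stages are applied, i.e. roughly 3.0 kbps
    # instead of the default 8-codebook bandwidth.
    model.quantizer.n_q = 4
    with torch.no_grad():
        codes = model.encode(wav)  # expected shape: (batch, 4, frames)
    return codes
```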

KVandray commented 1 month ago

Yes, I tested different numbers of codebooks, but implemented it differently, through the codec_to_features() method. I suppose that's the same as your suggestion? And after training LanguageCodec with 8 codebooks, did you simply take the first 4 codebooks for the metrics in your paper?

P.S.: An interesting thing is that speech only emerges when reconstructing from the third codebook onward, probably because of this: "Our objective is to include less information in the first channel of the codebook while increasing the missing information on limited channels". So if we're training an AR zero-shot TTS like VALL-E, we have to rely on loss metrics alone, since we cannot reconstruct speech from the first codebook only.

jishengpeng commented 1 month ago


1. Yes, your implementation approach is correct, and our four-layer evaluation in the paper was indeed done as you inferred.

2. Note that during training we do not rely solely on all eight layers for reconstruction: we randomly select more than three codebook layers to reconstruct from. This part of the code can be found between lines 100 and 105 of the same vq.py file; a sketch is given below. Consequently, voice reconstruction is only possible with more than three layers. Additionally, if you combine VALL-E with LanguageCodec, you will need to retrain the AR and NAR models to achieve optimal results.
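To make both points concrete, here is a minimal sketch under the same assumptions as above (an EnCodec-style `model.encode`/`model.decode` pair operating on codes of shape `(batch, n_layers, frames)`); the helper names and the exact sampling scheme used in lines 100-105 of vq.py are assumptions, not the repo's actual code:

```python
import random
import torch

@torch.no_grad()
def reconstruct_from_first_k(model, wav: torch.Tensor, k: int = 4):
    """Point 1: the paper's 4-codebook evaluation, done by encoding with
    all 8 codebooks and keeping only the first k residual layers."""
    codes = model.encode(wav)             # (batch, 8, frames)
    return model.decode(codes[:, :k, :])  # lower-bandwidth reconstruction

def sample_active_layers(max_n_q: int = 8, min_n_q: int = 4) -> int:
    """Point 2: during training, the number of residual layers used for
    reconstruction is drawn at random but never drops below four, which
    is why audio only becomes intelligible once more than three
    codebooks are available."""
    return random.randint(min_n_q, max_n_q)
```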

KVandray commented 1 month ago

Well, since I have VALL-E + EnCodec (8 codebooks) pretrained, I'm going to train the AR+NAR models with LanguageCodec, although reconstruction will only be possible once the NAR training is completed. Thanks for your answers, I will try this approach.