hhguo / SoCodec

Ultra-low-bitrate Speech Codec for Speech Language Modeling Applications
MIT License

Questions about details #1

Closed hbwu-ntu closed 3 weeks ago

hbwu-ntu commented 1 month ago

Hi @hhguo, thank you for the amazing work. May I ask several questions:

  1. Will you be releasing the code for both codec training and language model (LM) training?
  2. PQ is mentioned without a definition. What is the difference between PQ and OPQ?
  3. How is SoCodec-120ms implemented? Do you average the 6 HuBERT embeddings and then apply OPQ?
  4. What are the training resources for codec and TTS-LM?
  5. There is a typo in Section 6.1. It should refer to Table 1, not Fig. 1.
hhguo commented 4 weeks ago

Hi Haibin,

Thanks for your attention.

  1. I am training models within a complicated internal training codebase. I will try to extract the related modules for open-sourcing, but it may take more time.
  2. PQ is "product quantization", OPQ is the proposed "ordered product quantization".
  3. I use the last hidden state of HuBERT as the input to the codec model; the VQ layer is implemented with OPQ.
  4. The codec and the TTS-LM are both trained on WenetSpeech4TTS.
  5. Thanks for your comment! It will be corrected in the updated arXiv paper.
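For readers unfamiliar with the baseline being contrasted here: standard product quantization (PQ) splits a vector into sub-vectors and quantizes each with its own codebook. The sketch below is a minimal, self-contained illustration of plain PQ only (random codebooks, hypothetical `pq_encode`/`pq_decode` helpers); it is not the paper's ordered product quantization (OPQ), which adds an ordering constraint across the groups as described in the paper.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Quantize each contiguous sub-vector of x with its own codebook.

    codebooks: list of (K, d_sub) arrays, one per group.
    Returns one codeword index per group.
    """
    d_sub = codebooks[0].shape[1]
    codes = []
    for g, cb in enumerate(codebooks):
        sub = x[g * d_sub:(g + 1) * d_sub]
        # nearest codeword by squared Euclidean distance
        codes.append(int(np.argmin(((cb - sub) ** 2).sum(axis=1))))
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct the vector by concatenating the selected codewords."""
    return np.concatenate([cb[c] for cb, c in zip(codebooks, codes)])

# Toy example with random codebooks (illustrative sizes, not the paper's).
rng = np.random.default_rng(0)
D, G, K = 12, 3, 8  # vector dim, number of groups, codewords per group
codebooks = [rng.normal(size=(K, D // G)) for _ in range(G)]
x = rng.normal(size=D)
codes = pq_encode(x, codebooks)
x_hat = pq_decode(codes, codebooks)
```

In this framing, each frame costs `G * log2(K)` bits; OPQ then structures the `G` codes so that earlier codes carry coarser information, per the paper's description.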