jishengpeng / Languagecodec

Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models
MIT License
210 stars 16 forks

questions about Mask Channel Residual Vector Quantization #2

Closed 0nutation closed 9 months ago

0nutation commented 9 months ago

Thank you for your work. I'm sorry, but I couldn't find the implementation code for Mask Channel Residual Vector Quantization, and I'm not sure why you named it Language-Codec, because I didn't see any language-related loss or other language-specific designs.

jishengpeng commented 9 months ago

The core code related to the MCRVQ mechanism can be found in the LanguageVectorQuantization class in the encodec/quantization/core_vq.py file and the ResidualVectorQuantizer class in the encodec/quantization/vq.py file.

jishengpeng commented 9 months ago

The term "Language-Codec" can be understood, at the code level, as a combination of the "encodec" component and the "vocos" component, along with MCRVQ. The logic pertaining to RVQ (Residual Vector Quantization) is implemented in the "languagecodec/encodec/quantization" pathway. The choice of naming our paper "Language-Codec" stems from our objective of minimizing the disparities between discrete codecs and speech language models.

0nutation commented 9 months ago

From the code, it seems that the improvement of MCRVQ is to divide the first layer of the original RVQ into three VQ layers in terms of the time dimension, right?

0nutation commented 9 months ago

Besides the training data size factor, it appears that the way to minimize the disparities between discrete codecs and speech language models is by dispersing the information within the original RVQ-1 token?

jishengpeng commented 9 months ago

Q: "Besides the training data size factor, it appears that the way to minimize the disparities between discrete codecs and speech language models is by dispersing the information within the original RVQ-1 token?"
A: Yes.

Q: "From the code, it seems that the improvement of MCRVQ is to divide the first layer of the original RVQ into three VQ layers in terms of the time dimension, right?"
A: Not the time dimension, mainly the channel dimension.
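To make the "channel dimension" answer concrete, here is a minimal numpy sketch of quantizing a first RVQ layer by splitting the channel dimension into groups, each with its own codebook. This is illustrative only, not the repository's actual MCRVQ code: the function names, the nearest-neighbor lookup, and the three-way split are all assumptions for the sake of the example.

```python
import numpy as np

def nearest_code(x, codebook):
    """x: [n, d] vectors, codebook: [K, d]; return the index of the
    nearest codebook entry for each vector (squared L2 distance)."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def channel_grouped_vq(z, codebooks):
    """Toy first-layer quantization that splits the channel dimension
    into len(codebooks) groups, one codebook per group.
    z: [batch, channel_dim, time]
    each codebook: [K, channel_dim // len(codebooks)]"""
    b, c, t = z.shape
    g = len(codebooks)
    gc = c // g  # channels per group
    out = np.empty_like(z)
    for i, cb in enumerate(codebooks):
        part = z[:, i * gc:(i + 1) * gc, :]              # [b, gc, t]
        flat = part.transpose(0, 2, 1).reshape(-1, gc)   # [b*t, gc] vectors
        idx = nearest_code(flat, cb)                     # [b*t] code indices
        quant = cb[idx].reshape(b, t, gc).transpose(0, 2, 1)
        out[:, i * gc:(i + 1) * gc, :] = quant
    return out
```

Each channel group is quantized independently here, which is the grouped-VQ reading discussed later in the thread; whether Language-Codec's MCRVQ does exactly this is what the follow-up question asks.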

hbwu-ntu commented 3 months ago

Hi, thank you for the nice work. I would like to ask a follow-up question regarding this part of the discussion:

> Q: "From the code, it seems that the improvement of MCRVQ is to divide the first layer of the original RVQ into three VQ layers in terms of the time dimension, right?"
> A: Not time dimension, mainly in the channel dimension.

From my understanding, the input to the RVQ layer has shape [batch, time, channel_dim], as shown in https://github.com/jishengpeng/Languagecodec/blob/main/languagecodec_encoder/quantization/core_vq.py#L379. Why do you call the second dimension the channel dimension rather than the time dimension? Or is the input of shape [batch, channel_dim, time]?

If the second dimension is channel_dim, another follow-up question: is the function of the first three VQ layers similar to that of a single three-group VQ, as in https://arxiv.org/abs/2201.09429 (divide the channels into three groups, and each group goes into its own VQ layer)?

jishengpeng commented 3 months ago

The input has shape [batch, channel_dim, time]. As for MCRVQ, we will summarize it in a new issue.
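The apparent contradiction with the linked line can be resolved by the internal transpose many VQ implementations perform. A small sketch of the shape convention (the transpose and variable names here are illustrative assumptions, not the repository's exact code): the quantizer receives [batch, channel_dim, time] and a typical encodec-style implementation rearranges to [batch, time, channel_dim] before the codebook lookup, so an inner function can appear time-major.

```python
import numpy as np

# Layer input follows the [batch, channel_dim, time] convention.
z = np.random.randn(2, 128, 50)

# encodec-style VQ code typically rearranges "b d n -> b n d" before
# the codebook lookup, so inner code sees [batch, time, channel_dim]
# even though the layer-level input is channel-major.
z_inner = z.transpose(0, 2, 1)

print(z.shape, z_inner.shape)  # (2, 128, 50) (2, 50, 128)
```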