Closed: 0nutation closed this issue 9 months ago.
The core code related to the MCRVQ mechanism can be found in the LanguageVectorQuantization class in encodec/quantization/core_vq.py and the ResidualVectorQuantizer class in encodec/quantization/vq.py.
At the code level, "Language-Codec" can be understood as a combination of the "encodec" component and the "vocos" component, together with MCRVQ. The logic for RVQ (Residual Vector Quantization) is implemented under the "languagecodec/encodec/quantization" path. We named the paper "Language-Codec" because our objective is to minimize the disparities between discrete codecs and speech language models.
From the code, it seems that the improvement of MCRVQ is to divide the first layer of the original RVQ into three VQ layers in terms of the time dimension, right?
Besides the training data size factor, it appears that the way to minimize the disparities between discrete codecs and speech language models is by dispersing the information within the original RVQ-1 token?
Q: "Besides the training data size factor, it appears that the way to minimize the disparities between discrete codecs and speech language models is by dispersing the information within the original RVQ-1 token?"
A: Yes
Q:"From the code, it seems that the improvement of MCRVQ is to divide the first layer of the original RVQ into three VQ layers in terms of the time dimension, right?" A: Not time dimension, mainly in the channel dimension.
Hi, thank you for the nice work. I would like to ask a follow-up question regarding this discussion: Q: "From the code, it seems that the improvement of MCRVQ is to divide the first layer of the original RVQ into three VQ layers in terms of the time dimension, right?" A: Not the time dimension; mainly the channel dimension.
From my understanding, the input to the RVQ layer has shape [batch, time, channel_dim], as shown in https://github.com/jishengpeng/Languagecodec/blob/main/languagecodec_encoder/quantization/core_vq.py#L379. Why do you call the second dimension the channel dimension rather than the time dimension? Or is the input of shape [batch, channel_dim, time]?
If the second dimension is channel_dim, a further follow-up question is whether the first three VQ layers function similarly to a single three-group VQ as in https://arxiv.org/abs/2201.09429 (the channels are divided into three groups, and each group goes into its own VQ layer)?
The input has shape [batch, channel_dim, time]. As for MCRVQ, we summarize it in a new issue.
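For anyone following along, a tiny illustration of that shape convention (assumed, based on the Encodec-style quantizers the repository builds on; the tensor sizes are arbitrary): the quantizer receives [batch, channel_dim, time], and each time step contributes one channel_dim-dimensional vector to the codebook lookup after rearranging.

```python
# Assumed shape convention, not the repository's exact code.
import torch

x = torch.randn(4, 128, 75)    # [batch, channel_dim, time]
frames = x.permute(0, 2, 1)    # [batch, time, channel_dim]
print(frames.shape)            # torch.Size([4, 75, 128]): one 128-dim vector per frame
```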
Thank you for your work. I'm sorry, but I couldn't find the implementation code for Mask Channel Residual Vector Quantization, and I'm not sure why you named it Language-Codec, since I didn't see any language-related loss or other language-related designs.