mct10 / RepCodec

Models and code for RepCodec: A Speech Representation Codec for Speech Tokenization

Positional embeddings when training Whisper quantization #3

Closed: jpc closed this issue 12 months ago

jpc commented 1 year ago

Hi,

I did a similar thing in my WhisperSpeech TTS project, but I was able to get good results without residual quantization once I re-added positional encodings after the quantization bottleneck. The reasoning is that the trained Whisper model needs the positional information, but because it is the same for all samples, it does not need to pass through the bottleneck and can easily be regenerated afterwards.

I glanced at your code and did not see any references to positional encodings. Maybe adding them back before the decoder blocks could improve the performance of your Whisper-derived models?
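To make the idea concrete, here is a minimal sketch of re-adding positional embeddings after a VQ bottleneck. All names (`PostQuantDecoder`, `decoder_blocks`, `max_len`) are hypothetical and not taken from the RepCodec or WhisperSpeech code; it only illustrates that the positions are regenerated on the decoder side instead of being carried by the codes.

```python
import torch
import torch.nn as nn

class PostQuantDecoder(nn.Module):
    """Hypothetical wrapper: regenerate positional information after the VQ bottleneck."""

    def __init__(self, dim: int, max_len: int, decoder_blocks: nn.Module):
        super().__init__()
        # Learned positional embeddings, added back after quantization.
        self.pos_emb = nn.Embedding(max_len, dim)
        self.decoder_blocks = decoder_blocks

    def forward(self, quantized: torch.Tensor) -> torch.Tensor:
        # quantized: (batch, time, dim), the output of the VQ bottleneck.
        t = quantized.size(1)
        positions = torch.arange(t, device=quantized.device)
        # Positions are identical for every sample, so they never need to be
        # encoded in the codebook; just add them back before decoding.
        x = quantized + self.pos_emb(positions)[None, :, :]
        return self.decoder_blocks(x)
```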

HuangZhiChao95 commented 1 year ago

Thanks for your advice. I have read the timeline in your WhisperSpeech TTS project.

We have the same finding as you: k-means on the Whisper representations results in poor performance. In addition, we find that while RVQ improves performance, a single VQ layer is enough for RepCodec to perform well (the reported results mainly use a single VQ layer).

About the positional embeddings: are you saying that you don't pass positional information to the encoder, but instead re-add it after the last layer of the encoder? Does such an operation still give good performance when Whisper is frozen?

jpc commented 12 months ago

Hey, sorry I missed your reply...

I am passing the normal positional encodings to the frozen encoder, and in addition I am adding some learned embeddings after the VQ bottleneck and before my trainable adapter layer (a single residual attention block with an MLP).

Without it I could not get acceptable results. My theory is that the decoder needs positional information, but preserving it through the VQ bottleneck would require each quantized value to carry both its semantic content and its position.
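Roughly, the setup described above could look like the sketch below. This is only an assumed reconstruction: `whisper_encoder` stands in for a frozen Whisper encoder (which keeps its usual positional encodings), `vq` for any vector-quantization bottleneck, and the return signature of `vq` as well as all module names are assumptions rather than the actual WhisperSpeech code.

```python
import torch
import torch.nn as nn

class ResidualAttentionAdapter(nn.Module):
    """A single trainable residual self-attention block with an MLP."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

class QuantizedWhisperFeatures(nn.Module):
    """Frozen Whisper encoder -> VQ bottleneck -> learned positions -> adapter."""

    def __init__(self, whisper_encoder: nn.Module, vq: nn.Module, dim: int, max_len: int):
        super().__init__()
        self.encoder = whisper_encoder  # frozen; keeps its normal positional encodings
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.vq = vq  # bottleneck; the codes only need to carry semantic content
        # Learned embeddings added back after the bottleneck, before the adapter.
        self.post_vq_pos = nn.Parameter(torch.zeros(max_len, dim))
        self.adapter = ResidualAttentionAdapter(dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(mel)  # (batch, time, dim)
        # Assumed VQ interface: returns (quantized, codes, commitment loss).
        quantized, codes, vq_loss = self.vq(feats)
        t = quantized.size(1)
        # Re-add the positional information that was lost in the bottleneck.
        x = quantized + self.post_vq_pos[:t][None, :, :]
        return self.adapter(x)
```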