ScottishFold007 opened this issue 4 months ago
`quant_layer` is a list of ints that slices the codec tensor along the quantizer dimension. In practice it should always be set to `range(0, num_quantizers)` under the MusicGen delay pattern; I used to experiment with different patterns (such as training [0, 1] and [2, ...] separately), but not anymore. To answer your previous question, `align_token` is a time alignment of the text transcript. I found random cropping (also mentioned in VALL-E) to be crucial, even if the dataset is well balanced.
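For anyone else reading: a minimal sketch of what that slicing looks like, with hypothetical shapes `(batch, num_quantizers, time)` for EnCodec-style codes:

```python
import torch

# Hypothetical setup: 4 residual quantizers, 1024-entry codebooks.
num_quantizers = 4
quant_layer = list(range(num_quantizers))  # i.e. range(0, num_quantizers)

codes = torch.randint(0, 1024, (8, num_quantizers, 750))  # (batch, quantizers, time)

# quant_layer selects codebooks along the quantizer dimension; the full
# range keeps everything, while e.g. [0, 1] would keep only the first two.
codes_subset = codes[:, quant_layer, :]
assert codes_subset.shape == codes.shape
```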
Also, I trained all my models at EnCodec 3 kbps = 4 x 10-bit quantizers, but I'm considering moving to DAC at a higher bitrate soon. I wonder if that would need some loss balancing, like VoiceCraft did ...
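If it helps, here is a minimal sketch of the per-codebook reweighting idea; the weight values are placeholders, not VoiceCraft's actual numbers:

```python
import torch
import torch.nn.functional as F

# Placeholder weights: coarser (earlier) quantizers weighted higher,
# since later residual codebooks carry less perceptually important detail.
codebook_weights = [1.0, 0.8, 0.6, 0.4]

def balanced_loss(logits, targets, weights=codebook_weights):
    # logits: (batch, num_quantizers, time, vocab)
    # targets: (batch, num_quantizers, time)
    total = 0.0
    for q, w in enumerate(weights):
        total = total + w * F.cross_entropy(
            logits[:, q].reshape(-1, logits.size(-1)),
            targets[:, q].reshape(-1),
        )
    return total / sum(weights)
```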
By the way, I expanded the `target_bandwidths` parameter in EnCodec from 6 to 12 kbps; how should I set `quant_layer` here?
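For what it's worth, assuming the 24 kHz EnCodec model (75 frames/s, 10-bit codebooks, consistent with the "3 kbps = 4 x 10 bits" setup above), the quantizer count follows directly from the bandwidth, so a back-of-the-envelope sketch:

```python
# EnCodec 24 kHz: 75 frames/s, 10 bits per codebook (1024 entries),
# so each quantizer contributes 0.75 kbps.
frame_rate_hz = 75
bits_per_codebook = 10

def num_quantizers(bandwidth_kbps: float) -> int:
    return int(bandwidth_kbps * 1000 / (frame_rate_hz * bits_per_codebook))

assert num_quantizers(3) == 4    # the 4 x 10-bit setup above
assert num_quantizers(6) == 8
assert num_quantizers(12) == 16  # target_bandwidths expanded to 12 kbps

quant_layer = list(range(num_quantizers(12)))  # [0, 1, ..., 15]
```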