theodorblackbird / lina-speech

lina-speech : linear attention based text-to-speech

When the target_bandwidths parameter changes, how should quant_layer be set here? #7

Open ScottishFold007 opened 3 months ago

ScottishFold007 commented 3 months ago

By the way, regarding the `target_bandwidths` parameter in EnCodec: I expanded it from 6 to 12. How should I set `quant_layer` here?

```python
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz", local_files_only=False)
encodec_model = EncodecModel.from_pretrained("facebook/encodec_24khz", local_files_only=False)
# encodec_model.config.target_bandwidths = [6]
encodec_model.config.target_bandwidths = [12]
```
theodorblackbird commented 3 months ago

`quant_layer` is a list of ints that slices the codec tensor along the quantizer dimension. In practice you should always set it to `range(0, num_quantizers)` under the MusicGen delay pattern. I used to experiment with different patterns (such as training `[0, 1]` and `[2, ...]` separately), but not anymore.

To answer your previous question: `align_token` is a time alignment of the text transcript. I found random cropping (also mentioned in VALL-E) to be crucial, even when the dataset is well balanced.
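To make the bandwidth/`quant_layer` relationship concrete, here is a small sketch (not from the lina-speech code; the helper name and the 75 Hz / 10-bit figures are assumptions based on the public EnCodec 24 kHz configuration, where each codebook contributes 10 bits per 75 Hz frame, i.e. 0.75 kbps):

```python
def num_quantizers(bandwidth_kbps, frame_rate=75, bits_per_codebook=10):
    """Number of residual quantizers EnCodec 24 kHz uses at a given bandwidth.

    Each codebook emits `bits_per_codebook` bits per frame, so one codebook
    costs frame_rate * bits_per_codebook = 750 bps at the default settings.
    """
    return int(bandwidth_kbps * 1000 / (frame_rate * bits_per_codebook))

# 6 kbps -> 8 quantizers, 12 kbps -> 16 quantizers, 3 kbps -> 4 quantizers
quant_layer = list(range(num_quantizers(12)))  # [0, 1, ..., 15]
```

So after raising `target_bandwidths` from 6 to 12, `quant_layer` would cover twice as many codebooks.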

Also, I trained all my models at EnCodec 3 kbps (4 quantizers x 10 bits), but I'm considering DAC at a higher bitrate soon. I wonder if that would need some loss balancing, like VoiceCraft did...
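For readers unfamiliar with the MusicGen delay pattern mentioned above, here is an illustrative sketch (my own minimal version, not the lina-speech implementation): codebook `k` is shifted right by `k` steps so that at each decoding step the model predicts one token per codebook, conditioned on the less-delayed codebooks:

```python
import numpy as np

def apply_delay_pattern(codes, pad_id=-1):
    """Shift codebook k right by k steps (MusicGen-style delay pattern).

    codes: (num_quantizers, T) array of token ids.
    Returns a (num_quantizers, T + num_quantizers - 1) array padded with pad_id.
    """
    num_q, seq_len = codes.shape
    out = np.full((num_q, seq_len + num_q - 1), pad_id, dtype=codes.dtype)
    for k in range(num_q):
        out[k, k:k + seq_len] = codes[k]
    return out
```

For example, with 2 codebooks of length 3, row 0 stays in place and row 1 starts one step later, with `pad_id` filling the corners.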