ScottishFold007 opened this issue 4 months ago
`quant_layer` is a list of ints that slices the codec tensor along the quantizer dimension. In practice it should always be set to `range(0, num_quantizers)` under the MusicGen delay pattern; I used to experiment with different patterns (such as training [0, 1] and [2, ...] separately), but not anymore. To answer your previous question, `align_token` is a time alignment of the text transcript. I found random cropping (also mentioned in VALL-E) to be crucial, even if the dataset is well balanced.
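For anyone else reading: a minimal sketch of what that slicing looks like, with hypothetical shapes `(batch, num_quantizers, time)` for EnCodec-style codes:

```python
import torch

# Hypothetical setup: 4 residual quantizers, 1024-entry codebooks.
num_quantizers = 4
quant_layer = list(range(num_quantizers))  # i.e. range(0, num_quantizers)

codes = torch.randint(0, 1024, (8, num_quantizers, 750))  # (batch, quantizers, time)

# quant_layer selects codebooks along the quantizer dimension; the full
# range keeps everything, while e.g. [0, 1] would keep only the first two.
codes_subset = codes[:, quant_layer, :]
assert codes_subset.shape == codes.shape
```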
Also, I trained all my models at EnCodec 3 kbps = 4 x 10-bit quantizers, but I'm considering moving to DAC at a higher bitrate soon. I wonder if that would need some loss balancing, like VoiceCraft did ...
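If it helps, here is a minimal sketch of the per-codebook reweighting idea; the weight values are placeholders, not VoiceCraft's actual numbers:

```python
import torch
import torch.nn.functional as F

# Placeholder weights: coarser (earlier) quantizers weighted higher,
# since later residual codebooks carry less perceptually important detail.
codebook_weights = [1.0, 0.8, 0.6, 0.4]

def balanced_loss(logits, targets, weights=codebook_weights):
    # logits: (batch, num_quantizers, time, vocab)
    # targets: (batch, num_quantizers, time)
    total = 0.0
    for q, w in enumerate(weights):
        total = total + w * F.cross_entropy(
            logits[:, q].reshape(-1, logits.size(-1)),
            targets[:, q].reshape(-1),
        )
    return total / sum(weights)
```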
By the way, I expanded the `target_bandwidths` parameter in EnCodec from 6 to 12 kbps; how should I set `quant_layer` here?
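For what it's worth, assuming the 24 kHz EnCodec model (75 frames/s, 10-bit codebooks, consistent with the "3 kbps = 4 x 10 bits" setup above), the quantizer count follows directly from the bandwidth, so a back-of-the-envelope sketch:

```python
# EnCodec 24 kHz: 75 frames/s, 10 bits per codebook (1024 entries),
# so each quantizer contributes 0.75 kbps.
frame_rate_hz = 75
bits_per_codebook = 10

def num_quantizers(bandwidth_kbps: float) -> int:
    return int(bandwidth_kbps * 1000 / (frame_rate_hz * bits_per_codebook))

assert num_quantizers(3) == 4    # the 4 x 10-bit setup above
assert num_quantizers(6) == 8
assert num_quantizers(12) == 16  # target_bandwidths expanded to 12 kbps

quant_layer = list(range(num_quantizers(12)))  # [0, 1, ..., 15]
```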