huawei-noah / Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

A bug in model/tts.py #28

Open chep0k opened 1 year ago

chep0k commented 1 year ago

Formally speaking, the shape of the variable `y_cut_mask` created here may not match the shape of `y_cut` in the last dimension (which is `out_size` for `y_cut`). To see why, look at the function `sequence_mask`, which is invoked to create `y_cut_mask`: since the parameter `max_length` is not provided, the length dimension defaults to `max(length)` (see here). Thus, if all sequences in a batch passed to `GradTTS.forward(...)` are shorter than `out_size`, the last dimension of `y_cut_mask` will not match the last dimension of `y_cut`.

A simple experiment exposes the issue: start training GradTTS with `batch_size == 1`. In that case, as soon as any sequence is shorter than `out_size`, training fails with a shape mismatch.

The fix I suggest is elementary: pass `max_length=out_size` when calling `sequence_mask` here. Moreover, we should skip cropping the mel entirely when all sequences in a batch passed to `GradTTS.forward(...)` are shorter than `out_size`; concretely, I suggest adding the condition `y_max_length > out_size` here.
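Below is a minimal, self-contained sketch that reproduces the mismatch and shows the suggested fix, assuming `sequence_mask` is implemented as in this repository's `model/utils.py`; the concrete values (`out_size = 172`, a batch of one sequence of length 100) are illustrative only:

```python
import torch

def sequence_mask(length, max_length=None):
    # As in model/utils.py: when max_length is None, the mask's last
    # dimension defaults to length.max(), not out_size.
    if max_length is None:
        max_length = length.max()
    x = torch.arange(int(max_length), dtype=length.dtype, device=length.device)
    return x.unsqueeze(0) < length.unsqueeze(1)

out_size = 172                            # illustrative crop size in mel frames
y_cut_lengths = torch.LongTensor([100])   # batch_size == 1, sequence shorter than out_size

# Current behaviour: the last dim is max(y_cut_lengths) == 100, not out_size.
buggy_mask = sequence_mask(y_cut_lengths).unsqueeze(1)
print(buggy_mask.shape)   # torch.Size([1, 1, 100]) -> mismatches y_cut's [1, n_feats, 172]

# Suggested fix: pin the last dimension to out_size.
fixed_mask = sequence_mask(y_cut_lengths, max_length=out_size).unsqueeze(1)
print(fixed_mask.shape)   # torch.Size([1, 1, 172]) -> matches y_cut
```

Together with the extra guard (something like `if out_size is not None and y_max_length > out_size:` around the cropping block), batches in which every sequence is shorter than `out_size` would skip cropping altogether, so the mismatch cannot arise in the first place.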

iooops commented 10 months ago

Agreed. I encountered the same bug with my own dataset when the batch size is small.