k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

NaNs in CTC loss: solution #1777

Open kfmn opened 1 month ago

kfmn commented 1 month ago

Hi,

My colleagues and I have run into NaNs appearing in the CTC loss, and training being interrupted because of too many infinite gradients, when training Zipformer on our data. After some debugging we found the reason for this behaviour, and I would like to share the finding so that it can be fixed in all related recipes.

In the training script there is a piece of code that filters out pronunciations (token sequences) which are too long to be aligned with the feature sequence: https://github.com/k2-fsa/icefall/blob/f84270c93528f4b77b99ada9ac0c9f7fb231d6a4/egs/librispeech/ASR/zipformer/train.py#L1326
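
For reference, the existing check is roughly as follows (paraphrased from the linked recipe, so variable names and surrounding details may differ):

        T = ((c.num_frames - 7) // 2 + 1) // 2   # frames after front-end subsampling
        tokens = sp.encode(c.supervisions[0].text, out_type=str)
        if T < len(tokens):   # counts only raw tokens, without any blanks
            return False      # drop this utterance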

This code accounts only for non-blank tokens from the SentencePiece tokenizer. However, when two or more identical tokens appear in a row, CTC alignment requires a blank token between them, so the minimum usable number of frames is larger than the raw token count. Even when the condition above is satisfied, computing the CTC loss can therefore still fail. We suggest fixing it like this:

        T = ((cut.num_frames - 7) // 2 + 1) // 2
        tokens = sp.encode(cut.supervisions[0].text, out_type=str)
        num_tokens = len(tokens)
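        # CTC requires a blank between identical consecutive tokens, so each
        # repeated pair adds one extra position to the minimum alignment length: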
        for i in range(1, len(tokens)):
            if tokens[i] == tokens[i - 1]:
                num_tokens += 1
        if T < num_tokens:

After this correction, no NaNs appear in the CTC loss anymore. A similar bug seems to exist in most of the training scripts for CTC-based training.
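
To illustrate the effect outside of icefall, here is a small self-contained sketch using plain torch.nn.functional.ctc_loss with made-up values (not the recipe code): a target with a repeated label needs more frames than its raw length, and with too few frames the loss is infinite, which is exactly what poisons the gradients.

    import torch
    import torch.nn.functional as F

    # Target "1 1": two identical non-blank labels. A valid CTC alignment must
    # be at least "1 blank 1", i.e. 3 frames, even though len(target) == 2.
    targets = torch.tensor([[1, 1]])
    target_lengths = torch.tensor([2])
    vocab_size = 5  # 0 is the blank

    for T in (2, 3):
        log_probs = torch.randn(T, 1, vocab_size).log_softmax(-1)
        loss = F.ctc_loss(
            log_probs,
            targets,
            input_lengths=torch.tensor([T]),
            target_lengths=target_lengths,
            blank=0,
            reduction="none",
        )
        print(f"T={T}: loss={loss.item():.3f}")
    # Expected: T=2 -> inf (no valid alignment exists), T=3 -> a finite value.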

csukuangfj commented 1 month ago

Thanks for sharing!

Could you create a pull request to integrate your proposal?


By the way, is the audio causing NaNs very short?

kfmn commented 1 month ago

Sorry, I don't know how to make pull requests :( The audio is not very short; we faced this when working with a grapheme tokenizer, where the number of graphemes is sometimes indeed comparable to T.