Found it! When we define the model:
https://github.com/sanchit-gandhi/seq2seq-speech/blob/b1bf2c2148910d59fd8ba3f0086244e0879a65b7/run_flax_speech_recognition_ctc.py#L849-L856
We need to set the config attribute `vocab_size` to the number of elements in the tokenizer's vocabulary. Otherwise, it defaults to the `vocab_size` of the Wav2Vec2-large-lv60 checkpoint, which is the vocab size of the default Wav2Vec2 tokenizer built on LibriSpeech ASR. If the actual tokenizer's vocab size is greater than that of the default Wav2Vec2 tokenizer, the logits only span a subset of the full tokenizer vocabulary. These ill-defined logits then (likely) give rise to an ill-defined CTC loss.
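For concreteness, something along these lines when instantiating the model (a minimal sketch with the Transformers API; the checkpoint and tokenizer paths are placeholders and the surrounding script plumbing is omitted):

```python
from transformers import AutoConfig, AutoTokenizer, FlaxWav2Vec2ForCTC

# Placeholder identifiers for illustration only.
model_name_or_path = "facebook/wav2vec2-large-lv60"
tokenizer_name_or_path = "path/to/common-voice-tokenizer"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)

# Override the checkpoint's default vocab_size with the size of the tokenizer
# actually used for training, so the CTC head outputs one logit per vocab entry.
config = AutoConfig.from_pretrained(
    model_name_or_path,
    vocab_size=len(tokenizer),
)

# from_pt=True may be required if the checkpoint only ships PyTorch weights.
model = FlaxWav2Vec2ForCTC.from_pretrained(model_name_or_path, config=config)
```

The actual fix in the script may set additional config attributes as well, but the key point is that `vocab_size` must match `len(tokenizer)`.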
Great catch! Due to this the tokenizer converts too many letters to `<unk>`. We should set `vocab_size=len(tokenizer)` here.
Training the baseline CTC model on the Common Voice 9 (CV9) dataset, we observe that the training loss drops below zero after ~1.5k train steps: https://wandb.ai/sanchit-gandhi/commonvoice_9_0/runs/y593pwm4?workspace=user-sanchit-gandhi. Since the CTC loss is a negative log-likelihood, it should be strictly non-negative. Setting the `log_epsilon` hyperparameter to a more negative value does not alter this behaviour.
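For reference, a toy sketch of where that hyperparameter enters, assuming the loss is computed with `optax.ctc_loss` (which exposes a `log_epsilon` argument); all shapes and values below are made up:

```python
import jax.numpy as jnp
import optax

# Toy dimensions: batch of 2, 50 encoder frames, a 40-entry vocabulary.
batch, time, vocab = 2, 50, 40

logits = jnp.zeros((batch, time, vocab))          # model outputs, one logit per vocab entry
logit_paddings = jnp.zeros((batch, time))         # 0.0 = real frame, 1.0 = padding
labels = jnp.ones((batch, 10), dtype=jnp.int32)   # token ids, must lie in [0, vocab)
label_paddings = jnp.zeros((batch, 10))           # 0.0 = real label, 1.0 = padding

# log_epsilon is a finite stand-in for log(0) assigned to impossible alignment
# paths; making it more negative only pushes those paths closer to true -inf.
per_example_loss = optax.ctc_loss(
    logits,
    logit_paddings,
    labels,
    label_paddings,
    blank_id=0,
    log_epsilon=-1e5,
)
loss = per_example_loss.mean()
```

If the model's output dimension (the last axis of `logits`) is smaller than the tokenizer's vocabulary, some label ids fall outside `[0, vocab)` and the loss is no longer a valid negative log-likelihood, which is consistent with the `vocab_size` fix above.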