githubharald / CTCDecoder

Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.
https://towardsdatascience.com/3797e43a86c
MIT License

Different beam search output using different blankIdx value #19

Closed miqbal23 closed 4 years ago

miqbal23 commented 4 years ago

Hi, I have a question regarding your beam search implementation.

In your ctcBeamSearch method, you set blankIdx to the length of the classes (in this case, the known letters and symbols). But some other beam search implementations set it to zero.

I tested this using your example, and the two settings do differ, both in the decoded result and in how far it is from the ground truth (I'm using CER and WER).

=====Line example (using blankIdx = len(classes))=====
TARGET                  : "the fake friend of the family, like the"
BEAM SEARCH             : "the fak friend of the fomcly hae tC" CER: CER/WER: 0.25714/0.15000
BEAM SEARCH LM          : "the fake friend of the family, lie th" CER: CER/WER: 0.05405/0.03226
=====Line example (using blankIdx=0)=====
TARGET                  : "the fake friend of the family, like the"
BEAM SEARCH             : "the faetker friend of ther foarmnacly,  harse. tHhC." CER: CER/WER: 0.33333/0.22368
BEAM SEARCH LM          : "the fake friend of the family, like the " CER: CER/WER: 0.00000/0.00000

So are there cases where blankIdx should not be zero? Which value is suitable for beam search decoding?

githubharald commented 4 years ago

The neural network has one output neuron per character ("a", "b", ...) plus one for the blank, if CTC loss is used. The order of the neurons matters: you can't train the neural network to predict an "a" at neuron 0 and then suddenly use that output as the prediction for the blank. This is what you did in your experiment, which does not make sense. Usually, the deep learning framework defines the index of the blank neuron: TF 1 had it at the last index (this is why I'm using this convention in this repo), and TF 2 changed it to be 0 by default.
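
To make the point concrete, here is a minimal sketch in plain Python, using best path (greedy) decoding rather than beam search to keep it short. The function names `best_path_decode` and `move_blank_to_last`, the two-character alphabet, and the toy matrix are all made up for illustration and are not part of this repo's API. It assumes the non-blank characters keep their relative order and only the position of the blank column changes.

```python
def best_path_decode(mat, classes, blank_idx):
    """Greedy CTC decoding: argmax per time step, collapse repeats, drop blanks.

    mat is a list of T rows, each with len(classes) + 1 scores.
    blank_idx is the column that the network was TRAINED to use as the blank.
    """
    # Map every column to a character; the blank column maps to None.
    chars = list(classes)
    chars.insert(blank_idx, None)

    # Most likely column at each time step.
    best = [max(range(len(row)), key=row.__getitem__) for row in mat]

    out = []
    prev = None
    for idx in best:
        # Emit a character only when the column changes and is not the blank.
        if idx != prev and idx != blank_idx:
            out.append(chars[idx])
        prev = idx
    return "".join(out)


def move_blank_to_last(mat, blank_idx=0):
    """Reorder columns so the blank ends up last (hypothetical helper)."""
    return [row[:blank_idx] + row[blank_idx + 1:] + [row[blank_idx]] for row in mat]


classes = "ab"

# Toy output of a network trained with the blank at the LAST index:
# columns are ["a", "b", blank].
blank_last = [
    [0.8, 0.1, 0.1],
    [0.7, 0.1, 0.2],
    [0.1, 0.1, 0.8],
    [0.6, 0.2, 0.2],
    [0.1, 0.7, 0.2],
]

# Same predictions from a network trained with the blank at index 0:
# columns are [blank, "a", "b"].
blank_first = [[row[2], row[0], row[1]] for row in blank_last]

correct = best_path_decode(blank_last, classes, blank_idx=len(classes))
wrong = best_path_decode(blank_last, classes, blank_idx=0)
also_correct = best_path_decode(blank_first, classes, blank_idx=0)
```

Here `correct` and `also_correct` are both "aab", while `wrong` is "ba": decoding a blank-last matrix with blank_idx=0 treats the "a" column as the blank and shifts every other character. If a decoder expects the blank at the last index but the model emits it at index 0, reordering the columns first (as in `move_blank_to_last`) is one way to reconcile the two conventions.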