parlance / ctcdecode

PyTorch CTC Decoder bindings

Num_time_steps calculation for batch inputs is wrong #54

Closed: ankitmundada closed this issue 6 years ago

ankitmundada commented 6 years ago

When using ctcdecode with batched sequential data of variable output lengths, the shorter outputs are generally padded with zeros to match the length of the longest sample in the batch. So, logically, when ctc_beam_search_decoder loops through the timesteps of probs_seq (link for code), it should stop at the timestep corresponding to that sample's actual output length rather than the full length of probs_seq, since probs_seq also carries the extra padding in batch mode. As it stands, ctcdecode appends extra garbage characters to the end of the actual output.
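To make the failure mode concrete, here is a minimal PyTorch sketch of a padded batch (shapes and values are made up for illustration) and why decoding to the padded length emits garbage:

```python
import torch

# Hypothetical batch: sample 0 has 5 real timesteps, sample 1 has 8,
# and both are zero-padded to max_len = 8.
batch_size, max_len, num_labels = 2, 8, 29
probs = torch.zeros(batch_size, max_len, num_labels)
probs[0, :5] = torch.randn(5, num_labels).softmax(dim=-1)
probs[1, :8] = torch.randn(8, num_labels).softmax(dim=-1)
seq_lens = torch.tensor([5, 8])

# A decoder that loops over all max_len timesteps also decodes the
# three all-zero padding rows of sample 0; that is where the trailing
# garbage characters come from. Stopping at seq_lens[i] avoids it.
```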

Examples of such outputs are:

Example #1:
Prediction: didn't do before a ooooooh i o o t h o e l l e e e e e e e e e e e e e e o a o o ghx xxx xxx eee e
Reference: didn't god before

Example #2:
Prediction: and it may be a lot of things that are kind of true ornette e e l e e e e e e e e e e e a u n ghx xxx eee et
Reference: and it may be a lot of things that are kind of truer

I am using ctcdecode with the outputs from deepspeech.pytorch.

I can think of two possible solutions for this (both sketched in code just after this list):

  1. Pass the num_time_steps to ctc_beam_search_decoder as an argument: i.e. instead of size_t num_time_steps = probs_seq.size(); at this line, it should be size_t num_time_steps = size; // where size is passed in as an argument

  2. Add a check for some impossible probability value, such as -1, and break the loop whenever it is seen. I am currently using this hack in our system, and it seems to work! You can find it here. For this to work, the outputs of the DeepSpeech model are changed a bit: the extra (padded) timestep values are intentionally set to -1. The changes are here.
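Here is a minimal Python sketch of both ideas, using a greedy argmax as a stand-in for the real C++ beam search loop (the function name and per-frame layout are illustrative, not the library's actual code):

```python
def decode_loop(probs_seq, num_time_steps=None):
    # Sketch of the decoder's outer loop only; the real beam search is
    # in C++. probs_seq is a list of per-timestep probability lists,
    # possibly zero-padded at the end.
    if num_time_steps is None:
        # Current behavior: loop over everything, padding included.
        num_time_steps = len(probs_seq)
    decoded = []
    for t in range(num_time_steps):  # solution 1: caller bounds the loop
        frame = probs_seq[t]
        # Solution 2 (the hack): padded frames are filled with an
        # impossible probability such as -1, so stop at the first one.
        if frame[0] == -1:
            break
        # Greedy stand-in for the per-frame beam update.
        decoded.append(max(range(len(frame)), key=frame.__getitem__))
    return decoded
```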

The transcripts for the same examples, after using the second (hacky) method, are:

Example #1:
Prediction: didn't do before
Reference: didn't god before

Example #2:
Prediction: and it may be a lot of things that are kind of true or
Reference: and it may be a lot of things that are kind of truer
ryanleary commented 6 years ago

Thanks for the report and thorough investigation! I'll get a PR put together to address the issue.

ryanleary commented 6 years ago

@ankitmundada could you check out PR #55 and see if it rectifies your issue? You will have to change a line in decoder.py to pass the sequence lengths (see https://github.com/SeanNaren/deepspeech.pytorch/pull/239).
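For anyone following along, the caller-side change amounts to something like this (a sketch, assuming the post-#55 decode call accepts per-sample sequence lengths; see the linked PRs for the actual diff):

```python
import torch
from ctcdecode import CTCBeamDecoder

labels = list("_'abcdefghijklmnopqrstuvwxyz ")  # example label set
decoder = CTCBeamDecoder(labels, beam_width=100, blank_id=0)

# Padded network output and true per-sample lengths; dummy values here,
# in deepspeech.pytorch both come from the model's forward pass.
probs = torch.randn(2, 8, len(labels)).softmax(dim=-1)
seq_lens = torch.tensor([5, 8], dtype=torch.int32)

# Passing seq_lens lets the decoder stop at each sample's real length
# instead of decoding into the padding.
beam_results, beam_scores, timesteps, out_lens = decoder.decode(probs, seq_lens)
```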

ankitmundada commented 6 years ago

@ryanleary I have tested it and it seems to work now! Thanks for the quick update!

ryanleary commented 6 years ago

Closed by #55.