parlance / ctcdecode

PyTorch CTC Decoder bindings
MIT License

probs_seq[i].size() not equal to vocabulary.size() #125

Open lzj9072 opened 4 years ago

lzj9072 commented 4 years ago

I have double-checked the sizes and read the source code of ctc_beam_decoder.cpp, and I finally found out why this occurs. My modeling units are English words instead of English alphabet characters, but the Python code concatenates the vocabulary into one long string (''.join(vocab)) and passes it to the C++ code (const char* labels). So if the vocabulary is ["Hello", "World"], it actually becomes ["H", "e", "l", "l", "o", "W", "o", "r", "l", "d"]. Is there any solution for different modeling units?
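The mismatch described above can be reproduced in plain Python: joining a word-level vocabulary into one string discards the token boundaries, so the C++ side iterates over characters and ends up with a label count that no longer matches probs_seq[i].size().

```python
# Demonstrates how ''.join(vocab) loses token boundaries for a
# word-level vocabulary: the C++ side sees one label per character.
vocab = ["Hello", "World"]

joined = "".join(vocab)            # "HelloWorld"
labels_seen_by_cpp = list(joined)  # what const char* labels iterates over

print(labels_seen_by_cpp)  # ['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']
print(len(labels_seen_by_cpp), "labels instead of", len(vocab))
```

This is why the check fails: the acoustic model outputs a distribution over 2 word labels (plus blank), while the decoder believes the vocabulary has 10 entries.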

faresbs commented 4 years ago

Hey, I have the same situation. Did you fix the problem?

PanXiebit commented 4 years ago

@lzj9072 I ran into the same situation and made some changes to the source code. Everything works after testing, and it now supports different modeling units.

This is my repository:

https://github.com/PanXiebit/ctcdecode

The difference between my code and the source code is as follows:

https://github.com/PanXiebit/ctcdecode/commit/a604c93866fb76f0d2e783f78485081b0a943dbf
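If patching the C++ source is not an option, one workaround (purely illustrative, not taken from the linked fork) is to keep the decoder's single-character assumption intact: map each word-level label to a unique placeholder character before decoding, then translate the decoder's output back to words. The helper names below (`make_placeholder_vocab`, the CJK code-point range used for placeholders) are assumptions for the sketch, not part of ctcdecode's API.

```python
# Illustrative workaround: give the character-based decoder one unique
# placeholder character per word label, then map results back to words.
def make_placeholder_vocab(vocab, start=0x4E00):
    """Assign one unique character (from an arbitrary Unicode range,
    chosen here as an assumption) to each word-level label."""
    char_for_word = {w: chr(start + i) for i, w in enumerate(vocab)}
    word_for_char = {c: w for w, c in char_for_word.items()}
    return char_for_word, word_for_char

vocab = ["Hello", "World"]
char_for_word, word_for_char = make_placeholder_vocab(vocab)

# The decoder would be constructed with the placeholder labels, so that
# len(labels) == len(vocab) again:
placeholder_labels = [char_for_word[w] for w in vocab]
assert len(placeholder_labels) == len(vocab)

# A decoded placeholder sequence is then mapped back to words:
decoded_chars = [char_for_word["Hello"], char_for_word["World"]]
decoded_words = [word_for_char[c] for c in decoded_chars]
print(decoded_words)  # ['Hello', 'World']
```

The trade-off is cosmetic only: the decoder never sees the real spellings, so this works when you do not need a character-level language model over the words.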