kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0
416 stars 89 forks source link

decode_beams word timestamps are not always recognized #42

Closed ntaiblum closed 2 years ago

ntaiblum commented 2 years ago

Hi,

I've been testing using pyctcdecode with nemo models. We're mainly interested in getting the timestamps of the specific words while using conformers, and your implementation of this seems very useful for that!

However, it seems that when using nemo models, many words don't have their timestamps recognized properly.

When loading our model and decoding using this for example:

asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from('our_model_path')
decoder = build_ctcdecoder(asr_model.decoder.vocabulary)
logits = asr_model.transcribe([file_path], logprobs=True)[0]
text = decoder.decode_beams(logits)[0]

we get a lot of words that have -1 values for the start or end indices (mostly for the start index). On a benchmark we did, about 30% of the start indices were not recognized, and around 0.5% of the end indices. This is despite the fact that the overall performance was quite good with 13.23 WER (before using a language model).

When using this however: asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='QuartzNet15x5Base-En') all the words in the benchmark have values for the start and end (no -1 at all).

The problem is reproducible with other pre-trained models, for example: asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name='stt_en_conformer_ctc_small') also have missed indices.

The word output of these models is below.

Your input would be very appreciated. Thanks!

asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='QuartzNet15x5Base-En')

[('hello', (78, 88)), ('this', (110, 115)), ('is', (120, 123)), ('a', (128, 129)), ('teft', (133, 142)), ('recording', (150, 170)), ('to', (185, 188)), ('tap', (194, 200)), ('the', (214, 217)), ('new', (222, 227)), ('application', (238, 265)), ('the', (452, 455)), ('number', (461, 472)), ('ieve', (484, 489)), ('one', (514, 519)), ('to', (524, 527)), ('three', (539, 547)), ('four', (560, 566)), ('three', (596, 603)), ('to', (612, 615)), ('one', (628, 634)), ('thank', (691, 697)), ('you', (702, 705)), ('and', (713, 717)), ('goodbyne', (723, 740))]

asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name='stt_en_conformer_ctc_small')

[('hello', (38, 51)), ('this', (-1, 56)), ('is', (-1, 59)), ('a', (-1, 62)), ('test', (65, 71)), ('recording', (75, 88)), ('to', (-1, 93)), ('test', (97, 103)), ('the', (-1, 107)), ('new', (110, 114)), ('application', (116, 132))]

asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from('our_model_path')

[('hello', (39, 51)), ('this', (-1, 57)), ('is', (-1, 60)), ('a', (-1, 63)), ('test', (66, 73)), ('recording', (76, 89)), ('to', (-1, 94)), ('test', (97, 104)), ('the', (-1, 107)), ('new', (111, 115)), ('application', (117, 223)), ('the', (-1, 226)), ('number', (228, 240)), ('is', (-1, 252)), ('one', (256, 259)), ('two', (260, 266)), ('three', (268, 276)), ('four', (278, 295)), ('three', (297, 303)), ('two', (304, 311)), ('one', (315, 341)), ('thank', (343, 348)), ('you', (-1, 354)), ('and', (-1, 358)), ('goodbye', (361, 371))]

gkucsko commented 2 years ago

Thanks for the heads up, can reproduce locally. looks like a bug, will look into it!

gkucsko commented 2 years ago

good catch thanks, will be updated with https://github.com/kensho-technologies/pyctcdecode/pull/43

gkucsko commented 2 years ago

fixed with v0.3.0, see here: https://github.com/kensho-technologies/pyctcdecode/releases/tag/v0.3.0

ntaiblum commented 2 years ago

Thanks for the update!