Closed ntaiblum closed 2 years ago
Thanks for the heads up, can reproduce locally. looks like a bug, will look into it!
good catch thanks, will be updated with https://github.com/kensho-technologies/pyctcdecode/pull/43
fixed with v0.3.0
, see here: https://github.com/kensho-technologies/pyctcdecode/releases/tag/v0.3.0
Thanks for the update!
Hi,
I've been testing using pyctcdecode with nemo models. We're mainly interested in getting the timestamps of the specific words while using conformers, and your implementation of this seems very useful for that!
However, it seems that when using nemo models, many words don't have their timestamps recognized properly.
When loading our model and decoding using this for example:
we get a lot of words that have -1 values for the start or end indices (mostly for the start index). On a benchmark we did, about 30% of the start indices were not recognized, and around 0.5% of the end indices. This is despite the fact that the overall performance was quite good with 13.23 WER (before using a language model).
When using this however:
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='QuartzNet15x5Base-En')
all the words in the benchmark have values for the start and end (no -1 at all).The problem is reproducible with other pre-trained models, for example:
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name='stt_en_conformer_ctc_small')
also have missed indices.The word output of these models is below.
Your input would be very appreciated. Thanks!
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='QuartzNet15x5Base-En')
[('hello', (78, 88)), ('this', (110, 115)), ('is', (120, 123)), ('a', (128, 129)), ('teft', (133, 142)), ('recording', (150, 170)), ('to', (185, 188)), ('tap', (194, 200)), ('the', (214, 217)), ('new', (222, 227)), ('application', (238, 265)), ('the', (452, 455)), ('number', (461, 472)), ('ieve', (484, 489)), ('one', (514, 519)), ('to', (524, 527)), ('three', (539, 547)), ('four', (560, 566)), ('three', (596, 603)), ('to', (612, 615)), ('one', (628, 634)), ('thank', (691, 697)), ('you', (702, 705)), ('and', (713, 717)), ('goodbyne', (723, 740))]
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name='stt_en_conformer_ctc_small')
[('hello', (38, 51)), ('this', (-1, 56)), ('is', (-1, 59)), ('a', (-1, 62)), ('test', (65, 71)), ('recording', (75, 88)), ('to', (-1, 93)), ('test', (97, 103)), ('the', (-1, 107)), ('new', (110, 114)), ('application', (116, 132))]
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from('our_model_path')
[('hello', (39, 51)), ('this', (-1, 57)), ('is', (-1, 60)), ('a', (-1, 63)), ('test', (66, 73)), ('recording', (76, 89)), ('to', (-1, 94)), ('test', (97, 104)), ('the', (-1, 107)), ('new', (111, 115)), ('application', (117, 223)), ('the', (-1, 226)), ('number', (228, 240)), ('is', (-1, 252)), ('one', (256, 259)), ('two', (260, 266)), ('three', (268, 276)), ('four', (278, 295)), ('three', (297, 303)), ('two', (304, 311)), ('one', (315, 341)), ('thank', (343, 348)), ('you', (-1, 354)), ('and', (-1, 358)), ('goodbye', (361, 371))]