Any way to get character level timestamps?

kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.

Apache License 2.0


Any way to get character level timestamps? #21

Closed rbracco closed 2 years ago

rbracco commented 2 years ago

I have viewed #8 and understand how to extract timestamps, but is there a way to do this for characters instead of words? If not, is there any interest in adding it as a feature? It's something I will likely implement on my own (for non-BPE models), so I could open a PR as well if desired. Thank you.

Also a random question that probably doesn't warrant its own issue: what do the logit_score and the lm_score represent? I'm getting negative values for both; is this negative log likelihood?

poneill commented 2 years ago

Currently there's no way to extract character-level timestamps except by naively interpolating the word-level timestamps. This isn't something we've thought much about for ASR, though I could see it being useful for e.g. linguistics research. My concern about accepting a PR for that is that it would likely involve touching performance-critical paths in the decoder that are already pretty tough to work with. Could you say a bit more about the use case for this?
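For reference, naive interpolation could look roughly like the sketch below. It assumes the word-level output is available as a list of `(word, (start_frame, end_frame))` pairs and simply gives every character an equal share of its word's span. The function name `interpolate_char_frames` is hypothetical and not part of pyctcdecode, and the result is only a rough estimate, since real character durations are not uniform.

```python
from typing import List, Tuple

WordFrames = List[Tuple[str, Tuple[float, float]]]
CharFrames = List[Tuple[str, Tuple[float, float]]]


def interpolate_char_frames(word_frames: WordFrames) -> CharFrames:
    """Split each word's frame span evenly among its characters.

    Hypothetical helper: the input layout (word, (start, end)) is an
    assumption about how the word-level timestamps are stored.
    """
    char_frames: CharFrames = []
    for word, (start, end) in word_frames:
        if not word:
            continue  # nothing to split for an empty word
        step = (end - start) / len(word)
        for i, ch in enumerate(word):
            # Character i occupies the i-th equal slice of the word span.
            char_frames.append((ch, (start + i * step, start + (i + 1) * step)))
    return char_frames


if __name__ == "__main__":
    for ch, span in interpolate_char_frames([("hi", (10, 14)), ("you", (20, 26))]):
        print(ch, span)
```

Multiplying the frame indices by the model's frame duration would turn these spans into seconds.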

The logit_scores and lm_scores are log probabilities. Bear in mind that for 0 < p < 1, log(p) < 0 :).

poneill commented 2 years ago

The abuse of the term "logit" is regrettable because the logit function logit(p) = log(p / (1 - p)) is only approximately equal to log(p) when p << 1. But this abuse seems to be widespread in the literature, and if everyone is misusing a word then no one is :/
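A quick numerical check of that claim, as a sketch using only the standard library: it compares log(p) with logit(p) for a few probabilities. Both are negative for p < 0.5, and the gap between them, which equals -log(1 - p), shrinks as p approaches 0.

```python
import math


def logit(p: float) -> float:
    """Log-odds of p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.001):
        gap = logit(p) - math.log(p)  # equals -log(1 - p)
        print(f"p={p}: log(p)={math.log(p):.4f}  logit(p)={logit(p):.4f}  gap={gap:.4f}")
```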

rbracco commented 2 years ago

Thank you, that is helpful. I also understand the performance concerns.

The use cases are pretty thin. My application (mispronunciation detection and correction) involves looking at the other candidate characters at a given time step. Character-level timesteps are implemented in https://github.com/parlance/ctcdecode, but I could also search over the argmaxes of the logits to identify individual timesteps (using the given word boundaries to narrow it down). I'll play around with it and reopen if I do anything interesting. Thanks again.
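A rough sketch of the argmax idea, under several assumptions: the logits are indexable as `logits[frame][vocab_index]`, the vocabulary is character-level with the CTC blank stored as an empty string, and the word boundary is the half-open frame range `[start, end)`. The name `greedy_char_frames` is hypothetical. Note that the greedy path inside a word does not have to match the characters chosen by beam search with a language model, so the result may need reconciling with the decoded word.

```python
from typing import List, Sequence, Tuple


def greedy_char_frames(
    logits: Sequence[Sequence[float]],
    vocab: Sequence[str],
    start: int,
    end: int,
    blank: str = "",
) -> List[Tuple[str, int]]:
    """Greedy CTC decode over frames [start, end).

    Returns (char, frame) pairs, where frame is the first frame at which
    that character wins the argmax. Repeated frames are collapsed and
    blanks are dropped, following the usual CTC collapsing rule.
    """
    out: List[Tuple[str, int]] = []
    prev = None
    for t in range(start, end):
        row = logits[t]
        idx = max(range(len(row)), key=lambda i: row[i])  # per-frame argmax
        if idx != prev and vocab[idx] != blank:
            out.append((vocab[idx], t))
        prev = idx  # updated on blanks too, so "a <blank> a" yields two a's
    return out


if __name__ == "__main__":
    vocab = ["", "a", "b"]
    # Toy scores: blank, a, a, blank, b, a
    logits = [
        [0, -5, -5], [-5, 0, -5], [-5, 0, -5],
        [0, -5, -5], [-5, -5, 0], [-5, 0, -5],
    ]
    print(greedy_char_frames(logits, vocab, 0, 6))
```

Once a character's frame is known, `logits[frame]` holds the scores of the competing characters at that step, which is the information a mispronunciation check would inspect.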