Closed rbracco closed 2 years ago
Currently there's no way to extract character-level timestamps except by naively interpolating the word-level timestamps. This isn't something we've thought much about for ASR, though I could see it being useful for e.g. linguistics research. My concern about accepting a PR for that is that it would likely involve touching performance-critical paths in the decoder that are already pretty tough to work with. Could you say a bit more about the use case for this?
The logit_scores and lm_scores are log probabilities. Bear in mind that for 0 < p < 1, log(p) < 0 :).
The abuse of the term "logit" is regrettable because the logit function logit(p) = log(p / (1 - p)) is only approximately equal to log(p) when p << 1. But this abuse seems to be widespread in the literature, and if everyone is misusing a word then no one is :/
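To make the distinction concrete, here is a small sketch in plain Python (standard library only, nothing from the decoder itself). It prints log(p) next to logit(p) for a few probabilities; the helper name `logit` is just for illustration.

```python
import math


def logit(p: float) -> float:
    """True logit function: the log-odds of p."""
    return math.log(p / (1.0 - p))


# log(p) is negative for any 0 < p < 1, which is why the scores come out negative.
# logit(p) only gets close to log(p) as p becomes small.
for p in (0.5, 0.1, 0.001):
    print(f"p={p}: log(p)={math.log(p):.4f}  logit(p)={logit(p):.4f}")
```

So a negative score is expected for a log probability, and values closer to 0 mean higher probability.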
Thank you, that is helpful. Also I understand about the performance stuff.
The use cases are pretty thin. For my application (mispronunciation detection and correction) it involves looking at the other possible characters at that time step. Character-level timesteps are implemented in https://github.com/parlance/ctcdecode, but I could also search over the argmaxes of the logits to identify individual timesteps (using the given word boundaries to narrow it down). I'll play around with it and reopen if I do anything interesting. Thanks again.
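A rough sketch of the argmax idea, in case it helps anyone landing here. This is not part of the decoder: the function names, the `blank_id` default, and the assumption that word boundaries are given as frame indices into a `(frames, vocab)` logit matrix are all mine. It is also a greedy per-frame pass, so it may disagree with what the beam search actually chose.

```python
import numpy as np


def char_frames_in_word(logits, vocab, start_frame, end_frame, blank_id=0):
    """Greedy CTC pass over one word's frames.

    Returns (character, frame_index) pairs, where frame_index is the first
    frame at which that character wins the argmax. Repeated frames are
    collapsed and blanks are dropped, as in standard CTC decoding.
    """
    best = logits[start_frame:end_frame].argmax(axis=1)
    out = []
    prev = blank_id
    for offset, idx in enumerate(best):
        idx = int(idx)
        if idx != blank_id and idx != prev:
            out.append((vocab[idx], start_frame + offset))
        prev = idx
    return out


def top_k_at_frame(logits, vocab, frame, k=3):
    """The k highest-scoring symbols at a single frame, best first."""
    order = np.argsort(logits[frame])[::-1][:k]
    return [(vocab[int(i)], float(logits[frame, int(i)])) for i in order]
```

Calling `top_k_at_frame` on the frames returned by `char_frames_in_word` gives the competing characters at each step, which is the part that matters for the mispronunciation use case.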
I have viewed #8 and understand how to extract timestamps, but is there a way to do this for characters instead of words? If not, is there any interest in adding it as a feature? It's something I will likely implement on my own (for non-BPE models) so I could PR as well if desired. Thank you.
Also a random question that probably doesn't warrant its own issue: What do the logit_score and the lm_score represent? I'm getting negative values for both, is this negative log likelihood?