kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

Difficulty seeing meaningful changes with hotword boosting #18

Closed rbracco closed 2 years ago

rbracco commented 2 years ago

I am trying to test hotword boosting on a model meant to diagnose pronunciation mistakes, so the tokens are in IPA (the International Phonetic Alphabet), but otherwise everything should work the same.

I have two related issues.

  1. I'm having trouble getting the hotword to change the result at all, even when using insane hotword weights like 9999999.0. Any ideas why this might be happening?
  2. I can occasionally get the result to change, but in the example below the inclusion of a hotword changes a word in the result without actually outputting the hotword.
     Model output before CTCDecode: ðɪs wɪl bi dɪskʌst wɪð ɪndʌstɹi (this will be discussed with industry)
     Hotword used: dɪskʌsd (changing the final t to d)
     Model output after CTCDecode: ðɪs wɪl bi dɪskʌs wɪð ɪndʌstɹi (the t at the end of 'dɪskʌst' disappears)

I didn't think this was possible based on how hotword boosting works. Am I misunderstanding, or is this potentially a bug?

Env info

pyctcdecode 0.1.0
numpy 1.21.0
Non-BPE model
No LM

Code


# Assumed context: `probabilities` is a torch tensor of shape (1, classes, length)
# and `labels` is the list of IPA characters in vocabulary order.
from pyctcdecode import build_ctcdecoder

# Change from 1 x classes x length to length x classes
probabilities = probabilities.transpose(1, 2).squeeze(0)
decoder = build_ctcdecoder(labels)
hotwords = ["wɪd", "dɪskʌsd"]
text = decoder.decode(probabilities.detach().numpy(), hotwords=hotwords, hotword_weight=1000.0)

print(text)
poneill commented 2 years ago

Thanks for this-- would it be possible to share the logit matrix in a gist so we can take a closer look at this?

rbracco commented 2 years ago

Absolutely, please let me know if there's any other way I can help, or if you need it in a different format. Thank you! https://gist.github.com/rbracco/493a7886e4305a0b8021af660ce92884

poneill commented 2 years ago

Thanks, is that the same logit matrix though? I get:

θɹu ʌ sɪɹiz ʌv ɪnfɔɹmʌl ɑʊtɹɪʤ sɛʃʌnz oʊvʌɹ ðʌ nɛkst fju mʌnθs

rbracco commented 2 years ago

Oops, so sorry; I forgot I had continued playing around with it. The gist has been edited to contain the proper logits.

poneill commented 2 years ago

now getting: ðʌ fɔɹmæt ɪnsɛpʃʌn dɑkjʌmɛnt ðɪs wik fɔɹ sɪgnʌʧʌɹ

I should expect: ðɪs wɪl bi dɪskʌst wɪð ɪndʌstɹi, right?

rbracco commented 2 years ago

Ugh I'm really sorry about that, the gist has been updated for what will hopefully be the final time.

poneill commented 2 years ago

It may take some futzing with the defaults in order to see good performance for any given use case. For example, if I run:

text = decoder.decode(
    probabilities,
    hotwords=hotwords,
    hotword_weight=100,
    beam_prune_logp=-100,
    token_min_logp=-10
)

I get:

ðɪs wɪd bi dɪskʌsd wɪd ɪndʌstɹi

which is, if not a great decoding, hopefully at least evidence that the hotwords feature is working as intended.

It may be that the chosen defaults for beam_prune_logp and token_min_logp should be different when the user submits hotwords, but it's hard to tell from a single example. Ideally the user would perform a hyperparameter search in order to tune the decoder to their use case. I'm not opposed to adding a convenience function to that effect, provided that we can cover most of what people expect out of such a function, something like:

decoder = pyctcdecode.build_and_tune_decoder_from(
    train_logit_matrices,
    train_transcriptions, 
    alphabet, 
    possible_hotwords, 
    metric='wer',
    tuning_iterations=100
)

@gkucsko wdyt?

rbracco commented 2 years ago

Thanks, this will at least give me some rabbit holes to go down and see if I can tune a decent decoder myself.
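
For concreteness, here is a rough sketch of the kind of manual tuning loop suggested above, using only decode parameters that already exist; `labels` and `hotwords` are as in the original snippet, while `dev_logits` / `dev_refs` and the `cer` helper are placeholders for whatever held-out data and error metric you have on hand, not part of pyctcdecode:

import itertools
from pyctcdecode import build_ctcdecoder

def cer(ref: str, hyp: str) -> float:
    # Character error rate via a plain single-row Levenshtein distance.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / max(len(ref), 1)

decoder = build_ctcdecoder(labels)
best = None
# Small grid over the two pruning parameters discussed above.
for beam_prune_logp, token_min_logp in itertools.product([-10.0, -50.0, -100.0], [-5.0, -10.0, -20.0]):
    hyps = [
        decoder.decode(
            logits,
            hotwords=hotwords,
            hotword_weight=100.0,
            beam_prune_logp=beam_prune_logp,
            token_min_logp=token_min_logp,
        )
        for logits in dev_logits
    ]
    score = sum(cer(r, h) for r, h in zip(dev_refs, hyps)) / len(dev_refs)
    if best is None or score < best[0]:
        best = (score, beam_prune_logp, token_min_logp)

print(best)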

gkucsko commented 2 years ago

Yes, pyctcdecode relies on a character's logp clearing token_min_logp before it will propose that character as a next step. Hotwords only upweight characters that are already proposed; they do not propose their own next character. That is something we could look into adding in the future if there is need for it.
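
To illustrate that point, here is a small sketch assuming `probabilities` is the logit matrix from the gist (as a numpy array) and `labels` is the IPA vocabulary from the original snippet: the hotword boost can only act on characters that already clear the token_min_logp cutoff, so relaxing that cutoff is what lets the final character of "dɪskʌsd" be proposed and then upweighted. Exact outputs will depend on the logit matrix:

from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(labels)
# Sweep token_min_logp toward more permissive values; characters below the
# cutoff are never proposed, so the hotword boost has nothing to upweight
# until the cutoff admits them.
for token_min_logp in (-5.0, -10.0, -20.0):
    text = decoder.decode(
        probabilities,
        hotwords=["dɪskʌsd"],
        hotword_weight=100.0,
        token_min_logp=token_min_logp,
    )
    print(token_min_logp, text)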