Open axchanda opened 5 years ago
Any updates on this, please?
Currently language models can only really be built through lmplz
or through the C++ API that program calls internally.
I don't think you want a Kneser-Ney language model here. You want a feature function that happens to be a language model to boost the score of specific words. You need to think about what score those words should have (i.e. the "boost") and create your own feature function. Kneser-Ney is not a magic black box that will solve your problem.
Hi @kpu, I have this exact same question. I'm wondering if you have a more concrete idea on how this could be implemented. It would be a great feature for kenlm + DeepSpeech if we could pass in a list of words in addition to the audio, and the LM would be biased towards those words during decoding.
@JRMeyer Is this increasing the probability of a unigram? You want a wrapper around the LM that adds a constant for elements of a set?
As an MT person, if it weren't OOV I'd call this run-time domain adaptation, whereby you retrieve relevant sentences and then upweight them in training.
@JRMeyer This is a typical speech recognition feature. If I understand you correctly, you basically want to up-weight or down-weight a list of "phrases", which may be a brand name, a particular street name in an address, terminology from a particular domain, etc. It serves as a "hot-fix" mechanism for when you don't have "sentence level" data and you don't want to retrain the whole LM. As a "hot-fix", you need to manually tune the imposed up-weight or down-weight to get your desired results.
Here is a paper from Google that deals with the very same problem you have: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43819.pdf It's done in the n-gram LM framework; the enhancement is not in the n-gram model itself (as Ken said), but in the search (decoding) process.
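The core of that on-the-fly rescoring idea is just a weighted combination of scores per beam-search hypothesis. A minimal sketch (the function name and the weights `alpha`/`beta` are hypothetical; in practice they are tuned on held-out data):

```python
def hypothesis_score(am_logp, lm_logp, bias_bonus, alpha=0.8, beta=1.0):
    """Combine acoustic, LM, and contextual-bias log-scores for one
    beam-search hypothesis. bias_bonus is nonzero only when the
    hypothesis matches a boosted phrase."""
    return am_logp + alpha * lm_logp + beta * bias_bonus

# A hypothesis that completes a boosted phrase gets a better score:
plain   = hypothesis_score(-2.0, -1.0, 0.0)
boosted = hypothesis_score(-2.0, -1.0, 0.5)
```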
Besides that, you can implement it in a non-n-gram way by turning the problem into a biasing graph, say by building the phrase list into a compact prefix tree.
During decoding, attach a prefix-tree state to each search hypothesis. Each time the hypothesis consumes a word, advance its prefix-tree state accordingly.
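To make that concrete, here is a minimal sketch of the biasing prefix tree (not DeepSpeech or OpenFST code; the class name, the dict-based trie, and the per-phrase `boost` value are all illustrative assumptions):

```python
class BiasTrie:
    """Prefix tree over boost phrases; each hypothesis carries a
    trie state that is advanced one word at a time."""
    def __init__(self, phrases, boost=2.0):
        self.root = {}
        self.boost = boost  # hypothetical log-score bonus, tuned by hand
        for phrase in phrases:
            node = self.root
            for word in phrase.split():
                node = node.setdefault(word, {})
            node["$"] = True  # end-of-phrase marker

    def advance(self, state, word):
        """Advance one hypothesis's trie state by one word.
        Returns (next_state, bonus); the bonus fires when a full
        phrase completes, and a mismatch falls back to the root."""
        node = state if state is not None else self.root
        nxt = node.get(word)
        if nxt is None:
            # the word may still start a new phrase from the root
            nxt = self.root.get(word)
            if nxt is None:
                return self.root, 0.0
        return nxt, (self.boost if "$" in nxt else 0.0)

trie = BiasTrie(["main street", "acme corp"], boost=2.0)
state, bonus = trie.advance(None, "main")
state, bonus = trie.advance(state, "street")  # completes "main street"
```

In a real decoder the returned bonus would be added to the hypothesis score at each step, alongside the acoustic and LM scores.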
I don't know DeepSpeech well, but if anyone plans such a feature, I would like to help.
@kpu -- this can definitely be seen as run-time domain adaptation, where the target-domain data is limited to a small set of keywords, not sentences. It would be more than just increasing unigrams; it would mean increasing the probabilities of all n-grams in which the keywords were observed in the LM.
@dophist -- that paper sounds like exactly what I'm thinking of... So in this specific case, would it involve changes to kenlm or changes to the DeepSpeech decoder (or something else)? I think this would be a welcome PR to DeepSpeech, right @reuben?
We already depend on OpenFST and use it as a trie structure for restricting the decoder vocabulary, so you could use it for the additional contextual biasing prefix tree and advance both structures simultaneously during decoding.
@reuben, I'm not sure what you mean by "prefix" tree. Are you suggesting making two trie structures (one only containing keywords and one containing the original LM) and decoding with both simultaneously? Is there any more detail on this approach you could point to (e.g. a paper or implementation)?
@JRMeyer I'm talking about the second implementation suggested by @dophist in the comment above. A prefix tree is the same as a trie. I'm suggesting you could extend the DeepSpeech decoder to load or build the contextual biasing trie, then use it as described above. The tooling (for handling states, arcs, weights) is already integrated in the form of the OpenFST library, which should help.
You can certainly run two language models as features. Or we can make a wrapper that looks like kenlm but does some runtime changes to the probabilities. Let's figure out what you want in terms of quality first, then talk data structures.
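The wrapper idea can be sketched in a few lines. This uses a toy dict of unigram log-probabilities in place of a real kenlm model (loading one needs a model file); the class name, the `bonus` value, and the OOV penalty are illustrative assumptions, not kenlm API:

```python
class BoostedLM:
    """Looks like an LM scorer, but adds a constant log-score bonus
    for words in a boost set. A real version would wrap the kenlm
    binding and delegate scoring to it."""
    def __init__(self, base_scores, boost_words, bonus=1.5):
        self.base = base_scores          # toy stand-in: word -> log10 prob
        self.boost = set(boost_words)
        self.bonus = bonus               # the hand-tuned "boost" Ken mentions
        self.unk = -5.0                  # hypothetical OOV penalty

    def score(self, word):
        s = self.base.get(word, self.unk)
        return s + (self.bonus if word in self.boost else 0.0)

lm = BoostedLM({"the": -1.0, "acme": -4.0}, boost_words={"acme"}, bonus=1.5)
```

The decoder sees one scorer and never needs to know the boost set exists, which is what makes this a drop-in "hot-fix".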
Hi @kpu ,
I would like to ask another question, on how to achieve LM modelling on the fly. For example, I have tried Google Speech-to-Text, where you pass certain key phrases as parameters during the call and Google gives priority to those words (OOV / priority words). I see that we need to have the priority words beforehand for training the LM. But is it possible, by some means, to pass the priority words at inference time and construct the LM instantly? This is in the context of DeepSpeech.
I know it may be beyond the scope of this repo, but I would appreciate your suggestions on it. It would be a great value add if you could let me know. Thanks a lot!