mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Phrase hints in inference calls #1821

Open pvanickova opened 5 years ago

pvanickova commented 5 years ago

It would be helpful to be able to provide phrase hints (context words) at inference time to boost the probability of certain domain-specific phrases in the transcription.

E.g. when passing audio to the Python API, the user could pass a list of phrases that are likely in the current context:

phrase_hints = ['transverse compound fracture', 'high bp', 'per os']
ds.enableDecoderWithLM(args.alphabet, args.lm, args.trie, LM_ALPHA, LM_BETA, phrase_hints)
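To make the request concrete, here is a minimal sketch of one way hints could influence a result: rescoring a list of candidate transcripts and rewarding those that contain a hinted phrase. It is independent of DeepSpeech internals. The function `rescore_with_hints`, the `boost` value, and the candidate scores are all hypothetical and only illustrate the idea.

```python
def rescore_with_hints(candidates, phrase_hints, boost=2.0):
    """Return the best transcript after rewarding candidates that contain hint phrases.

    candidates: list of (transcript, log_score) pairs; a higher score is better.
    phrase_hints: phrases expected in the current context.
    boost: hypothetical bonus added to the score for each matched phrase.
    """
    def bonus(text):
        # Pad with spaces so that only whole-word phrase matches count.
        padded = f" {text.lower()} "
        return sum(boost for phrase in phrase_hints if f" {phrase.lower()} " in padded)

    return max(candidates, key=lambda c: c[1] + bonus(c[0]))[0]


# Made-up candidates and scores, purely for illustration.
candidates = [
    ("patient has hi bp", -4.0),
    ("patient has high bp", -5.0),
]
phrase_hints = ['transverse compound fracture', 'high bp', 'per os']
best = rescore_with_hints(candidates, phrase_hints)
```

With these made-up scores the hinted candidate wins, while an empty hint list leaves the original ranking unchanged. Rescoring after decoding is only an approximation; biasing the beam search itself would be needed to recover phrases that never reach the candidate list.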

kdavis-mozilla commented 5 years ago

Couldn't this be addressed by a custom language model?

pvanickova commented 5 years ago

The context may change dynamically - something that is a context for one inference wouldn't be a context for another one, e.g. different departments are using different terminology, different shops have different inventory, different parts of an app may have different context options, ...

Rebuilding the language model for each case would mean maintaining a lot of language models and updating them very frequently with new phrases.

Also, updating a general English language model enough to favour just a few high-probability phrases would probably require generating a lot of dummy text to give those phrases enough probability (I'm just guessing about this one).

kdavis-mozilla commented 5 years ago

Good point.

One of the things we are thinking about is the ability to dynamically change language models, see #1678 (Allow use of several decoders (language models) with a single model in the API). Would that be a close enough fit to your use case? (I know you'd still have to create several language models, which may be too much of a pain.)

The reason I'm asking is we are trying to decide how to best add just this functionality.

pvanickova commented 5 years ago

I've added my comments for the multiple language model feature in its thread.

Having the option to provide a list of expected phrases for the context would still be very useful in my scenario (pulling a subset of hint phrases from a frequently updated dictionary based on the source of the call).

Once there's a good way to combine probability from multiple language models, this might be implemented as an additional on-the-fly generated mini language model with high probabilities of the injected phrases perhaps?
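As a rough illustration of that idea, the sketch below builds a tiny unigram model from the hint phrases and linearly interpolates it with a base model. Everything here is an assumption made for illustration: the function names, the `hint_weight` and `floor` values, and the toy `base_lm` probabilities. A real decoder would work with n-gram scores rather than plain dictionaries.

```python
import math
from collections import Counter


def build_hint_lm(phrase_hints):
    """Build a unigram model over the words of the hint phrases (a deliberately tiny LM)."""
    counts = Counter(word for phrase in phrase_hints for word in phrase.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def interpolated_log_prob(word, base_lm, hint_lm, hint_weight=0.3, floor=1e-6):
    """Return log((1 - w) * P_base(word) + w * P_hint(word)).

    Words missing from the base model get a small floor probability
    so that the logarithm is always defined.
    """
    p_base = base_lm.get(word, floor)
    p_hint = hint_lm.get(word, 0.0)
    return math.log((1.0 - hint_weight) * p_base + hint_weight * p_hint)


# Toy base model with made-up probabilities.
base_lm = {"high": 0.01, "hi": 0.02, "bp": 0.0001}
hint_lm = build_hint_lm(["high bp", "per os"])

with_hints = interpolated_log_prob("high", base_lm, hint_lm)
without_hints = interpolated_log_prob("high", base_lm, {})
```

In this toy setup "hi" is more likely than "high" under the base model alone, and the ordering flips once the hint model is mixed in. The hint model can be rebuilt per inference call at negligible cost, which is the property the dynamic-context use case depends on.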

kdavis-mozilla commented 5 years ago

Thanks!

axchanda commented 5 years ago

"Once there's a good way to combine probability from multiple language models, this might be implemented as an additional on-the-fly generated mini language model with high probabilities of the injected phrases perhaps?"

@pvanickova Have you got the required phrase hints working? I am also looking for the same. Please help me out! Thanks!

SephVelut commented 5 years ago

Even with dynamic models, it's more accurate to provide context in the form of phrase hints at the time of inference, because a language model built with those phrase hints would apply to every inference, whereas you would rather have certain phrases apply only to certain inferences during a session, not all of them.

nmstoker commented 5 years ago

If #432 is completed, people would be able to experiment more easily with ways of handling hints and context assistance (possibly with a view to then including the more broadly applicable successful ones as part of the API).

I like the hints idea, but I think it might be valuable to gather together the distinct kinds of scenarios people want to be able to solve. In some cases distinct LMs make sense (switching between them or using them in combination, e.g. to extend vocabulary), and in others hints of specific words or potentially classes of word make sense (e.g. if you expect a number reply, it could be handy to bias in favour of numbers whilst still coping with other kinds of response).

MrityunjoyS commented 4 years ago

I'm also trying to use hinting and substitution methods to correct errors and improve recognition. I'm using the deep speech model only as ASR. I've used the deep speech 2 model to build my own pbm and scorer, as I'm trying to improve the ASR for Hindi. I'm facing issues such as the model catching only "a" when I say "Haa". I need to fix that; can you please suggest how I can implement 'hints' or 'substitution' for it?
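For the substitution part, a minimal sketch of a post-processing step is shown below. It replaces whole tokens in a transcript using a mapping chosen for the current context. The function `apply_substitutions` and the mapping are hypothetical, and the approach is only safe when the expected replies are narrow (for example a yes/no prompt), since it would otherwise rewrite legitimate occurrences of the same token.

```python
import re


def apply_substitutions(transcript, substitutions):
    """Replace whole whitespace-delimited tokens using a context-specific mapping."""
    def swap(match):
        token = match.group(0)
        # Tokens without an entry in the mapping are left unchanged.
        return substitutions.get(token, token)

    return re.sub(r"\S+", swap, transcript)


# Hypothetical mapping for a prompt where a yes/no answer is expected.
yes_no_context = {"a": "haa"}
fixed = apply_substitutions("a", yes_no_context)
```

Matching on `\S+` keeps the replacement at token level, so a token that merely contains "a" is not altered.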