mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

Allow to pass in vocabulary dynamically, and re-process #2796

Open benbucksch opened 4 years ago

benbucksch commented 4 years ago

As discussed at the conference in Berlin, I am trying to build a voice assistant, i.e. you can give voice commands to the device, it processes them, and it responds with voice. That is, I would like to build the intent parser and the layers above it.

However, the voice recognition and the intent parser need to cooperate to achieve a decent detection rate. Correct detection goes up dramatically if the system knows that only "turn lights off" is a valid command and "turn lifes off" is not. Here is an example:

DeepSpeech recognized "Turn all lies on" and "Turn all lives off" instead of "Turn all lights on" and "Turn all lights off". At least it did not recognize the opposite command; that would be really dangerous when integrated into HAL3000!

Right now, DeepSpeech with the pre-trained models is good for English dictation, but shows very poor detection rates for commands, because I cannot easily limit the vocabulary. If DeepSpeech knew that only "Turn all lights on" and "Turn light in living room on" are valid commands, and that "Turn all lives off" is not, it could take that into account during detection and get dramatically better, far more reliable results. The difference would be enormous.

One solution would be to create a custom language model, but that doesn't work in practice, because the training is very compute-intensive and the commands change all the time.

Here are a few examples of how the commands change dynamically:

  1. The user installs a new voice app, which comes with new commands.
  2. I have variables in almost all commands. Some variables are simple ("living room", "on"/"off") and have a very small set of options, but even "living room" is dynamic, based on the names the user gave to each room or device.
  3. The variable values change quickly. I cannot recompute the entire language model every time a new artist comes up on Spotify, or even in my local song archive for that matter. Even clearer: there are many similarly (but not necessarily identically) named places in the world, and which place I mean depends highly on my current position, so the probabilities / weighting need to be based on my current GPS coordinates and considered during detection. If I'm in St. Tropez, I probably mean "Sainte Maxime" and not "Saint Maximin", which is 100 km further away. They sound very similar, but I am much more likely to mean the one that is close to me. Still, I cannot always take the closer one, in case I really do want to go to "Saint Maximin". So the detector needs to consider the probability of which one I mean, based on both my location and my exact pronunciation. Only DeepSpeech has the voice information, and I need to feed in the probabilities based on the list of places, their proximity, and the user's habits. In other words, this example makes clear that cooperation between the intent parser and the voice recognition is necessary, and that the language model alone is not the solution; this needs to sit on top of it.
  4. It is similarly difficult with song names, artists and places. Song names are often in a different language, so detection rates will be really bad. Artist names are often creative, e.g. "le Shuuk". There is almost no chance that a generic detector will get the spelling of "Viktor Lazlo" ("c" vs. "k", "z" vs. "s") right. However, if I can pass in a limited list of possibilities, the detector can find the closest match, and given that there is no "victor laslo", it would pick the known "Viktor Lazlo". Additionally, as with places, the probabilities depend on how often the end user has asked for a specific artist, or even a music genre. If there are two similar-sounding but different artists, one in hard rock and one in electronic music, and I always listen to electronic music, then the latter has a higher probability. This is similar to places and their proximity.

Doing this matching in a step after the voice detection is going to be a) very difficult, because it would basically re-implement the word detection DeepSpeech already does, which is a waste of engineering time and CPU runtime; b) lossy, because the probability information DeepSpeech considered during detection has already been dropped at that point, so the result is necessarily much worse; and c) very hacky.

So, what I am thinking is:

  1. Create an API in DeepSpeech to pass in a vocabulary as plain text, so that I can pass in a list of valid commands, and DeepSpeech will pick one of those options - the most likely one, based on the speech detection. The vocabulary should allow variables/placeholders and alternatives, so that I can pass in Turn [lights|] [living room|bed room] [on|off] and Go to {place}; I presume that is more efficient to process than posting all permutations. It would accept anything for {place}. Each option should have a probability rating (0 to 1), and that probability would be considered together with the word detection. This combination of probabilities is exactly what would make the detection strong. (See the sketch after this list.)
  2. Allow me to re-process the voice with more information. If you detected that the command was Go to {place}, I will fetch a list of places, compute their probability (e.g. "Sainte Maxime" 90%, "Saint Maximin" 50% etc.), and pass that information in. Then, you re-process the voice command with this information. This is how we get a good detection rate of places and songs.
  3. Give an option in DeepSpeech to keep the audio data for more than 300 ms - more like 10-20 s. Obviously, this will use more RAM, but it is needed to allow the previous point without repeating the processing. Right now, I would have to re-process the data based on the wrongly detected string, which is very compute-intensive and slow, and gives worse detection rates, because the information about alternatives has already been dropped.
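
To make the proposal concrete, here is a purely hypothetical sketch of what APIs 1-3 could look like from the caller's side. None of the proposed methods (setCommandGrammar, rescoreWithVocabulary) exist in DeepSpeech today; only Model, createStream and feedAudioContent are real, and the names, signatures, weights and file names are invented for illustration.

```python
# Hypothetical sketch of proposed APIs 1-3. setCommandGrammar() and
# rescoreWithVocabulary() do NOT exist in DeepSpeech; they only illustrate
# the proposal. Model/createStream/feedAudioContent are the real API.
import numpy as np
from deepspeech import Model

model = Model("deepspeech-0.9.3-models.pbmm")

# API 1: weighted command templates; "[a|b]" = alternatives,
# "{place}" = free-form variable slot.
model.setCommandGrammar([
    ("Turn [lights|] [living room|bed room] [on|off]", 1.0),
    ("Go to {place}", 0.8),
])

stream = model.createStream()
audio = np.zeros(16000, dtype=np.int16)   # placeholder: 1 s of silence
stream.feedAudioContent(audio)
print(stream.intermediateDecode())

# APIs 2+3: the stream keeps its audio/decoder state for 10-20 s, so the
# caller can re-score the same utterance once it knows the template was
# "Go to {place}" and has location-based probabilities for the slot.
print(stream.rescoreWithVocabulary({
    "Sainte Maxime": 0.9,
    "Saint Maximin": 0.5,
}))
```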

If I have only the first API, that would already help me tremendously. Not having API 2 (which requires API 3) means that the processing time is 50-80% higher, but one development step at a time. :)

API 1 would be reasonably simple from an API standpoint. The model API is not touched, because the model is large and expensive to load. It would be a new API that takes this list of possible sentences as strings. I could pass this in together with the audio stream, or preload it similar to the model. The sentences would be passed in as a plain-text file (which needs to be parsed), or better yet already parsed as an array of "sentences". Each "sentence" is an array whose members can be either a single literal string ("Turn"), a list of strings ("living room", "bed room"), or a variable (allows any string).
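
For illustration only, here is one way that parsed representation could look in plain Python; the ANY marker is an invented convention for the variable case:

```python
# Illustrative only: one possible shape for the parsed form described above.
# ANY is an invented marker for a variable slot such as {place}.
ANY = object()

sentences = [
    # Turn [lights|] [living room|bed room] [on|off]
    ["Turn", ["lights", ""], ["living room", "bed room"], ["on", "off"]],
    # Go to {place}
    ["Go to", ANY],
]

# Member types within each "sentence":
#   str          -> a single literal word or phrase
#   list of str  -> alternatives, exactly one of which must match
#   ANY          -> a variable that accepts any string
```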

The implementation is a different matter :). The CTC decoder would need to consider the possible sentences and their probability. This might happen as a step after the language model is applied, or at the same time as the language model; I don't know enough about the CTC decoder to tell.
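
As a toy illustration of the scoring idea (not how the ds_ctcdecoder beam search is actually structured), a beam hypothesis could combine its acoustic log-probability with a weighted term derived from the supplied command probabilities, much like the existing scorer adds a weighted KenLM term. The weight value below is made up:

```python
import math

# Toy illustration only: score a beam hypothesis by its acoustic
# log-probability plus a weighted term from the caller-supplied command
# probability, analogous to how the existing scorer adds a KenLM term.
def hypothesis_score(acoustic_logprob: float,
                     grammar_prob: float,
                     grammar_weight: float = 2.0) -> float:
    # grammar_prob is the supplied probability (0..1) of the best matching
    # command template; 0 means "matches no known command".
    grammar_logprob = math.log(grammar_prob) if grammar_prob > 0 else -1e9
    return acoustic_logprob + grammar_weight * grammar_logprob

# "turn all lights off" matches a known command (p = 1.0) and outranks the
# acoustically similar "turn all lives off", which matches none (p = 0).
print(hypothesis_score(-4.1, 1.0))   # -4.1
print(hypothesis_score(-3.9, 0.0))   # huge penalty, effectively pruned
```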

roozbehid commented 4 years ago

So does this now work with the new scorer? I am looking for the same feature too.

Marcnuth commented 4 years ago

Are there any plans for this? +1 for this feature.

flowpoint commented 4 years ago

The functionality is basically available. DeepSpeech delivers decent accuracy, but with a very general language model, plausible-but-incorrect sentences happen.

I recommend refitting KenLM on a domain-specific corpus (the commands and the artist names); it will improve accuracy greatly. Language-model generality and accuracy are a trade-off (you don't get new artist names without refitting the LM), and regularly delivering new models is common in enterprise settings.

Or else, you can still disable the language model and build your own LM on top of the acoustic model, or fine-tune the acoustic model as well.
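
For reference, a rough sketch of the refit workflow described above, assuming KenLM's lmplz and build_binary are installed and on PATH; the corpus contents and file names are placeholders, and the exact packaging flags are in the DeepSpeech external-scorer documentation:

```python
import subprocess

# 1. Build a small domain corpus: every valid command / artist name, one per line.
commands = [
    "turn all lights on",
    "turn all lights off",
    "turn light in living room on",
    "play viktor lazlo",
]
with open("corpus.txt", "w") as f:
    f.write("\n".join(commands) + "\n")

# 2. Fit a small n-gram LM on that corpus with KenLM.
#    --discount_fallback is needed because the corpus is tiny.
subprocess.run(
    ["lmplz", "-o", "3", "--discount_fallback",
     "--text", "corpus.txt", "--arpa", "lm.arpa"],
    check=True,
)
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)

# 3. Package lm.binary plus the vocabulary into a .scorer with DeepSpeech's
#    generate_scorer_package tool (see the external-scorer docs for its
#    flags), then load it at runtime with model.enableExternalScorer().
```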

As for interpreting the language model output: the Python API already exposes intermediate decoding and multiple candidate transcripts with metadata (confidences), with tokens that can range from single letters to multiple words.

See the Python API: intermediateDecodeWithMetadata.

That should suffice for scripting word disambiguation, context-based interpretation, and more.
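
A minimal sketch of how that could look, assuming the deepspeech 0.9.x Python package, a 16 kHz mono 16-bit WAV, and placeholder model/audio file names:

```python
import wave
import numpy as np
from deepspeech import Model

model = Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# Read the whole WAV file as 16-bit samples.
with wave.open("command.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

stream = model.createStream()
stream.feedAudioContent(audio)

# Ask for several candidate transcripts instead of only the best one.
metadata = stream.intermediateDecodeWithMetadata(num_results=5)
for transcript in metadata.transcripts:
    text = "".join(token.text for token in transcript.tokens)
    print(f"{transcript.confidence:.2f}  {text}")

# Downstream, the caller can match these candidates (and their confidences)
# against the list of currently valid commands.
```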

jschueller commented 4 years ago

Would you know where to find an example of this?

flowpoint commented 4 years ago

No, but retraining KenLM is well documented (generating an external / custom scorer), and the logic for interpreting the recognized tokens would be custom to each use case (what to make of the confidence scores and other details).

karishma1526 commented 3 years ago

How can I add domain-specific words to the vocabulary for DeepSpeech2 so that it transcribes them correctly for ASR?