Open benbucksch opened 4 years ago
So does this now work with the new scorer? I am looking for the same feature too.
Are there any plans for this? +1 for this feature.
The functionality is basically available. DeepSpeech delivers decent accuracy, but with a very general language model, plausible but incorrect sentences happen.
I recommend refitting KenLM on a domain-specific corpus (the commands and the artist names); it will improve accuracy greatly. Language model generality and accuracy are a tradeoff (new artist names won't be recognized without refitting the LM), and delivering new models is common (in enterprise).
Alternatively, you can disable the scorer and build your own LM on top of the acoustic model, or fine-tune the acoustic model as well.
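As a concrete (made-up) illustration of what refitting on a specific corpus means: the first step is just a plain text file with one in-domain sentence per line. The commands and artist names below are placeholders, and the resulting file is what the KenLM / scorer tooling mentioned later would consume:

```python
from itertools import product

# Made-up domain data; replace with your real commands and artist names.
rooms = ["living room", "bed room", "kitchen"]
states = ["on", "off"]
artists = ["miles davis", "nina simone"]

lines = [f"turn the lights in the {room} {state}" for room, state in product(rooms, states)]
lines += [f"play music by {artist}" for artist in artists]

# KenLM expects one sentence per line; lowercase to match the DeepSpeech alphabet.
with open("domain_corpus.txt", "w") as f:
    f.write("\n".join(lines) + "\n")
```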
As for interpreting the language model output: the Python API already exposes intermediate decoding with multiple candidate transcripts and metadata (confidences), with tokens that can range from a single letter to multiple words.
See the Python API: `intermediateDecodeWithMetadata`.
That should suffice for scripting word disambiguation, context-based interpretation, and more.
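For reference, a minimal sketch of pulling candidate transcripts and their confidences through the 0.9.x Python bindings (the model, scorer, and audio file names are placeholders):

```python
import wave
import numpy as np
import deepspeech

# Placeholder paths; point these at your own model, scorer, and recording.
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# Audio must be 16 kHz, 16-bit mono PCM.
with wave.open("command.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

stream = model.createStream()
stream.feedAudioContent(audio)

# Ask the decoder for several candidates instead of just the best one.
metadata = stream.intermediateDecodeWithMetadata(num_results=5)
for candidate in metadata.transcripts:
    text = "".join(token.text for token in candidate.tokens)
    print(candidate.confidence, text)
```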
Would you know where to find an example of this?
No, but retraining KenLM is well documented (generating an external scorer / custom scorer), and the logic for interpreting the recognized tokens would be custom to the use case (what to make of the confidence scores and other details).
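For a rough pointer, the scorer-generation flow from the DeepSpeech docs can be driven like this, assuming a 0.9.x source checkout and a compiled KenLM; all paths and flag values below are illustrative, so check the external scorer documentation for the exact options:

```python
import subprocess

# Illustrative paths; adjust to your DeepSpeech checkout and KenLM build.
# Step 1: build a pruned, binarized KenLM model from the domain corpus.
subprocess.run([
    "python3", "data/lm/generate_lm.py",
    "--input_txt", "domain_corpus.txt",
    "--output_dir", "lm_out",
    "--top_k", "5000",
    "--kenlm_bins", "kenlm/build/bin",
    "--arpa_order", "5",
    "--max_arpa_memory", "85%",
    "--arpa_prune", "0|0|1",
    "--binary_a_bits", "255",
    "--binary_q_bits", "8",
    "--binary_type", "trie",
], check=True)

# Step 2: package the LM and vocabulary into a .scorer usable by DeepSpeech.
subprocess.run([
    "./generate_scorer_package",
    "--alphabet", "data/alphabet.txt",
    "--lm", "lm_out/lm.binary",
    "--vocab", "lm_out/vocab-5000.txt",
    "--package", "domain.scorer",
    "--default_alpha", "0.93",
    "--default_beta", "1.18",
], check=True)
```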
How can I add domain-specific words to the vocabulary for DeepSpeech2 so that the ASR transcribes them correctly?
As discussed at the conference in Berlin, I am trying to build a voice assistant, i.e. you give voice commands to the device, it processes them, and it responds with voice. I would like to build the intent parser part and upwards.
However, the voice recognition and the intent parser need to cooperate to get a decent detection rate. Correct detection goes up dramatically if the system knows that "turn lights off" is a valid command and "turn lifes off" is not. Here are examples:
At least, DeepSpeech did not recognize the opposite: "Turn all lies on" and "Turn all lives off"! That would be really dangerous when integrated into HAL3000!
Right now, DeepSpeech with the pre-trained models is good for English dictation, but has very bad detection rates for commands, because I cannot easily limit the vocabulary. If DeepSpeech knew that only "Turn all lights on" and "Turn light in living room on" are valid commands, and "Turn all lives off" is not, it could take that into account during detection, get dramatically better results, and be a lot more reliable. The difference would be enormous.
One solution would be to create a custom language model, but that doesn't work in practice, because the training is very compute intensive and the commands change all the time.
Here are a few examples of how the commands change dynamically:
Doing this matching in a step after the voice detection is going to be:
a) very difficult: it would basically re-implement the word detection that DeepSpeech already does, which is a waste of engineering time and CPU runtime;
b) lossy: the probability information that DeepSpeech considered during detection is lost at this point, so the result is necessarily much worse;
c) very hacky.
So, what I am thinking is:
1. `Turn [lights|] [living room|bed room] [on|off]` and `Go to {place}`. I would presume that's more efficient to process than if I post all permutations. It would accept anything for `{place}`.
2. Each option should have a probability rating (0 to 1), and that probability would be considered in the detection together with the word detection. This combination of probabilities is exactly what will make the detection strong.
3. For `Go to {place}`, I will fetch a list of places, compute their probability (e.g. "Sainte Maxime" 90%, "Saint Maximin" 50% etc.), and pass that information in. Then, you re-process the voice command with this information. This is how we get a good detection rate of places and songs.

If I have only the first API, that would already help me tremendously. Not having API 2 (which requires 3) means that the processing time is 50-80% higher, but one development step at a time :).
API 1 would be reasonably simple from an API standpoint. The model API is not touched, because the model is large and expensive to load. It would be a new API that takes this list of possible sentences as strings. I could pass it in together with the audio stream, or preload it similar to the model. The sentences would be passed in as a plain text file (which needs to be parsed), or better yet already parsed as an array of "sentences". Each "sentence" is an array whose members can be either a single literal string ("Turn"), a list of alternative strings ("living room", "bed room"), or a variable (which allows any string).
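To make the shape of that data concrete, here is a hypothetical sketch; none of this exists in DeepSpeech, and the names `Literal`, `OneOf`, `Variable`, and `set_allowed_sentences` are made up purely for illustration:

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical building blocks for one "sentence" pattern.
@dataclass
class Literal:          # a single fixed word, e.g. "Turn"
    text: str

@dataclass
class OneOf:            # a list of alternatives, e.g. "living room" | "bed room"
    options: List[str]

@dataclass
class Variable:         # accepts any string, e.g. {place}
    name: str

Token = Union[Literal, OneOf, Variable]
Sentence = List[Token]

# "Turn [lights|] [living room|bed room] [on|off]" from the proposal
turn_command: Sentence = [
    Literal("Turn"),
    OneOf(["lights", ""]),
    OneOf(["living room", "bed room"]),
    OneOf(["on", "off"]),
]

# "Go to {place}" from the proposal
go_command: Sentence = [Literal("Go"), Literal("to"), Variable("place")]

# The imagined entry point: preloaded next to the model, before streaming audio.
# model.set_allowed_sentences([turn_command, go_command])   # hypothetical API
```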
The implementation is a different matter :). The CTC decoder would need to consider the possible sentences and their probabilities. That might be a step after the language model is applied, or happen at the same time as the language model. I don't know enough about the CTC to tell.