mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Specify grammar file SRGS #1290

Open NicoHood opened 6 years ago

NicoHood commented 6 years ago

Is there any way to specify a speech recognition grammar? I am sure that DeepSpeech would work better if it were used with a grammar. Am I missing something, or is this not yet implemented?

A possible grammar file format could be SRGS: https://www.w3.org/TR/speech-grammar/
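For illustration, a minimal SRGS grammar in its XML form for the kind of commands described further down in this thread might look roughly like this (the phrases are just the examples used later; this is a sketch, not a tested grammar):

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0" root="command" mode="voice">
  <rule id="command" scope="public">
    <item>turn</item>
    <one-of>
      <item>on</item>
      <item>off</item>
    </one-of>
    <one-of>
      <item>green LED</item>
      <item>bedroom lights</item>
    </one-of>
  </rule>
</grammar>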

Additionally, a confidence level should indicate how accurate the match is. I've tested this on Windows and, sad but true, it works perfectly offline. I would hope to get something like this with DeepSpeech.

kdavis-mozilla commented 6 years ago

Currently there is no means to specify a speech recognition grammar. So we will take this as a feature request.

Could you describe your use case in a bit more detail?

NicoHood commented 6 years ago

Thanks, indeed it is a feature request.

The story of the use case:

I want to control my LEDs, etc. in my room, and I am searching for an offline (Linux) solution to accomplish this. I am not willing to send everything to the cloud, not even on demand.

My current solution is Snowboy hotword detection, which recognizes a generic hotword (like "alexa") and then a personalized hotword. It works well, but not perfectly, as all words are recognized independently of each other.

Mozilla DeepSpeech does not work for me, as it does not properly recognize my voice, possibly due to my "not so perfect" microphone or other factors. I used the provided models, etc.

Then I tried the Microsoft speech recognition (note: there are 3 APIs available, of which only this one works properly for me). I built it using the example from here and combined it with the sample further down this page.

The use case

This speech recognition works perfectly with a specified grammar. You can say "turn on green LED" or "turn off bedroom lights". It only recognizes words within the grammar; this way it is more precise and will not recognize "right" instead of "light".

The Microsoft speech API also outputs a confidence level (many other APIs do that as well). If you filter out all "low" hits and only accept "medium" and "high" recognition results, you get very decent voice detection.

If this grammar mechanism could be applied to DeepSpeech, I hope it would get even better at recognizing predefined commands for machine interaction. It would not be used for dictation, but rather for very precise control of LEDs/car/etc. with a known command set (aka grammar); see the sketch below.
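Purely as an illustration of the use case, once the recognizer returns a transcript constrained to such a grammar plus a confidence label, the client-side glue can stay very simple; the script below is hypothetical (the command phrases are the examples above, and the low/medium/high labels mirror the Microsoft API behaviour described earlier):

#!/bin/bash
# Hypothetical glue code: $1 = recognized transcript (lowercase),
# $2 = confidence label reported by the recognizer ("low" / "medium" / "high").
transcript="$1"
confidence="$2"

# Drop low-confidence results, as described above.
[ "$confidence" = "low" ] && exit 1

# Dispatch only phrases that are part of the known command set (the grammar).
case "$transcript" in
    "turn on green led")       echo "LED green -> on" ;;
    "turn off green led")      echo "LED green -> off" ;;
    "turn on bedroom lights")  echo "bedroom lights -> on" ;;
    "turn off bedroom lights") echo "bedroom lights -> off" ;;
    *) echo "not in grammar: $transcript" >&2; exit 1 ;;
esac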

breandan commented 6 years ago

Supporting grammars would be a really nice convenience for developers. Most ASR UIs are not dictation based, but specified by a command-based grammar (e.g. UIs for visual impairment, telephony, hands-free applications). You want to constrain the vocabulary in certain contexts to avoid spurious detections. While this could be handled on the client side, it would require testing some threshold criteria, searching through the top-N probabilities, or implementing some sort of Viterbi search. I would like to use DeepSpeech for idear, but this is currently difficult due to its dependence on a fixed grammar. HMMs still work better, even if the detection accuracy is lower.

AtosNicoS commented 6 years ago

As a first proof of concept and workaround, I am using the Levenshtein distance algorithm to compare the DeepSpeech result with a grammar string. The lower the distance, the better the match. It only works with lowercase text and no punctuation, and you need to add some code around it to get a percentage value depending on the string length (see the sketch after the transcript below).

https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance

$ cat levenshtein.sh 
#!/bin/bash

#https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance
function levenshtein {
    if [ "$#" -ne "2" ]; then
        echo "Usage: $0 word1 word2" >&2
    elif [ "${#1}" -lt "${#2}" ]; then
        levenshtein "$2" "$1"
    else
        local str1len=$((${#1}))
        local str2len=$((${#2}))
        local d i j
        for i in $(seq 0 $(((str1len+1)*(str2len+1)))); do
            d[i]=0
        done
        for i in $(seq 0 $((str1len))); do
            d[$((i+0*str1len))]=$i
        done
        for j in $(seq 0 $((str2len))); do
            d[$((0+j*(str1len+1)))]=$j
        done

        for j in $(seq 1 $((str2len))); do
            for i in $(seq 1 $((str1len))); do
                [ "${1:i-1:1}" = "${2:j-1:1}" ] && local cost=0 || local cost=1
                local del=$((d[(i-1)+str1len*j]+1))
                local ins=$((d[i+str1len*(j-1)]+1))
                local alt=$((d[(i-1)+str1len*(j-1)]+cost))
                d[i+str1len*j]=$(echo -e "$del\n$ins\n$alt" | sort -n | head -1)
            done
        done
        echo ${d[str1len+str1len*(str2len)]}
    fi
}

levenshtein "$1" "$2"

---

$ arecord -d 10 -f S16_LE -r 16000 test.wav || ./levenshtein.sh "$(deepspeech output_graph.pb alphabet.txt test.wav 2>/dev/null)" "der apfel ist rot und die banane gelb"
Recording WAVE 'test.wav' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
^CAborted by signal Interrupt...
8
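For completeness, here is a rough sketch of the "percentage value" mentioned above: it turns the raw distance into a similarity score relative to the length of the expected grammar string. The DeepSpeech invocation is the one from the transcript above; the 80% threshold is an arbitrary placeholder.

#!/bin/bash
# Naive similarity score: how close is the recognized text to the expected command?
expected="der apfel ist rot und die banane gelb"
result="$(deepspeech output_graph.pb alphabet.txt test.wav 2>/dev/null)"

dist=$(./levenshtein.sh "$result" "$expected")

# Percentage of the expected string that survived (can go negative for very bad matches).
similarity=$(( 100 * (${#expected} - dist) / ${#expected} ))
echo "distance: $dist, similarity: ${similarity}%"

# Accept the command only above an (arbitrary) confidence threshold.
if [ "$similarity" -ge 80 ]; then
    echo "match"
fi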
slavaGanzin commented 6 years ago

@NicoHood I think if you use a small set of sentences (i.e. your grammar) as the input to the language model and train the already trained model (or even a new one) on your grammar, it will overfit very fast, i.e. it will memorize the link from your voice to the grammar entries (commands). Which, contrary to the general tendency, is what you want here.

Then you should increase the language model weight in the transcriber module.

Creating a language model is discussed here (look for step 1): https://discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot/22830
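For reference, that "step 1" boils down to building a small KenLM language model over a text file that contains only the allowed commands, roughly along these lines (the file names are made up, and the exact flags depend on the KenLM version you built):

# Assumption: commands.txt contains one allowed command sentence per line.
# Build a tiny ARPA language model over just those sentences
# (--discount_fallback is usually needed for such a small corpus) ...
lmplz --order 3 --discount_fallback --text commands.txt --arpa commands.arpa

# ... and convert it to KenLM's binary format so the decoder loads it quickly.
build_binary commands.arpa commands.binary

The DeepSpeech client of that era also expects a trie generated with its generate_trie tool from the same alphabet and language model (the exact invocation depends on the client version); increasing the language model weight then biases the decoder towards the command set, as suggested above.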