ideasman42 / nerd-dictation

Simple, hackable offline speech to text - using the VOSK-API.
GNU General Public License v3.0

Add support for speech to commands #19

Open omlins opened 3 years ago

omlins commented 3 years ago

PR #17 is a Proof of Concept of how speech to commands could be supported. The idea is the following:

  1. Match the first word against a command name stored in a command dictionary (WORD_CMD_MAP) or against the command name reserved for dictation ("type"); retry if there is no match (resetting everything).
  2. Process depending on the command name:
    • if it is the dictation command: process the text as before
    • else: match the command arguments against the command tree dictionary (WORD_CMD_MAP) until a full command is identified; then launch it and reset nerd-dictation (a minimal sketch follows below)
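
A minimal sketch of such a command tree and of walking it word by word; the concrete WORD_CMD_MAP entries and xdotool commands are only illustrative, loosely following PR #17:

# Illustrative command tree: nested dicts map spoken words either to
# further sub-dictionaries or, at the leaves, to a command to launch.
WORD_CMD_MAP = {
    "right": ["xdotool", "click", "3"],  # "right" -> right mouse click
    "window": {
        "close": ["xdotool", "key", "alt+F4"],       # "window close"
        "maximize": ["xdotool", "key", "super+Up"],  # "window maximize"
    },
}

def match_command(words, tree=WORD_CMD_MAP):
    # Walk the tree word by word: return ("match", cmd) once a full command
    # is identified, ("partial", None) while more words are needed, and
    # ("none", None) on a mismatch.
    node = tree
    for word in words:
        if not isinstance(node, dict) or word not in node:
            return ("none", None)
        node = node[word]
        if isinstance(node, list):
            return ("match", node)
    return ("partial", None)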

For this workflow, nerd-dictation should provide a reset function that can be called from within the configuration script (nerd-dictation.py). Moreover, it could be very useful if this reset function accepted a command name, which could then be passed on to nerd_dictation_process as an optional argument. When this optional argument is given, the first step above could be skipped; e.g., one could directly enter dictation mode just as before (and avoid having the dictation command name as the first word, which likely influences the statistical natural language prediction negatively). Furthermore, with little modification this would also allow freely dictated arguments to be passed to certain commands. Finally, the whole workflow enables continuous listening for commands without reloading the VOSK model, and commands are only emitted if securely identified from the first word (so the following would not be needed: Add '--commands' command line argument to restrict input to a limited set of commands (#3)).

Now, while this all seems very straightforward, there is one crucial issue to solve in order to enable efficient speech to commands: commands seem to be recognized very badly by the normal VOSK natural language model(s) (at least in my few tests). The model expects as first word, e.g., "hi" or "hello" rather than any arbitrary word that we might want to use as a command name. As a result, a command name like "right" (right mouse click) is most of the time recognized as "hi" by the VOSK model. In consequence, I believe it will be necessary to use a different VOSK model for command recognition than for natural language dictation. I don't know if the "Speaker identification model" (see: https://alphacephei.com/vosk/models) might be of any use; otherwise, one could create a very simple VOSK model based on the command tree dictionary (WORD_CMD_MAP). For technical details, it would certainly help to learn more about how the VOSK model for command recognition was built for this Android app: https://realize.be/blog/offline-speech-text-trigger-custom-commands-android-kaldi-and-vosk https://github.com/alphacep/vosk-api/issues/41

While the creation of a simple VOSK model for command recognition is probably a bit of work, I believe that it would lead to an exceptional model (as it would contain only exactly what should be recognized).
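
As a side note, the vosk Python bindings appear to allow restricting recognition to a fixed phrase list by passing a JSON grammar to KaldiRecognizer (with a model that supports runtime grammar reconfiguration, as the small models do), which might serve as a lightweight alternative to training a dedicated model. A minimal sketch; the command list and input file are only illustrative:

import json
import wave

from vosk import Model, KaldiRecognizer

model = Model("model")  # any standard VOSK model directory
# "[unk]" catches everything that is not in the command vocabulary:
grammar = json.dumps(["right", "left", "type", "[unk]"])

wf = wave.open("test.wav", "rb")  # 16-bit mono PCM audio
rec = KaldiRecognizer(model, wf.getframerate(), grammar)

while True:
    data = wf.readframes(4000)
    if not data:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result())["text"])
print(json.loads(rec.FinalResult())["text"])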

omlins commented 3 years ago

@ideasman42:

ideasman42 commented 3 years ago

@omlins for your latter two questions, no - I'm a complete novice at speech-to-text.

For the first question, I had some thoughts about how this could be implemented.

Suggest the following behavior:

  • Returning None from the user configuration is a signal that the current phrase has been consumed and speech processing should be reset.
  • Support returning a tuple, so you could for example return ("some text", None) to write out the text, then reset text processing.

omlins commented 3 years ago

Returning None from the user configuration is a signal that the current phrase has been consumed and speech processing should be reset. Support returning a tuple, so you could for example return ("some text", None) to write out the text, then reset text processing.

This way of implementing the reset seems absolutely fine to me. To be able to implement the workflow suggested above, it is important though that persistent user variables like nerd_dictation_process.cmd_name in PR #17 are not affected by the reset of the speech processing; users should reset their persistent variables themselves when needed. This makes it possible to keep, e.g., the information about which kind of command we are processing across a reset, as needed for the workflow suggested above. Besides that, I noticed that nerd-dictation errors if the returned text is "". Returning "" should be valid in order to support the case where there is no text to return but no reset should occur (e.g., when all words so far are part of a command, but the command is still incomplete).
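
Putting the pieces together, a user configuration could then look roughly as follows. This is only a sketch: it assumes the None / ("text", None) / "" return protocol quoted above (with "" made valid), the persistent cmd_name attribute of PR #17, and a purely illustrative single-word command map:

import subprocess

# Illustrative single-word command map:
WORD_CMD_MAP = {"right": ["xdotool", "click", "3"]}

def nerd_dictation_process(text):
    words = text.split()
    if not words:
        return ""  # no text to emit, no reset: keep listening
    if nerd_dictation_process.cmd_name is None:
        if words[0] == "type" or words[0] in WORD_CMD_MAP:
            nerd_dictation_process.cmd_name = words[0]
        else:
            return None  # first word matches no command: reset and retry
    if nerd_dictation_process.cmd_name == "type":
        return " ".join(words[1:])  # dictation mode: emit the text as before
    subprocess.run(WORD_CMD_MAP[nerd_dictation_process.cmd_name])
    nerd_dictation_process.cmd_name = None  # persistent state is reset by the user
    return None  # command launched: reset speech processing

nerd_dictation_process.cmd_name = None  # persists across speech-processing resets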

This means the user configuration could handle text however they like; if a command was added in the middle of a sentence, the text could be returned and the command consumed with whatever logic the user configuration chooses.

I wouldn't think it is very important that a command can be added in the middle of a sentence, as to my understanding there are several issues with that:

  1. the statistical VOSK model "gets confused" and produces worse speech recognition (any word that occurs randomly in a dictated text certainly "confuses" the model),
  2. the recognition of a command in the middle of a sentence is probably also not very good and depends on what precedes it,
  3. allowing commands in the middle of a sentence easily leads to unintended actions (which would need to be mitigated by a measure like the one proposed in #3), and
  4. dictating a word that is at the same time a command would require inventing an escape mechanism.

Thus, I believe it is best to only try to match the first word to a command name (if the command name is not yet known) and, if it cannot be matched, to reset immediately and retry matching the first word that comes in... I believe this is a pretty secure mechanism to determine the kind of command the user wants to process... and hopefully we can later do this with a second VOSK model built just for command matching.

To come back to the reset: @ideasman42, can you make the necessary changes to enable the reset mechanism you proposed above? As soon as this is done, I can adapt PR #17 to use this reset mechanism. :)

ideasman42 commented 3 years ago

@omlins I'd rather not draw distinctions between commands and regular text at the level of nerd-dictation.

If users have the ability to reset text processing - this distinction can be made in their user configuration.

This gives the most flexibility and doesn't make assumptions about how people will use speech input. Further, I see no significant downsides in keeping this general - having a distinction between commands and regular text seems like an unnecessary limitation which would prevent people from mixing regular text with commands in ways neither of us has considered.

One of the reasons I didn't want to use any of the existing solutions is that they have constraints relating to how text/commands are handled which I'd rather avoid.


Thus, I believe it is best to only try to match the first word to a command name (if the command name is not yet known) and, if it cannot be matched, to reset immediately and retry matching the first word that comes in.

This can be handled by user configuration, as long as text parsing can be reset.


To come back to the reset: @ideasman42, can you make the necessary changes to enable the reset mechanism you proposed above?

Yes, although I'm not sure when exactly I'll be able to get around to it.

omlins commented 3 years ago

I see no significant downsides in keeping this general

@ideasman42, I also don't see any downsides in keeping this general so far, and I am very positive that it can stay that way! I believe the reset mechanism you proposed should work perfectly fine, without losing any generality. To clarify, my comment

I wouldn't think it is very important that a command can be added in the middle of a sentence, as to my understanding there are several issues with that:

  1. (...)

was at this point just to communicate my concerns in this regard, in order to make sure that we have the same issues in the back of our minds and that I understand your proposals and decisions.

This can be handled by user configuration, as long as text parsing can be reset.

Yes, that's what I have done in PR #17 .

Yes, although I'm not sure when exactly I'll be able to get around to it.

Great! Just to get an idea: are we talking about days or weeks or...?

patricksebastien commented 3 years ago

Would love to use this. I am building a simple foot controller to begin/end nerd-dictation, but if I could also use nerd-dictation to send commands, that would be fantastic (i.e.: ssh to this server, open my IDE, run this shell, etc.). So I am +1 on this feature request.

omlins commented 3 years ago

Would love to use this. I am building a simple foot controller to begin/end nerd-dictation, but if I could also use nerd-dictation to send commands, that would be fantastic (i.e.: ssh to this server, open my IDE, run this shell, etc.). So I am +1 on this feature request.

Yes, I have the same in mind. :) For this, though, we first need the reset feature to enable a first implementation.

LexiconCode commented 2 years ago

Greetings everyone! Nerd-dictation is a neat project. As for adding commands, I thought that Dragonfly would be an excellent system to integrate with to create commands. Dragonfly offers a lot.

Dragonfly currently supports several speech recognition engines, so why not add nerd-dictation to the list by leveraging the vosk-api!

Quickstart: if you want to experiment with the grammar below, you don't even need a speech recognition backend installed. Instead you can use the text engine, which allows you to type commands to emulate them as if they were dictated by voice: lowercase mimics commands, UPPERCASE mimics free dictation, and upper- and lowercase words can be mixed, e.g. say THIS IS A TEST.

  1. To install, simply pip install dragonfly2
  2. Save the following code as _dragonfly_example.py (it could be named anything)
  3. From the command line, run python -m dragonfly test _dragonfly_example.py --delay 2
  4. Type "hotel" in the prompt
  5. After the two seconds defined by --delay 2, the following text will appear: hotels are not cheap
    • The delay allows the user to switch to the relevant application to test commands

Alternatively, use echo "hotel" | python -m dragonfly test _dragonfly_example.py if you want a one-liner.

Example grammar, written to be understandable for non-programmers:


from dragonfly import (BringApp, Key, Function, Grammar, Playback, 
                       IntegerRef, Dictation, Choice, WaitWindow, MappingRule, Text, Mouse,)

def my_function(n, text):
    print("put some Python logic here: " + str(text))

class MainRule(MappingRule):
    # The file paths / key-shortcut emulation are obviously for Windows but can easily be tweaked for any other OS.
    mapping = {
    # It is this section that you want to fiddle around with if you're new: mapping, extras, and defaults.

    # In the next line, there are two things to observe:
    # the first is the use of parentheses and the pipe symbol (|)
    # --this lets me use either "lock dragon" or "deactivate" to trigger that command.
    # The next is the playback action, which lets me tell Dragon to simulate me speaking some words.
    # if "go to sleep" command is not available it will print out `go to sleep` as text
    '(lock Dragon | deactivate)':   Playback([(["go", "to", "sleep"], 0.0)]),

    # Here I'm using BringApp-- this is the same as typing what goes in between the parentheses
    # Into the Windows command prompt, without the quotes and commas, like:
    # explorer C:\NatLink\NatLink\MacroSystem
    # -- (which would open Windows Explorer at the specified location). Anything you can do with the command line can be done this way
    "open natlink folder":          BringApp("explorer", r"C:\NatLink\NatLink\MacroSystem"),

    # Here I'm using the Key action to press some keys -- see the documentation here: https://dragonfly2.readthedocs.io/en/latest/actions.html?highlight=key#module-dragonfly.actions.action_key
    "remax":                        Key("a-space/10,r/10,a-space/10,x"),

    # Here I'm chaining a bunch of different actions together to do a complex task
    "(show | open) documentation":  BringApp('C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe') + WaitWindow(executable="chrome.exe") + Key('c-t') + WaitWindow(title="New Tab") + Text('https://dragonfly2.readthedocs.io/en/latest') + Key('enter'),

    # Here I'm just saying one word to trigger some other words
    "hotel":                        Text("hotels are not cheap"),

    # If you need to do more complicated tasks, or use external resources, a function might be what you need.
    # Note that here, I'm using extras: "n" and "text"
    # The angle brackets <> mean I'm using an extra, and the square brackets [] mean that I don't have to speak that word; it's optional.
    # Advice: if you use an optional extra, like I am with "text", you should set a default value in the defaults section down below.
    # To trigger the following command, you would have to say the word "function" followed by a number between 1 and 1000.
    '[use] function <n> [<text>]':    Function(my_function, extra={'n', 'text'}),

    # Sometimes it's easier to have things as a list in a command, as a choice where different items do different things.
    # That's what <choice> does: the Choice element defined in `extras` lets you define that list. If you dictate `i choose custom grid`, then `CustomGrid` will be printed as text.
    # Items in the list are pairs, e.g. `{"custom grid": "CustomGrid"}`: the first item of a pair is the spoken command "custom grid" and the second the `CustomGrid` output text action.
    "i choose <choice>":              Text("%(choice)s"),

    }

    extras = [
              IntegerRef("n", 1, 1000),
              Dictation("text"),
              Choice("choice",
                    {
                    "alarm": "alarm",
                    "custom grid": "CustomGrid", 
                    "element": "e"
                    }),
                ]
    defaults = {
                "n": 1,
                "text": "",
            }

# Create the grammar and the context under which it'll be active, for example Notepad:
# context = AppContext(executable="notepad")
# grammar = Grammar("notepad example", context=context)

# Add the command rule to the grammar and load it.
# grammar.add_rule(MainRule())
# grammar.load()

# alternatively a global grammar that works everywhere
grammar = Grammar('sample')
grammar.add_rule(MainRule())
grammar.load()

Want to tweak the grammar above or try to build your own? Try editing the example and experimenting with your own actions.

How each engine is implemented can be seen here: https://github.com/dictation-toolbox/dragonfly/tree/master/dragonfly/engines

ideasman42 commented 2 years ago

Hey @LexiconCode, while this sounds interesting, what you're talking about seems like it could be a separate project (a small Python command line tool that depends on dragonfly2).

I have some bias towards keeping nerd-dictation simple and hackable; while better support for commands would be nice, it's not a priority for me.

If nerd-dictation becomes a dragonfly wrapper, it means that we're tied to its model of interpreting speech, moving away from simply letting the user process text with their own Python script - however they like. This will have advantages of course, but it does mean we would need to buy into dragonfly's methods of text processing.

IMHO there is room for multiple command line speech-to-text tools; a separate command line tool that wraps dragonfly is a legitimate tool that people could use alongside (or instead of) nerd-dictation. I don't see that this necessarily needs to be integrated into a single tool.

The reason I started this project is that every alternative I tried was complicated and difficult to get into (in my experience at least), so I'd rather keep nerd-dictation relatively simple and let other projects handle additional complexity.


Note that the VOSK SDK uses Kaldi internally, so there might not be any gain in supporting VOSK directly.

asamwow commented 2 years ago

How about something simple like #27?

omlins commented 2 years ago

@asamwow: for me it also has high priority to proceed with this issue now. I would very much like to get it working nicely before the end of the year... That said, the logic of my WIP PR works fine (I have tested it extensively). Of course, we can then think about aesthetics and organize things a bit differently in the end, merging our ideas. However, the fundamental thing that we are missing in order to proceed is the reset mechanism.

To come back to the reset: @ideasman42, can you make the necessary changes to enable the reset mechanism you proposed above?

Yes, although I'm not sure when exactly I'll be able to get around to it.

@ideasman42: would you be able to do that in the next few weeks, or else could you outline how it should be done so that @asamwow and I can draft a PR? I would appreciate that really a lot!!

asamwow commented 2 years ago

@asamwow and I can draft a PR? I would appreciate that really a lot!!

@ideasman42 Don't know what you are talking about, my PR is done LMAO.

In my personal config file, I was able to program the following usage; it works great.

# Usage:
#
# [<single commands>] [<text modifier>] <prose> [space,stop]
# [<single commands>] <multi-command>
#
# [<single commands>] : Chain any number of single commands
# [<text modifier>]   : Modify prose (caps, snake, camel...)
# <prose>             : Dictate any textual phrase
# <multi-command>     : Multi-word command, cannot be chained
# [space,stop]        : Finish input with space or enter key
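
For example, the <text modifier> stage could be implemented along these lines; this is only a sketch, with the modifier names taken from the usage above and the function itself hypothetical:

# Sketch of the <text modifier> stage (caps, snake, camel) from the usage above:
MODIFIERS = {
    "caps": lambda words: " ".join(w.capitalize() for w in words),
    "snake": lambda words: "_".join(words),
    "camel": lambda words: words[0] + "".join(w.capitalize() for w in words[1:]),
}

def apply_modifier(name, prose):
    words = prose.lower().split()
    if name in MODIFIERS and words:
        return MODIFIERS[name](words)
    return prose  # unknown modifier or empty prose: leave the text unchanged

# apply_modifier("snake", "my variable name")  -> "my_variable_name"
# apply_modifier("camel", "my variable name")  -> "myVariableName"
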
omlins commented 2 years ago

@ideasman42 Don't know what you are talking about, my PR is done LMAO.

@asamwow: we should have a way to trigger a reset of the statistical model. Otherwise, you can surely have something that "works" in the sense that it will do something, but the statistical model will not work at its best when it considers independent commands to be normal connected speech, while in reality the commands are independent "speech". I don't see such a reset of the statistical model anywhere in your proposals. Please point me to it if you have done that. :)

asamwow commented 2 years ago

Hi @omlins, thank you for the feedback. Sincerely appreciated. I guess it might not be finished after all, although I don't completely understand the specifics. If a command is parsed ("next line" for example), the command is completely backspace'd, and then it's up to the user's config to take it from there. Any idea how to improve the implementation?

omlins commented 2 years ago

@asamwow, @patricksebastien: I have made a large effort to address this issue seriously. Looking through the source code of nerd-dictation, I figured out that it currently does not do anything that would help with secure, low-latency speech-to-command translation (besides, the few comments of @ideasman42 on this subject since I opened the issue half a year ago have shown that he is not interested in development in that direction - which is totally fine: his repo, his vision!). As a result, I have created JustSayIt, which enables secure, low-latency speech-to-command translation and is usable as software or as an API. It implements a novel algorithm for high-performance, context-dependent recognition of spoken commands. Here you can find out more: https://github.com/omlins/JustSayIt.jl

ideasman42 commented 1 year ago

While not exactly command support, the ability to limit words goes a long way towards this functionality; see: 1d0f1fd7f5eecaa61c8f71e045c5480f771b8f75