talonhub / community

Voice command set for Talon, community-supported.
MIT License

Talon microphone check GUI #690

Open splondike opened 2 years ago

splondike commented 2 years ago

Users (especially new users) often have issues with recognition accuracy caused by the configuration of their microphone. For example:

  1. There is a lot of background noise and so Talon takes a long time to detect speech has ended, resulting in high latency for their commands.
  2. They are speaking too slowly for their speech.timeout setting, causing their utterance to be segmented incorrectly.
  3. They have the gain too low or too high (causing their voice to be below the noise threshold or clipped and distorted).

I suggest we build a UI into Talon to help users self-diagnose these issues. It would probably make sense to build this into Talon core in the long term, but in the meantime we should be able to get a similar effect with the 'userspace' APIs. My proposal is as follows.

  1. Register a new mode called 'microphone_check'. A mode is used because we want full control over the available commands.
  2. The command 'microphone check' from command mode enables 'microphone_check' mode and pops open a GUI which asks the user to speak a longer given phrase, e.g. 'this is a microphone check' (this should take less than 3 seconds to say). The GUI also has a 10 second countdown timer displayed.
  3. After the 10 seconds have elapsed, the GUI plays back all the recorded audio from the user so they can listen to what Talon hears. After playback concludes the user is swapped back to command mode automatically. All microphone_check voice commands are disabled during playback. The GUI also keeps displaying its results after command mode has been re-entered.

The mode would use a .talon file like this:

mode: user.microphone_check
tag: user.microphone_check_recording
-
this is a microphone check: user.microphone_check_register_result()

this: user.microphone_check_register_breakup()
this is: user.microphone_check_register_breakup()
this is a: user.microphone_check_register_breakup()
this is a microphone: user.microphone_check_register_breakup()

^<phrase>: user.microphone_check_register_misrecognition()

It would also have a .py file registered for the 'pre:phrase' callback so we can get the 'audio_ms' statistic (the length of the audio segmented by the VAD I think). I think we can also extract the raw audio from this (for playback to the user).

The results would be calculated as follows:

  1. If 'audio_ms' is more than, say, 5 seconds then the VAD is probably not segmenting quickly enough and so we say it's likely there's a lot of background noise. This YouTube video (at higher volume) seems to confuse the VAD a bit on my machine, for example: https://www.youtube.com/watch?v=Nbbhz6ovRUc
  2. If user.microphone_check_register_result() gets called and also user.microphone_check_register_breakup()/user.microphone_check_register_misrecognition() is called, then we also suggest background noise.
  3. If user.microphone_check_register_breakup() is called but not user.microphone_check_register_result() then we suggest they're speaking too slowly.
  4. If nothing is called then we say we didn't hear anything, maybe your gain is too low.
  5. If only user.microphone_check_register_misrecognition() is called then we ask them to try again or to listen to their recordings and use them to adjust their microphone placement/gain to make it sound more clear (less background noise, no clipping, good voice volume).
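
As a rough illustration, the five rules above could be collapsed into one pure function. This is only a sketch: `CheckEvents`, `diagnose`, the returned labels and the `MAX_AUDIO_MS` constant are hypothetical names, not Talon APIs, and the rule ordering (long segment first, then "nothing heard") is an assumption about how ties should be resolved.

```python
from dataclasses import dataclass

# Assumed threshold, taken from the "more than say 5 seconds" suggestion.
MAX_AUDIO_MS = 5000


@dataclass
class CheckEvents:
    """Hypothetical record of what happened during the check window."""
    result: bool = False            # full phrase was recognised
    breakup: bool = False           # a partial prefix was recognised
    misrecognition: bool = False    # some other phrase was recognised
    longest_audio_ms: float = 0.0   # longest segment length seen


def diagnose(ev: CheckEvents) -> str:
    # Rule 1: an overly long segment suggests the VAD is slow to cut off.
    if ev.longest_audio_ms > MAX_AUDIO_MS:
        return "background_noise"
    # Rule 4: no action was called at all.
    if not (ev.result or ev.breakup or ev.misrecognition):
        return "nothing_heard"
    # Rule 2: the phrase was heard, but so was something else.
    if ev.result and (ev.breakup or ev.misrecognition):
        return "background_noise"
    # Rule 3: only part of the phrase arrived in one segment.
    if ev.breakup:
        return "speaking_too_slowly"
    # Rule 5: only misrecognitions were registered.
    if ev.misrecognition:
        return "retry_or_adjust"
    # Only the full phrase was recognised.
    return "ok"
```

Keeping the decision logic free of Talon imports like this would let it be unit tested outside Talon; the registered actions would only need to set the flags.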

@lunixbochs Does this sound like a worthwhile idea? Also, regarding APIs, is audio_ms the right statistic to use for utterance length, and is there a way of getting all audio recognised by Talon within the 10 second window for playback?

pokey commented 2 years ago

Interesting idea, tho if I've understood correctly, I'd think you'd back-anchor rather than front-anchor your phrase, eg <phrase>$

splondike commented 2 years ago

Maybe it should even be unanchored, I've not entirely thought it through. My thinking was the VAD produces a list of segments, and the ^ anchor matches the start of such a segment. So the pipeline might be "this is ... a microphone check" -> VAD -> ["this is", "a microphone check"]. That in turn would result in user.microphone_check_register_breakup() and then user.microphone_check_register_misrecognition().

I'm not sure the logic I wrote in the OP was correct, but I think we'd want it to behave like this (where ellipsis is a period of not speaking which causes a VAD segmentation):

Perhaps I want to have a 'second half' for each of my partial phrase matchers, so ("this", "is a microphone check"), ("this is", "a microphone check") etc.
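
A minimal sketch of how those (first half, second half) pairs could be generated from the check phrase, rather than written out by hand. `split_pairs` and `is_breakup` are hypothetical helper names, and the list of segments is assumed to be whatever text the recogniser produced for each VAD segment, in order.

```python
def split_pairs(phrase: str) -> list[tuple[str, str]]:
    """Return every way to cut the phrase into a non-empty prefix and suffix."""
    words = phrase.split()
    return [
        (" ".join(words[:i]), " ".join(words[i:]))
        for i in range(1, len(words))
    ]


def is_breakup(segments: list[str], phrase: str) -> bool:
    """True if two consecutive segments together form the full phrase."""
    pairs = set(split_pairs(phrase))
    return any(
        (first, second) in pairs
        for first, second in zip(segments, segments[1:])
    )


pairs = split_pairs("this is a microphone check")
# pairs[0] is ("this", "is a microphone check")
# pairs[1] is ("this is", "a microphone check")
```

This only covers a single break in the phrase; a user who pauses twice would need the same idea applied to three or more segments.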

I'd say there'd be a bit of fiddling during implementation. At the moment I'm more interested in if the idea seems plausible and worth doing.

bra1nDump commented 2 years ago

This will be extremely helpful. I work out of the office on certain days and the environment is definitely noisier. I would likely make use of this sanity check at the beginning of each in-office workday. Sometimes I just find myself second-guessing why Talon does not understand me instead of actually recording myself and listening back.

pokey commented 2 years ago

See also https://github.com/TalonCommunity/Wiki/issues/147

pokey commented 2 years ago

Possibly out of scope for this issue but it would also potentially be useful to have a visual indication if your microphone volume is too low. This might be something that would be a better fit for the talon HUD cc/ @chaosparrot

We'd probably also need some support from Talon to figure out whether the microphone level is too low cc/ @lunixbochs
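
For illustration only, here is roughly what such a level check might look like if raw samples were available. It assumes mono float samples normalised to [-1.0, 1.0]; `level_report` and all three thresholds are made-up values for the sketch, not anything Talon exposes or recommends.

```python
import math


def level_report(samples, low_rms_db=-40.0, clip_level=0.99, clip_fraction=0.001):
    """Classify a buffer of float samples as too_low, too_high, ok or no_audio."""
    if not samples:
        return "no_audio"
    # Root-mean-square level, converted to decibels relative to full scale.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    rms_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
    # Fraction of samples at or near full scale, a rough sign of clipping.
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    if clipped / len(samples) > clip_fraction:
        return "too_high"
    if rms_db < low_rms_db:
        return "too_low"
    return "ok"
```

A HUD indicator could call something like this on each recent buffer and show a warning when the result is not "ok".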