qwertie opened 7 months ago
Hi there,
I can't be completely sure what the issue is, but when similar issues have occurred, it's often been related to an improperly set up PATH for the virtual environment. In the `venv/Scripts/activate.bat` file, check that the `VIRTUAL_ENV` variable is set correctly. It can be incorrect if you have special characters in the path, or if you've moved the folder away from where you originally created the virtual environment.
Please let me know if this appears to be correct and we can try to troubleshoot further. Thanks! :)
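For reference, the line to check inside `venv\Scripts\activate.bat` usually looks something like this (the path shown here is just an example, not your actual path):

```bat
rem VIRTUAL_ENV should be the absolute path of the venv folder itself
set VIRTUAL_ENV=C:\path\to\your\project\venv
```

If you've moved the project since creating the venv, this path still points at the old location, which breaks activation.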
Hey, thanks for nudging me to debug the issue! The problem was that if you're running bash on Windows, you have to follow the "Linux" instruction of running `source ./venv/Scripts/activate`, rather than just reformatting the Windows command `venv\Scripts\activate` to bash/PowerShell format as I did (`./venv/Scripts/activate`). If done correctly, the command prompt should write `(venv)` before the `$`.
Sometimes I wonder why Python can't be a "normal" programming language that lets users simply run an app...
It stopped working after pulling the latest version because of "No module named 'pynput'", but I just had to run `pip install -r requirements.txt` again, and then `python run.py` worked.
Can I recommend making `"use_api": false` the default? Right now, users need to do extra setup step(s) regardless of whether they use the API or not. If `"use_api": false` is the default, then some of your users won't need extra steps. You could also just default to a local model automatically if no API key is configured.
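Concretely, I'm picturing shipped defaults something like this (only `use_api` is the point here; the other keys are just guesses at your config's shape):

```json
{
  "use_api": false,
  "api_key": null,
  "local_model": "base"
}
```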
Btw, since you just changed the default keystroke: I prefer the original keystroke Ctrl+Alt+Space, because Ctrl+Shift+Space is already used in Visual Studio and Visual Studio Code for Show Parameter Information. `"activation_key": "win+z"` seems like a good option too (but not Win+V, Win+T, or Win+W, as in voice/transcribe/whisper, because they have predefined meanings).
Now that it's working it's fun to think about features that would make a really great transcription app:
Edit: It would also be great to have a list of case-insensitive replacements / regexes, like:
"smiley face emoji": "🙂",
"sad face emoji": "😢",
"exclamation mark": "!",
"at sign": "@",
"dollar sign": "$",
"percent sign": "%",
"hat sign": "^", "hat mark": "^",Smiley face emoji sad face emoji
"ampersand": "&",
"asterisk": "*",
"colon mark": ":",
"semicolon": ";",
"dot mark": ".",
"question mark": "?",
"apostrophe": "'",
"quote mark": "\"",
"backslash": "\\", "back slash": "\\",
"pipe mark": "|",
"slash mark": "/",
"greater than sign": ">",
"less than sign": "<",Asterisk and Percent at Sign Percent Sign
"greater than or equal to": "≥",
"less than or equal to": "≤",
"backtick": "`",
"Till they mark": "~", "Till the mark": "~", "Till then, Mark": "~",
"plus equals": "+=", "minus equals": "-=",
"times equals": "*=", "divide equals": "/=",
"equals equals": "==", "not equals": "!=",
"question mark question mark equals": "??=",
"forward arrow": "=>", "triple equals": "===",
"plus minus": "±",
"x squared": "x²",
"open parenthesis": "(",
"close parenthesis": ")",
"open curly": "{", "close curly": "{", "open Carly": "{", "close Carly": "}",
"Open Purin": "(", "Open Perrin": "(", "open Perenn": "(", "Open her end": "(",
"Open for Ren": "(", "open Karen": "(", "Open for Ren": "(",
"Close Purin": ")", "Close Perrin": ")", "close Perenn": ")", "close her end": ")",
"blows for Ren": ")", "expose Karen": ")", "close Perenn": ")", "blows perenn": ")",
I noticed that it doesn't understand "paren" or "tilde", so I just wrote down several of the recognitions it actually produced.
And if I'm talking about what the doctor did to my colon, I don't want `:`, so I figure a special phrase like "colon mark" is a better default for most of these. Alternatively, there could be a second hotkey to transcribe in "programming mode" which could perform additional replacements. I imagine saying something like "pascalcase new widget with prototype paren curly size colon ten close curly close paren semicolon newline" and getting output like `NewWidgetWithPrototype({ size: 10 });\n`.
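To illustrate, the case-insensitive replacement pass could be as simple as this sketch (the table and function are hypothetical examples, not the app's actual code):

```python
import re

# Hypothetical replacement table; see the longer list above for more ideas.
REPLACEMENTS = {
    "open paren": "(",
    "close paren": ")",
    "colon mark": ":",
    "question mark": "?",
}

def apply_replacements(text: str, table: dict[str, str]) -> str:
    """Apply case-insensitive phrase replacements, longest phrases first,
    so that e.g. "question mark" wins over any shorter overlapping key."""
    for phrase in sorted(table, key=len, reverse=True):
        text = re.sub(re.escape(phrase), table[phrase], text, flags=re.IGNORECASE)
    return text

print(apply_replacements("Open paren size colon mark 10 close paren", REPLACEMENTS))
# -> ( size : 10 )
```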
Hi @qwertie, thanks so much for your troubleshooting and your comments! They're very helpful! :)
I originally made this app just for my own personal use after getting extremely frustrated with the built-in Windows speech-to-text and wasn't expecting it to get much attention. It's pretty cool that others are using it and finding it useful too! But I'll be upfront about the fact that I'm not going to be dedicating a large amount of time to maintaining this. I'll probably work on some small features in the immediate future, but I don't currently have the availability to work on a lot. If someone else does though, I'd be happy to start paying attention to PRs and other issues/comments!
To address your points:

1. I'll make `"use_api": false` the default. I'll make that change in my next commit.
2. There's already a language setting in `src\config`! It's under both `api_options` and `local_model_options`. It's in ISO-639-1 format, so for English, you can set it to `en`.
3. There's an `initial_prompt` configuration setting! Here's some info on how it works. I'll add this link to the README in my next commit too.

Thanks again! I appreciate your excitement for the app :)
Awesome! Thanks for your response! It was a little surprising that the Whisper people didn't make an app like this themselves, so thanks for stepping up! Sadly though, I'm way behind on my own open-source work so I can't really afford to do an extra project myself.
"activation_key": "control+alt+space, control+command+space, win+z"
(ignoring keybindings that fail, of course)initial_prompt
) would be that the model should adapt to new contexts automatically. However, I see there is a "condition_on_previous_text" option that sounds like it ought to have a similar effect? Still, conditioning on previous audio might help it recognize people with unusual accents. I wonder if you would get this sort of context-adaptive behavior automatically by operating in pipelining/streaming mode. I know I saw someone using Whisper in a pipelining/streaming mode where it produced output before the input was complete ― this was nontrivial, because if I remember correctly, Whisper could change its mind, i.e. back up and change previously-written words. Regardless, the basic idea is to not reset the model between inputs... you send in the first input, then you call the flush method (if there's no such thing as a flush method, you instead send it a buffer of 2 seconds of artificial silence, to convince the model to send back the final voice recognition output). Then when the user starts a new phrase, the model should still have some memory of the previous audio encoded in its own state, which should help it recognize the next input more reliably. openai.Audio.transcribe
accepts though. Speaking of which, it's weird that I can't find the API documentation with Google... I eventually found this but it has a link to an "API reference" which (i) is for an HTTP API, not a Python API, and (ii) doesn't mention the condition_on_previous_text
or initial_prompt
options. "api_key": null, // if null, uses local model
"api_model": "whisper-1",
"local_model": "base",
"language": null,
"temperature": 0.0,
"initial_prompt": null,
"condition_on_previous_text": true,
"verbose": false,
"sound_device": null,
"sample_rate": 16000,
"silence_duration": 900,
"writing_key_press_delay": 0.005,
"remove_trailing_period": false,
"add_trailing_space": true,
"remove_capitalization": false,
"print_to_terminal": true,
"replacements": { // available in all modes
"smiley face emoji": "🙂",
"sad face emoji": "😢",
},
"modes": [{
"name": "English mode",
"activation_key": "control+alt+space",
"language": "en",
"replacements": {
"asterisk mark": "*",
"colon mark": ":",
"semicolon mark": ";",
"equals sign": "=",
"greater than sign": ">",
"less than sign": "<",
"dot mark": ".",
"question mark": "?",
"slash mark": "/",
"backtick mark": "`",
"exclamation mark": "!",
"dollar sign": "$",
"percent sign": "%",
"caret sign": "^",
"ampersand sign": "&",
"vertical bar mark": "|",
"apostrophe mark": "'", "single quotation mark": "'",
"double quotation marks": "\"",
"backslash mark": "\\", "back slash mark": "\\",
"open parenthesis": "(",
"close parenthesis": ")",
"open curly brace": "{",
"close curly brace": "{",
"carriage return": "\n",
}
}, {
"name": "Spanish mode",
"activation_key": null, // disabled
"language": "es"
}, {
"name": "Programmer mode",
"activation_key": "win+shift+P",
"add_trailing_space": false,
// btw, the base model starts to understand the spoken words "paren" and "tilde"
// if they're in the initial_prompt, which is mind-bending, I mean if it wasn't
// trained on the words, how does it know what pronunciation to expect?
"initial_prompt": "tilde newline class Foo brace newline public void main paren close paren curly",
"replacements": {
"space": " ",
"ampersand": "&", "bitwise and": "&", "and and": "&&",
"bitwise or": "|", "vertical bar": "|", "or or": "||"
"asterisk": "*",
"colon": ":",
"semicolon": ";",
"equals": "=",
"greater than": ">",
"less than": "<",
"dot": ".",
"question mark": "?", "question mark dot": "?.",
"apostrophe": "'", "single quote": "'",
"quote": "\"", "double quote": "\"",
"backslash": "\\", "back slash": "\\",
"slash mark": "/",
"backtick": "`", "back tick": "`",
"tilde": "~",
"open paren": "(", "paren": "(",
"close paren": ")",
"open brace": "{", "open curly": "{", "brace": "{",
"close brace": "}", "close curly": "{",
"newline": "\n",
}
}]
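By the way, generating the 2-second artificial-silence buffer I mentioned earlier is trivial. Here's a sketch assuming 16 kHz, 16-bit mono PCM (those are guesses that happen to match `"sample_rate": 16000` above):

```python
# Two seconds of artificial silence as raw PCM, to coax the model into
# emitting its final output (assumes 16 kHz, 16-bit mono audio).
SAMPLE_RATE = 16000      # samples per second
BYTES_PER_SAMPLE = 2     # 16-bit audio
SECONDS = 2

silence = b"\x00" * (SAMPLE_RATE * BYTES_PER_SAMPLE * SECONDS)
print(len(silence))  # 64000 bytes
```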
Finally, it's great that it automatically downloads a model....
100%|#######################################| 139M/139M [00:08<00:00, 17.8MiB/s]
But this happened after recording. I assume it downloads when you call `load_model`, so I'm going to suggest calling `load_model` on startup, before there is any transcription, partly so that if it has to download anything, the user isn't confused about why nothing is happening (if they're not looking at the terminal) or why it's taking so long (if they are). Also, are you sure that `whisper.load_model` doesn't do anything if the model is already loaded? Reinitializing the model might hurt performance.
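If it does reinitialize each time, a cheap fix would be memoizing the loader. A sketch with a stand-in loader (`fake_load_model` is hypothetical, not the whisper API):

```python
from functools import lru_cache

def fake_load_model(name: str) -> object:
    # Stand-in for whisper.load_model: pretend this is slow / may download.
    print(f"loading {name}...")
    return object()

# Memoize so that calling again with the same model name is free.
@lru_cache(maxsize=None)
def get_model(name: str):
    return fake_load_model(name)

# Call once at startup, before any transcription, so any download
# happens while the user expects a wait:
startup_model = get_model("base")

# Later calls reuse the exact same object instead of reinitializing:
assert get_model("base") is startup_model
```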
Speaking of performance, I expect Whisper would support my NVIDIA GeForce GTX 1660, but it seems to be running on the CPU, so any model bigger than "base" is too slow. Any ideas? While I was looking for Whisper's API documentation, I stumbled upon Faster Whisper, though, so that's cool.
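I guess the first thing to check is whether PyTorch (which Whisper runs on) sees the GPU at all. Something like this should tell you (wrapped in a try so it degrades gracefully if torch isn't installed):

```python
def cuda_available() -> bool:
    """Report whether PyTorch can see a CUDA GPU; False if torch is missing."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()

print(cuda_available())  # False usually means a CPU-only torch build or missing CUDA drivers
```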
So I installed Python 3.11 (on Windows) just to run this app, and I set up the venv like this:
But I'm not experienced at Python, and it's apparently still running on Python 3.12?
Do you know what to do?