qwertie opened 7 months ago
Hi there,
I can't be completely sure what the issue is, but when similar issues have occurred, it's often been related to an improperly set up PATH for the virtual environment. In the `venv/Scripts/activate.bat` file, check that the `VIRTUAL_ENV` variable is set correctly. It can be incorrect if you have special characters in the path, or if you've moved the folder away from where you originally created the virtual environment.
Please let me know if this appears to be correct and we can try to troubleshoot further. Thanks! :)
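For reference, the line to check inside `venv\Scripts\activate.bat` usually looks something like this (the path shown here is just an example, not your actual path):

```bat
rem VIRTUAL_ENV should be the absolute path of the venv folder itself
set VIRTUAL_ENV=C:\path\to\your\project\venv
```

If you've moved the project since creating the venv, this path still points at the old location, which breaks activation.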
Hey, thanks for nudging me to debug the issue! The problem was that if you're running bash on Windows, you have to follow the "Linux" instruction of running `source ./venv/Scripts/activate`, rather than just reformatting the Windows command `venv\Scripts\activate` to bash/PowerShell format as I did (`./venv/Scripts/activate`). If done correctly, the command prompt should write `(venv)` before the `$`.
Sometimes I wonder why Python can't be a "normal" programming language that lets users simply run an app...
It stopped working after pulling the latest version because of "No module named 'pynput'", but I just had to run `pip install -r requirements.txt` again, and then `python run.py` worked.
Can I recommend making `"use_api": false` the default? Right now, users need to do extra setup step(s) regardless of whether they use the API or not. If `"use_api": false` is the default, then some of your users won't need extra steps. You could also just default to a local model automatically if no API key is configured.
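Concretely, I'm picturing shipped defaults something like this (only `use_api` is the point here; the other keys are just guesses at your config's shape):

```json
{
  "use_api": false,
  "api_key": null,
  "local_model": "base"
}
```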
Btw, since you just changed the default keystroke: I prefer the original keystroke Ctrl+Alt+Space, because Ctrl+Shift+Space is already used in Visual Studio and Visual Studio Code for Show Parameter Information. `"activation_key": "win+z"` seems like a good option too (but not Win+V, Win+T, or Win+W, as in voice/transcribe/whisper, because they have predefined meanings).
Now that it's working it's fun to think about features that would make a really great transcription app:
Edit: It would also be great to have a list of case-insensitive replacements / regexes, like:
"smiley face emoji": "🙂",
"sad face emoji": "😢",
"exclamation mark": "!",
"at sign": "@",
"dollar sign": "$",
"percent sign": "%",
"hat sign": "^", "hat mark": "^",Smiley face emoji sad face emoji
"ampersand": "&",
"asterisk": "*",
"colon mark": ":",
"semicolon": ";",
"dot mark": ".",
"question mark": "?",
"apostrophe": "'",
"quote mark": "\"",
"backslash": "\\", "back slash": "\\",
"pipe mark": "|",
"slash mark": "/",
"greater than sign": ">",
"less than sign": "<",Asterisk and Percent at Sign Percent Sign
"greater than or equal to": "≥",
"less than or equal to": "≤",
"backtick": "`",
"Till they mark": "~", "Till the mark": "~", "Till then, Mark": "~",
"plus equals": "+=", "minus equals": "-=",
"times equals": "*=", "divide equals": "/=",
"equals equals": "==", "not equals": "!=",
"question mark question mark equals": "??=",
"forward arrow": "=>", "triple equals": "===",
"plus minus": "±",
"x squared": "x²",
"open parenthesis": "(",
"close parenthesis": ")",
"open curly": "{", "close curly": "{", "open Carly": "{", "close Carly": "}",
"Open Purin": "(", "Open Perrin": "(", "open Perenn": "(", "Open her end": "(",
"Open for Ren": "(", "open Karen": "(", "Open for Ren": "(",
"Close Purin": ")", "Close Perrin": ")", "close Perenn": ")", "close her end": ")",
"blows for Ren": ")", "expose Karen": ")", "close Perenn": ")", "blows perenn": ")",
I noticed that it doesn't understand "paren" or "tilde", so I just wrote down several of the recognitions it actually produced.
And if I'm talking about what the doctor did to my colon, I don't want `:`, so I figure a special phrase like "colon mark" is a better default for most of these. Alternatively, there could be a second hotkey to transcribe in "programming mode" which could perform additional replacements. I imagine saying something like "pascalcase new widget with prototype paren curly size colon ten close curly close paren semicolon newline" and getting output like `NewWidgetWithPrototype({ size: 10 });\n`.
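To illustrate, the case-insensitive replacement pass could be as simple as this sketch (the table and function are hypothetical examples, not the app's actual code):

```python
import re

# Hypothetical replacement table; see the longer list above for more ideas.
REPLACEMENTS = {
    "open paren": "(",
    "close paren": ")",
    "colon mark": ":",
    "question mark": "?",
}

def apply_replacements(text: str, table: dict[str, str]) -> str:
    """Apply case-insensitive phrase replacements, longest phrases first,
    so that e.g. "question mark" wins over any shorter overlapping key."""
    for phrase in sorted(table, key=len, reverse=True):
        text = re.sub(re.escape(phrase), table[phrase], text, flags=re.IGNORECASE)
    return text

print(apply_replacements("Open paren size colon mark 10 close paren", REPLACEMENTS))
# -> ( size : 10 )
```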
Hi @qwertie, thanks so much for your troubleshooting and your comments! They're very helpful! :)
I originally made this app just for my own personal use after getting extremely frustrated with the built-in Windows speech-to-text and wasn't expecting it to get much attention. It's pretty cool that others are using it and finding it useful too! But I'll be upfront about the fact that I'm not going to be dedicating a large amount of time to maintaining this. I'll probably work on some small features in the immediate future, but I don't currently have the availability to work on a lot. If someone else does though, I'd be happy to start paying attention to PRs and other issues/comments!
To address your points:

1. I'll make `"use_api": false` the default. I'll make that change in my next commit.
2. There's already a language setting in `src\config`! It's under both `api_options` and `local_model_options`. It's in ISO-639-1 format, so for English, you can set it to `en`.
3. There's an `initial_prompt` configuration setting! Here's some info on how it works. I'll add this link to the README in my next commit too.

Thanks again! I appreciate your excitement for the app :)
Awesome! Thanks for your response! It was a little surprising that the Whisper people didn't make an app like this themselves, so thanks for stepping up! Sadly though, I'm way behind on my own open-source work so I can't really afford to do an extra project myself.
"activation_key": "control+alt+space, control+command+space, win+z"
(ignoring keybindings that fail, of course)initial_prompt
) would be that the model should adapt to new contexts automatically. However, I see there is a "condition_on_previous_text" option that sounds like it ought to have a similar effect? Still, conditioning on previous audio might help it recognize people with unusual accents. I wonder if you would get this sort of context-adaptive behavior automatically by operating in pipelining/streaming mode. I know I saw someone using Whisper in a pipelining/streaming mode where it produced output before the input was complete ― this was nontrivial, because if I remember correctly, Whisper could change its mind, i.e. back up and change previously-written words. Regardless, the basic idea is to not reset the model between inputs... you send in the first input, then you call the flush method (if there's no such thing as a flush method, you instead send it a buffer of 2 seconds of artificial silence, to convince the model to send back the final voice recognition output). Then when the user starts a new phrase, the model should still have some memory of the previous audio encoded in its own state, which should help it recognize the next input more reliably. openai.Audio.transcribe
accepts though. Speaking of which, it's weird that I can't find the API documentation with Google... I eventually found this but it has a link to an "API reference" which (i) is for an HTTP API, not a Python API, and (ii) doesn't mention the condition_on_previous_text
or initial_prompt
options. "api_key": null, // if null, uses local model
"api_model": "whisper-1",
"local_model": "base",
"language": null,
"temperature": 0.0,
"initial_prompt": null,
"condition_on_previous_text": true,
"verbose": false,
"sound_device": null,
"sample_rate": 16000,
"silence_duration": 900,
"writing_key_press_delay": 0.005,
"remove_trailing_period": false,
"add_trailing_space": true,
"remove_capitalization": false,
"print_to_terminal": true,
"replacements": { // available in all modes
"smiley face emoji": "🙂",
"sad face emoji": "😢",
},
"modes": [{
"name": "English mode",
"activation_key": "control+alt+space",
"language": "en",
"replacements": {
"asterisk mark": "*",
"colon mark": ":",
"semicolon mark": ";",
"equals sign": "=",
"greater than sign": ">",
"less than sign": "<",
"dot mark": ".",
"question mark": "?",
"slash mark": "/",
"backtick mark": "`",
"exclamation mark": "!",
"dollar sign": "$",
"percent sign": "%",
"caret sign": "^",
"ampersand sign": "&",
"vertical bar mark": "|",
"apostrophe mark": "'", "single quotation mark": "'",
"double quotation marks": "\"",
"backslash mark": "\\", "back slash mark": "\\",
"open parenthesis": "(",
"close parenthesis": ")",
"open curly brace": "{",
"close curly brace": "{",
"carriage return": "\n",
}
}, {
"name": "Spanish mode",
"activation_key": null, // disabled
"language": "es"
}, {
"name": "Programmer mode",
"activation_key": "win+shift+P",
"add_trailing_space": false,
// btw, the base model starts to understand the spoken words "paren" and "tilde"
// if they're in the initial_prompt, which is mind-bending, I mean if it wasn't
// trained on the words, how does it know what pronunciation to expect?
"initial_prompt": "tilde newline class Foo brace newline public void main paren close paren curly",
"replacements": {
"space": " ",
"ampersand": "&", "bitwise and": "&", "and and": "&&",
"bitwise or": "|", "vertical bar": "|", "or or": "||"
"asterisk": "*",
"colon": ":",
"semicolon": ";",
"equals": "=",
"greater than": ">",
"less than": "<",
"dot": ".",
"question mark": "?", "question mark dot": "?.",
"apostrophe": "'", "single quote": "'",
"quote": "\"", "double quote": "\"",
"backslash": "\\", "back slash": "\\",
"slash mark": "/",
"backtick": "`", "back tick": "`",
"tilde": "~",
"open paren": "(", "paren": "(",
"close paren": ")",
"open brace": "{", "open curly": "{", "brace": "{",
"close brace": "}", "close curly": "{",
"newline": "\n",
}
}]
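By the way, generating the 2-second artificial-silence buffer I mentioned earlier is trivial. Here's a sketch assuming 16 kHz, 16-bit mono PCM (those are guesses that happen to match `"sample_rate": 16000` above):

```python
# Two seconds of artificial silence as raw PCM, to coax the model into
# emitting its final output (assumes 16 kHz, 16-bit mono audio).
SAMPLE_RATE = 16000      # samples per second
BYTES_PER_SAMPLE = 2     # 16-bit audio
SECONDS = 2

silence = b"\x00" * (SAMPLE_RATE * BYTES_PER_SAMPLE * SECONDS)
print(len(silence))  # 64000 bytes
```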
Finally, it's great that it automatically downloads a model....
100%|#######################################| 139M/139M [00:08<00:00, 17.8MiB/s]
But this happened after recording. I assume it downloads when you call `load_model`, so I'm going to suggest calling `load_model` on startup, before there is any transcription, partly so that if it has to download anything, the user isn't confused about why nothing is happening (if they're not looking at the terminal) or why it's taking so long (if they are). Also, are you sure that `whisper.load_model` doesn't do anything if the model is already loaded? Reinitializing the model might hurt performance.
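If it does reinitialize each time, a cheap fix would be memoizing the loader. A sketch with a stand-in loader (`fake_load_model` is hypothetical, not the whisper API):

```python
from functools import lru_cache

def fake_load_model(name: str) -> object:
    # Stand-in for whisper.load_model: pretend this is slow / may download.
    print(f"loading {name}...")
    return object()

# Memoize so that calling again with the same model name is free.
@lru_cache(maxsize=None)
def get_model(name: str):
    return fake_load_model(name)

# Call once at startup, before any transcription, so any download
# happens while the user expects a wait:
startup_model = get_model("base")

# Later calls reuse the exact same object instead of reinitializing:
assert get_model("base") is startup_model
```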
Speaking of performance, I expect Whisper would support my NVIDIA GeForce GTX 1660, but it seems to be running on the CPU, so any model bigger than "base" is too slow. Any ideas? While I was looking for Whisper's API documentation, I stumbled upon Faster Whisper, though, so that's cool.
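I guess the first thing to check is whether PyTorch (which Whisper runs on) sees the GPU at all. Something like this should tell you (wrapped in a try so it degrades gracefully if torch isn't installed):

```python
def cuda_available() -> bool:
    """Report whether PyTorch can see a CUDA GPU; False if torch is missing."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()

print(cuda_available())  # False usually means a CPU-only torch build or missing CUDA drivers
```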
So I installed Python 3.11 (on Windows) just to run this app, and I set up the venv like this:
But I'm not experienced at Python, and it's apparently still running on Python 3.12?
Do you know what to do?