savbell / whisper-writer

💬📝 A small dictation app using OpenAI's Whisper speech recognition model.
GNU General Public License v3.0

Cannot install on Python version 3.12 #15

Open qwertie opened 7 months ago

qwertie commented 7 months ago

So I installed Python 3.11 (on Windows) just to run this app, and I set up the venv like this:

$ py -3.11 -m venv venv

$ ./venv/Scripts/activate

But I'm not experienced at python and it's apparently still running on python 3.12?

$ pip install -r requirements.txt

Collecting numba==0.57.0 (from -r requirements.txt (line 26))
  Using cached numba-0.57.0.tar.gz (2.5 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'
  error: subprocess-exited-with-error

  Getting requirements to build wheel did not run successfully.
  exit code: 1

    File "<string>", line 48, in _guard_py_ver
  RuntimeError: Cannot install on Python version 3.12.1; only versions >=3.8,<3.12 are supported.

$ python --version
Python 3.12.1

Do you know what to do?

savbell commented 6 months ago

Hi there,

I can't be completely sure what the issue is, but when similar issues have occurred, it's often been related to an improperly set up PATH for the virtual environment. In the venv/Scripts/activate.bat file, check that the VIRTUAL_ENV variable is set correctly. It can be incorrect if you have special characters in the path, or if you've moved the folder away from where you originally created the virtual environment.

Please let me know if this appears to be correct and we can try and troubleshoot further. Thanks! :)

qwertie commented 6 months ago

Hey, thanks for nudging me to debug the issue! The problem was that if you're running bash on Windows, then you have to follow the "Linux" instruction of running source ./venv/Scripts/activate, rather than just reformatting the Windows command venv\Scripts\activate to bash/Powershell format as I did (./venv/Scripts/activate). If done correctly, the command prompt should write (venv) before the $.
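A quick way to confirm which interpreter is actually running, independent of the shell prompt, is to ask Python itself. The following is a minimal sketch; the function names are made up for illustration and are not part of whisper-writer. It relies on the standard-library fact that `sys.prefix` differs from `sys.base_prefix` inside a venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the venv directory, while
    # sys.base_prefix points at the Python install it was created from.
    return sys.prefix != sys.base_prefix


def describe_interpreter() -> str:
    """Summarize the interpreter version, venv status, and executable path."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    where = "venv" if in_virtualenv() else "system"
    return f"Python {version} ({where}) at {sys.executable}"


if __name__ == "__main__":
    print(describe_interpreter())
```

If this reports a system interpreter at 3.12 after "activating", the activation script was most likely executed in a subshell rather than sourced.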

Sometimes I wonder why Python can't be a "normal" programming language that lets users simply run an app...

It stopped working after pulling the latest version because "No module named 'pynput'", but I just had to run pip install -r requirements.txt again, and then python run.py worked.

Can I recommend making "use_api": false the default? Right now, users need to do extra setup step(s) regardless of whether they use the API or not. If "use_api": false is the default then some of your users won't need extra steps. You could also just default to a local model automatically if no API key is configured.
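The "default to a local model if no API key is configured" idea could look roughly like this. It is only a sketch: `should_use_api` is a hypothetical helper, the `"use_api"` and `"api_key"` keys follow the names used in this thread, and falling back to the `OPENAI_API_KEY` environment variable is an assumption about where a key might live.

```python
import os
from typing import Mapping, Optional


def should_use_api(config: Mapping[str, object],
                   env: Optional[Mapping[str, str]] = None) -> bool:
    """Decide between the hosted API and a local model."""
    if env is None:
        env = os.environ
    # An explicit "use_api" setting always wins.
    explicit = config.get("use_api")
    if explicit is not None:
        return bool(explicit)
    # Otherwise use the API only when a key is actually available
    # (assumed locations: the config file or OPENAI_API_KEY).
    api_key = config.get("api_key") or env.get("OPENAI_API_KEY")
    return bool(api_key)
```

With this rule a fresh install with no key configured would run locally without any extra setup step.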

qwertie commented 6 months ago

Btw, since you just changed the default keystroke: I prefer the original keystroke, Ctrl+Alt+Space, because Ctrl+Shift+Space is already used in Visual Studio and Visual Studio Code to Show Parameter Information.

"activation_key": "win+z" seems like a good option too (but not win+V or win+T or win+W, as in voice/transcribe/whisper, because they have predefined meanings).

Now that it's working it's fun to think about features that would make a really great transcription app:

Edit: Also would be great to have a list of case-insensitive replacements / regexes, like

"smiley face emoji": "🙂",
"sad face emoji": "😢",
"exclamation mark": "!",
"at sign": "@",
"dollar sign": "$",
"percent sign": "%",
"hat sign": "^", "hat mark": "^",
"ampersand": "&",
"asterisk": "*",
"colon mark": ":",
"semicolon": ";",
"dot mark": ".",
"question mark": "?",
"apostrophe": "'",
"quote mark": "\"",
"backslash": "\\", "back slash": "\\",
"pipe mark": "|",
"slash mark": "/",
"greater than sign": ">",
"less than sign": "<",
"greater than or equal to": "≥",
"less than or equal to": "≤",
"backtick": "`",
"Till they mark": "~", "Till the mark": "~", "Till then, Mark": "~",
"plus equals": "+=", "minus equals": "-=",
"times equals": "*=", "divide equals": "/=",
"equals equals": "==", "not equals": "!=",
"question mark question mark equals": "??=",
"forward arrow": "=>", "triple equals": "===",
"plus minus": "±",
"x squared": "x²",
"open parenthesis": "(",
"close parenthesis": ")",
"open curly": "{", "close curly": "}", "open Carly": "{", "close Carly": "}",
"Open Purin": "(", "Open Perrin": "(", "open Perenn": "(", "Open her end": "(",
"Open for Ren": "(", "open Karen": "(",
"Close Purin": ")", "Close Perrin": ")",  "close Perenn": ")", "close her end": ")",
"blows for Ren": ")", "expose Karen": ")", "blows perenn": ")",

I noticed that it doesn't understand "paren" or "tilde" so I just wrote down several of the recognitions it actually produced. And if I'm talking about what the doctor did to my colon, I don't want : so I figure a special phrase like "colon mark" is a better default for most of these. Alternatively, there could be a second hotkey to transcribe in "programming mode" which could perform additional replacements. I imagine saying something like "pascalcase new widget with prototype paren curly size colon ten close curly close paren semicolon newline" and getting output like NewWidgetWithPrototype({ size: 10 }); followed by a newline.
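For what it's worth, a case-insensitive replacement pass like the one described above can be sketched in a few lines of Python. `apply_replacements` is a hypothetical helper, not something that exists in whisper-writer. It matches whole phrases only and tries longer phrases first, so that a longer key such as "question mark dot" is not shadowed by "question mark".

```python
import re
from typing import Mapping


def apply_replacements(text: str, replacements: Mapping[str, str]) -> str:
    """Replace spoken phrases with symbols, ignoring case."""
    if not replacements:
        return text
    # Normalize keys so lookups work whatever casing the model emitted.
    lookup = {phrase.lower(): symbol for phrase, symbol in replacements.items()}
    # Longest phrases first, so overlapping keys resolve to the longest match.
    phrases = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


if __name__ == "__main__":
    table = {"at sign": "@", "dot mark": ".", "colon mark": ":"}
    print(apply_replacements("bob At Sign example dot mark com", table))
    print(apply_replacements("the doctor examined my colon", table))
```

Note that this sketch leaves the spaces around each replaced phrase in place; closing them up (for example turning "bob @ example" into "bob@example") would need an extra rule.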

savbell commented 6 months ago

Hi @qwertie, thanks so much for your troubleshooting and your comments! They're very helpful! :)

I originally made this app just for my own personal use after getting extremely frustrated with the built-in Windows speech-to-text and wasn't expecting it to get much attention. It's pretty cool that others are using it and finding it useful too! But I'll be upfront about the fact that I'm not going to be dedicating a large amount of time to maintaining this. I'll probably work on some small features in the immediate future, but I don't currently have the availability to work on a lot. If someone else does though, I'd be happy to start paying attention to PRs and other issues/comments!

To address your points:

Thanks again! I appreciate your excitement for the app :)

qwertie commented 6 months ago

Awesome! Thanks for your response! It was a little surprising that the Whisper people didn't make an app like this themselves, so thanks for stepping up! Sadly though, I'm way behind on my own open-source work so I can't really afford to do an extra project myself.

    "api_key": null, // if null, uses local model
    "api_model": "whisper-1",
    "local_model": "base",

    "language": null,
    "temperature": 0.0,
    "initial_prompt": null,
    "condition_on_previous_text": true,
    "verbose": false,
    "sound_device": null,
    "sample_rate": 16000,
    "silence_duration": 900,
    "writing_key_press_delay": 0.005,
    "remove_trailing_period": false,
    "add_trailing_space": true,
    "remove_capitalization": false,
    "print_to_terminal": true,

    "replacements": {               // available in all modes
        "smiley face emoji": "🙂",
        "sad face emoji": "😢",
    },

    "modes": [{
        "name": "English mode",
        "activation_key": "control+alt+space",
        "language": "en",
        "replacements": {
            "asterisk mark": "*",
            "colon mark": ":",
            "semicolon mark": ";",
            "equals sign": "=",
            "greater than sign": ">",
            "less than sign": "<",
            "dot mark": ".",
            "question mark": "?",
            "slash mark": "/",
            "backtick mark": "`",
            "exclamation mark": "!",
            "dollar sign": "$",
            "percent sign": "%",
            "caret sign": "^",
            "ampersand sign": "&",
            "vertical bar mark": "|",
            "apostrophe mark": "'", "single quotation mark": "'",
            "double quotation marks": "\"",
            "backslash mark": "\\", "back slash mark": "\\",
            "open parenthesis": "(",
            "close parenthesis": ")",
            "open curly brace": "{",
            "close curly brace": "}",
            "carriage return": "\n",
        }
    }, {
        "name": "Spanish mode",
        "activation_key": null, // disabled
        "language": "es"
    }, {
        "name": "Programmer mode",
        "activation_key": "win+shift+P",
        "add_trailing_space": false,
        // btw, the base model starts to understand the spoken words "paren" and "tilde" 
        // if they're in the initial_prompt, which is mind-bending, I mean if it wasn't 
        // trained on the words, how does it know what pronunciation to expect?
        "initial_prompt": "tilde newline class Foo brace newline public void main paren close paren curly",
        "replacements": {
            "space": " ",
            "ampersand": "&", "bitwise and": "&", "and and": "&&",
            "bitwise or": "|", "vertical bar": "|", "or or": "||",
            "asterisk": "*",
            "colon": ":",
            "semicolon": ";",
            "equals": "=",
            "greater than": ">",
            "less than": "<",
            "dot": ".",
            "question mark": "?", "question mark dot": "?.",
            "apostrophe": "'", "single quote": "'",
            "quote": "\"", "double quote": "\"",
            "backslash": "\\", "back slash": "\\",
            "slash mark": "/",
            "backtick": "`", "back tick": "`",
            "tilde": "~",
            "open paren": "(", "paren": "(",
            "close paren": ")",
            "open brace": "{", "open curly": "{", "brace": "{",
            "close brace": "}", "close curly": "}",
            "newline": "\n",
        }
    }]
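To make the intended semantics of this proposal concrete: each mode would inherit the top-level settings, override scalars such as "language" or "add_trailing_space", and add its own replacements on top of the global ones. A minimal sketch of that merge follows; `resolve_mode` is a hypothetical name and the precedence rules are my assumption about how such a config would be read, not existing behavior.

```python
def resolve_mode(config: dict, mode_name: str) -> dict:
    """Merge top-level defaults with the overrides of one named mode."""
    for mode in config.get("modes", []):
        if mode.get("name") == mode_name:
            break
    else:
        raise KeyError(f"no mode named {mode_name!r}")

    # Start from the global settings, leaving out the list of modes itself.
    settings = {k: v for k, v in config.items() if k != "modes"}
    # Scalar settings: the mode's value replaces the global one.
    settings.update({k: v for k, v in mode.items() if k != "replacements"})
    # Replacements: global entries stay available; the mode adds or overrides.
    settings["replacements"] = {
        **config.get("replacements", {}),
        **mode.get("replacements", {}),
    }
    return settings


if __name__ == "__main__":
    config = {
        "language": None,
        "add_trailing_space": True,
        "replacements": {"smiley face emoji": ":)"},
        "modes": [{
            "name": "Programmer mode",
            "add_trailing_space": False,
            "replacements": {"tilde": "~"},
        }],
    }
    print(resolve_mode(config, "Programmer mode"))
```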

Finally, it's great that it automatically downloads a model....

100%|#######################################| 139M/139M [00:08<00:00, 17.8MiB/s]

But this happened after recording. I assume it downloads when you call load_model, so I'm going to suggest calling load_model on startup before there is any transcription, partly so that if it has to download anything, the user isn't confused why nothing is happening (if they're not looking at the terminal) or why it's taking so long (if they are). Also, are you sure that whisper.load_model doesn't do anything if the model is already loaded? Reinitializing the model might hurt performance.
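One way to get both behaviors (load at startup, never load twice) is to wrap the loader in a cache. This is a sketch under assumptions: `make_cached_loader` is a hypothetical helper, and a stand-in loader is used so the example runs without Whisper installed. In the app, the wrapped function would presumably be `whisper.load_model`. As far as I know, that function caches the downloaded file on disk but builds a new model object on every call, which is why caching the object seems worthwhile.

```python
from functools import lru_cache
from typing import Callable, List


def make_cached_loader(load: Callable[[str], object]) -> Callable[[str], object]:
    """Wrap a model loader so each model name is loaded at most once."""
    @lru_cache(maxsize=None)
    def cached(name: str) -> object:
        return load(name)
    return cached


# Stand-in for whisper.load_model, recording how often it is really called.
calls: List[str] = []


def fake_load(name: str) -> object:
    calls.append(name)
    return {"model": name}


get_model = make_cached_loader(fake_load)
get_model("base")  # at startup: pays the load (and any download) cost once
get_model("base")  # at transcription time: returns the cached object
```

Calling `get_model` once before the hotkey listener starts would move any download to startup, where a progress message is less surprising.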

Speaking of performance, I expect Whisper would support my NVIDIA GeForce GTX 1660 but it seems to be running on the CPU, so any model bigger than "base" is too slow. Any ideas? While I was looking for Whisper's API documentation, I stumbled upon a Faster Whisper though, so that's cool.
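A first diagnostic step for the CPU-only behavior is to check whether PyTorch can see the GPU at all. A common cause, though I can't confirm it is the case here, is that a CPU-only PyTorch build was installed. The sketch below uses `torch.cuda.is_available()`, which is a real PyTorch call; `pick_device` itself is a hypothetical helper.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run on the GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"Whisper would run on: {device}")
    # With openai-whisper installed, the device can be passed explicitly:
    #   model = whisper.load_model("base", device=device)
```

If this prints "cpu" on a machine with an NVIDIA card, reinstalling PyTorch with a CUDA-enabled build is the usual next thing to try.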