savbell / whisper-writer

💬📝 A small dictation app using OpenAI's Whisper speech recognition model.
GNU General Public License v3.0

Complete Rewrite #61

Open dariox1337 opened 2 months ago

dariox1337 commented 2 months ago

This is almost a different program that happens to use WhisperWriter assets. Not sure if you're interested in merging it, but so as not to seem ungrateful, I'm opening this pull request. My motivation was to add profiles (multiple transcription setups that can be switched with a dedicated shortcut), and while doing so, I decided to restructure the whole program flow. Here is the design doc that gives a high-level overview.

Key Features

P.S. 99.9% of the code is generated by AI.

UPDATE: Just for anyone interested, my fork now supports streaming transcription with Faster Whisper and VOSK (a new backend). The GUI has been updated to PyQt6, and the Python dependencies have been updated to support Python 3.12.
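To give a sense of what streaming transcription looks like with VOSK: partial hypotheses are available while audio is still being captured. Here is a rough illustration using the plain VOSK API (this is not my actual backend code; the model path and the 16 kHz mono capture settings are assumptions):

```python
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

audio_queue = queue.Queue()

def callback(indata, frames, time, status):
    # sounddevice delivers raw 16-bit PCM chunks; hand them to the recognizer loop
    audio_queue.put(bytes(indata))

model = Model("model")                      # path to an unpacked VOSK model (assumption)
recognizer = KaldiRecognizer(model, 16000)  # must match the capture sample rate

with sd.RawInputStream(samplerate=16000, blocksize=8000, dtype="int16",
                       channels=1, callback=callback):
    while True:
        data = audio_queue.get()
        if recognizer.AcceptWaveform(data):
            print(recognizer.Result())         # finalized segment
        else:
            print(recognizer.PartialResult())  # streaming partial hypothesis
```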

oyhel commented 2 months ago

I just stumbled upon this project and then this rewrite. Kudos to both; it's a very useful project and very easy to get up and running! Some suggestions/questions for improvement:

  1. Would it be possible to add the ID of an OpenAI assistant to perform post-processing as part of the transcription? I understand this is possible using the scripts function, but I assume this means the text would need to be passed back and forth to OpenAI?
  2. For a similar post-processing implementation using local Ollama etc., I assume passing the data to Ollama for post-processing using a separate script would be the preferred approach?
dariox1337 commented 2 months ago

I'll speak only for my implementation.

  1. I'm not that familiar with the OpenAI API. Does its Whisper API provide an option to redirect the transcription result to an assistant on the server side without sending you the raw transcription?
    • If yes, it can be implemented. All you need to do is implement the logic in the openai backend; additional config parameters need to be added to config_schema.yaml.
    • If no, it can be implemented with post-processing scripts. Here is a quick example (NOT TESTED):
      
```python
import openai

from post_processing_base import PostProcessor


class Processor(PostProcessor):
    def __init__(self):
        # Initialize the OpenAI API key.
        # In a real-world scenario, you'd want to load this from a secure config.
        openai.api_key = 'your-api-key-here'

    def process(self, text: str) -> str:
        try:
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that corrects transcription errors."},
                    {"role": "user", "content": f"Please fix any obvious mistakes in this transcribed text, maintaining the original meaning: '{text}'"}
                ]
            )

            corrected_text = response.choices[0].message['content'].strip()
            return corrected_text
        except Exception as e:
            print(f"Error in AI correction: {str(e)}")
            # If there's an error, return the original text
            return text
```
Just save this script under a new name in scripts and it'll appear in the list of post-processing scripts.

2. Ollama can also be implemented very easily. Here is a possible implementation (NOT TESTED):

```python
import json

import requests

from post_processing_base import PostProcessor


class Processor(PostProcessor):
    def __init__(self):
        self.api_base = "http://localhost:11434/api"  # Default Ollama API address
        self.model = "llama2"  # Or whatever model you're using

    def process(self, text: str) -> str:
        try:
            response = requests.post(
                f"{self.api_base}/generate",
                json={
                    "model": self.model,
                    "prompt": f"Please fix any obvious mistakes in this transcribed text, maintaining the original meaning: '{text}'",
                    "stream": False
                }
            )
            response.raise_for_status()  # Raise an exception for bad status codes

            result = response.json()
            corrected_text = result['response'].strip()
            return corrected_text
        except requests.RequestException as e:
            print(f"Error in Ollama API call: {str(e)}")
            return text
        except json.JSONDecodeError as e:
            print(f"Error decoding Ollama API response: {str(e)}")
            return text
        except KeyError as e:
            print(f"Unexpected response format from Ollama API: {str(e)}")
            return text
        except Exception as e:
            print(f"Unexpected error in AI correction: {str(e)}")
            return text
```

Put it in "scripts" and it'll appear in settings under the file name you choose.

**The only issue is testing this code.** The implementation is very simple, but I don't have the means to test these things right now.
dariox1337 commented 2 months ago

While working on streaming transcription, I found a very tricky bug with keyboard simulation. Specifically, hotkeys affect simulated key presses.

If you use "ctrl+shift" as the trigger hotkey and the simulated keyboard tries to type "hello", applications will register it as ctrl+shift+h, ctrl+shift+e, etc. This is most obvious in streaming mode, but it can be triggered in non-streaming mode as well: press ctrl+shift to trigger recording, release shift or ctrl while still holding the other key, and transcription will begin with the output affected by whichever modifier you keep holding.

In an attempt to fix this, I already tried interacting with /dev/uinput directly, but it looks like the Linux input system merges modifiers from all keyboards (virtual and physical). I'm looking for a solution and keeping this PR as a draft until I find something.
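For illustration, one naive mitigation (untested, sketched here with pynput and assuming a ctrl+shift hotkey) would be to explicitly release the modifiers right before injecting text. Given the modifier merging described above, this may not be a complete fix:

```python
from pynput.keyboard import Controller, Key

keyboard = Controller()

def type_transcription(text: str) -> None:
    # Release the hotkey modifiers before simulating key presses, otherwise
    # applications may see ctrl+shift+<letter> instead of plain letters.
    keyboard.release(Key.ctrl)
    keyboard.release(Key.shift)
    keyboard.type(text)
```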

go-run-jump commented 1 month ago

When I tried to use the OpenAI API functionality, it did not work because the audio was barely understandable. After listening to it, I was amazed at what the model could still decipher. There seems to be some normalization and conversion missing.

Using this worked:

        if audio_data.dtype == np.float32 and np.abs(audio_data).max() <= 1.0:
            # Data is already in the correct format
            pass
        elif audio_data.dtype == np.float32:
            # Data is float32 but may not be in [-1, 1] range
            audio_data = np.clip(audio_data, -1.0, 1.0)
        elif audio_data.dtype in [np.int16, np.int32]:
            # Convert integer PCM to float32
            audio_data = audio_data.astype(np.float32) / np.iinfo(audio_data.dtype).max
        else:
            raise ValueError(f"Unsupported audio format: {audio_data.dtype}")
dariox1337 commented 1 month ago

@go-run-jump Fixing the OpenAI API backend is tricky because I can't test it. But I tried saving audio both before and after conversion in the Faster Whisper backend. Both audio files sounded completely normal. Maybe the issue is with your microphone?

I have two microphones. The one integrated in the laptop chassis records really poor audio, especially when the fan is spinning fast. That's why I'm using a USB microphone.

Also, you mentioned in another thread that the audio sounded faster than real time for you; perhaps that's because you forgot to specify the sample rate? Recording is done at 16 kHz by default, so if you replay it at 44.1 kHz it'll be about 2.75x faster.
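A quick way to check is to dump the captured buffer to disk with an explicit sample rate (a minimal sketch assuming the soundfile package and the default 16 kHz capture):

```python
import soundfile as sf

# audio_data: float32 numpy array as captured by the recorder.
# Writing with the capture rate (16 kHz here) keeps playback at real-time speed;
# opening the same data as 44.1 kHz would sound roughly 2.75x too fast.
sf.write("debug_capture.wav", audio_data, 16000)
```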

Anyway, I added the changes you proposed. I hope it helps.

tpougy commented 2 weeks ago

Hello @dariox1337, nice work on the rewrite of this already awesome project with an elegant architecture. Have you tried packing your fork as a single executable file using something like PyInstaller? If not, do you think it is possible to do so? Maybe a good first step would be to generate an executable restricted to just the OpenAI API functionality and not local models, but I don't actually know whether that would be needed.
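For reference, I was imagining something along these lines (a rough, untested sketch; the run.py entry point, name, and options are just assumptions, and local-model backends would probably need extra PyInstaller hooks):

```python
import PyInstaller.__main__

# Build a single-file, windowed executable from the project's entry script.
PyInstaller.__main__.run([
    "run.py",            # assumed entry point; adjust to the fork's actual script
    "--onefile",
    "--windowed",
    "--name", "whisper-writer",
])
```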