dariox1337 opened 2 months ago
I just stumbled upon this project and then this rewrite. Kudos to both: a very useful project, and very easy to get up and running! Some suggestions/questions for improvement.
I'll speak only for my implementation.

1. OpenAI-based correction can be added as a post-processing script. Here is a possible implementation (NOT TESTED):
```python
import openai

from post_processing_base import PostProcessor


class Processor(PostProcessor):
    def __init__(self):
        # In a real-world scenario, you'd want to load this from a secure config
        openai.api_key = 'your-api-key-here'

    def process(self, text: str) -> str:
        try:
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that corrects transcription errors."},
                    {"role": "user", "content": f"Please fix any obvious mistakes in this transcribed text, maintaining the original meaning: '{text}'"}
                ]
            )
            corrected_text = response.choices[0].message['content'].strip()
            return corrected_text
        except Exception as e:
            print(f"Error in AI correction: {str(e)}")
            # If there's an error, return the original text
            return text
```
Just save this script under a new name in scripts and it'll appear in the list of post-processing scripts.
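One caveat: `openai.ChatCompletion` is the old, pre-1.0 interface of the openai package. If you have openai 1.0 or newer installed, the equivalent call should look roughly like this (also untested):

```python
from openai import OpenAI

# Same idea with the openai>=1.0 client interface
client = OpenAI(api_key='your-api-key-here')
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that corrects transcription errors."},
        {"role": "user", "content": f"Please fix any obvious mistakes in this transcribed text, maintaining the original meaning: '{text}'"}
    ]
)
corrected_text = response.choices[0].message.content.strip()
```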
2. Ollama can also be implemented very easily. Here is a possible implementation (NOT TESTED):
```python
import json

import requests

from post_processing_base import PostProcessor


class Processor(PostProcessor):
    def __init__(self):
        self.api_base = "http://localhost:11434/api"  # Default Ollama API address
        self.model = "llama2"  # Or whatever model you're using

    def process(self, text: str) -> str:
        try:
            response = requests.post(
                f"{self.api_base}/generate",
                json={
                    "model": self.model,
                    "prompt": f"Please fix any obvious mistakes in this transcribed text, maintaining the original meaning: '{text}'",
                    "stream": False
                }
            )
            response.raise_for_status()  # Raise an exception for bad status codes
            result = response.json()
            corrected_text = result['response'].strip()
            return corrected_text
        except requests.RequestException as e:
            print(f"Error in Ollama API call: {str(e)}")
            return text
        except json.JSONDecodeError as e:
            print(f"Error decoding Ollama API response: {str(e)}")
            return text
        except KeyError as e:
            print(f"Unexpected response format from Ollama API: {str(e)}")
            return text
        except Exception as e:
            print(f"Unexpected error in AI correction: {str(e)}")
            return text
```
Put it in "scripts" and it'll appear in settings under the file name you choose.
**The only issue is testing this code.** The implementation is very simple, but I don't have the means to test these things right now.
While working on streaming transcription, I found a very tricky bug with keyboard simulation. Specifically, hotkeys affect simulated key presses.
If you use "ctrl+shift" as the trigger hotkey and the simulated keyboard tries to type "hello", your programs will register it as ctrl+shift+h, ctrl+shift+e, etc. This is most obvious in streaming mode, but it can be triggered in non-streaming mode as well: press ctrl+shift to start recording, then release shift or ctrl while still holding the other key; transcription will begin, and the output will be affected by whichever modifier you keep holding.

In an attempt to fix this, I already tried interacting with /dev/uinput directly, but it looks like the Linux input system merges modifier state from all keyboards (virtual and physical). I'm looking for a solution and keeping this PR as a draft until I find something.
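For reference, here is roughly the kind of direct-uinput experiment I mean; a minimal sketch using the python-evdev bindings (the device capabilities and key codes are just illustrative). Even when the virtual device emits key-up events for the modifiers before typing, applications still see the physically held ctrl+shift, because the modifier state is merged across devices:

```python
from evdev import UInput, ecodes as e

# Virtual keyboard exposing only the keys this sketch needs.
# Opening it requires write access to /dev/uinput.
caps = {e.EV_KEY: [e.KEY_LEFTCTRL, e.KEY_LEFTSHIFT, e.KEY_H]}

with UInput(caps, name="virtual-kbd") as ui:
    # Emit key-up events for the hotkey modifiers before typing...
    ui.write(e.EV_KEY, e.KEY_LEFTCTRL, 0)   # 0 = release
    ui.write(e.EV_KEY, e.KEY_LEFTSHIFT, 0)
    ui.syn()
    # ...then type "h". The physical ctrl+shift are still held on the
    # real keyboard, so applications register ctrl+shift+h anyway.
    ui.write(e.EV_KEY, e.KEY_H, 1)          # 1 = press
    ui.write(e.EV_KEY, e.KEY_H, 0)
    ui.syn()
```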
When I tried to use the OpenAI API functionality, it did not work because the audio was barely understandable. After listening to it, I was amazed at what the model could still decipher. There seems to be some missing normalization and conversion.
Using this worked:
```python
import numpy as np

if audio_data.dtype == np.float32 and np.abs(audio_data).max() <= 1.0:
    # Data is already in the correct format
    pass
elif audio_data.dtype == np.float32:
    # Data is float32 but may not be in [-1, 1] range
    audio_data = np.clip(audio_data, -1.0, 1.0)
elif audio_data.dtype in [np.int16, np.int32]:
    # Convert integer PCM to float32 in [-1, 1]
    audio_data = audio_data.astype(np.float32) / np.iinfo(audio_data.dtype).max
else:
    raise ValueError(f"Unsupported audio format: {audio_data.dtype}")
```
@go-run-jump Fixing the OAI API backend is tricky because I can't test it. But I tried saving the audio both before and after conversion in the Faster Whisper backend, and both files sounded completely normal. Maybe the issue is with your microphone?
I have two microphones. The one integrated in the laptop chassis records really shitty audio, especially when the fan is spinning fast. That's why I'm using a usb microphone.
Also, you mentioned in another thread that the audio sounded faster than real time for you; perhaps that's because you forgot to specify the sample rate? Recording is done at 16 kHz by default, so if you replay it at 44.1 kHz it'll be about 2.75× faster.
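If you want to verify, something like this should do it (a sketch assuming the soundfile package; your playback path may differ):

```python
import soundfile as sf

# Save the raw capture buffer at the rate it was actually recorded.
# If a player is told 44.1 kHz for 16 kHz data, playback is ~2.75x too fast.
sf.write("debug_recording.wav", audio_data, samplerate=16000)

# Reading it back reports the sample rate stored in the file header
data, rate = sf.read("debug_recording.wav")
print(rate)  # should print 16000
```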
Anyway, I added the changes you proposed. I hope it helps.
Hello @dariox1337, nice work on the rewrite of this already awesome project using an elegant architecture. Have you tried to pack your fork as a single executable file using something like PyInstaller? If not, do you think it is possible to do so? Maybe it would be better, as a first step, to generate an executable that restricts the project to just the OAI API functionality and not local models, but I don't actually know if that would be needed.
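Something like this is what I have in mind, if it works at all (untested, and the `run.py` entry-point name is just a guess):

```python
import PyInstaller.__main__

# Untested sketch: bundle everything into one executable.
# "run.py" is a guessed entry point; use the fork's actual one.
PyInstaller.__main__.run([
    "run.py",
    "--onefile",
    "--name", "whisper-writer",
])
```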
This is almost a different program that happens to use WhisperWriter assets. I'm not sure if you're interested in merging it, but so as not to be called ungrateful, I'm opening this pull request. My motivation was to add profiles (multiple transcription setups that can be switched with a dedicated shortcut), and while doing so, I decided to restructure the whole program flow. Here is the design doc that gives a high-level overview.
Key Features
P.S. 99.9% of the code is generated by AI.
UPDATE: Just for anyone interested, my fork now supports streaming transcription with Faster Whisper and VOSK (new backend). The GUI has been updated to PyQt6, and the Python dependencies have been updated to support Python 3.12.