louis030195 closed this issue 1 week ago.
Comment /attempt #431 with your implementation plan, then include /claim #431 in the PR body to claim the bounty. Thank you for contributing to mediar-ai/screenpipe!
Attempt | Started (GMT+0) | Solution |
---|---|---|
🔴 @EzraEllette | Oct 7, 2024, 4:49:13 AM | WIP |
I've heard that a well-implemented version of the new Whisper Turbo is heavily optimized and significantly faster than real time, certainly on my hardware (an M1 Mac). https://github.com/openai/whisper/discussions/2363 https://www.markhneedham.com/blog/2024/10/02/insanely-fast-whisper-running-openai-whisper-turbo-mac/
Certainly, we should prioritize quality. I noticed a notable difference in output quality when running the same Whisper model on a full recorded file versus the current implementation, which processes small chunks. I've done some digging, and it seems that a 2-second overlap between chunks plus a check for repeated words before finalizing the transcript should do the job.
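A minimal sketch of that chunk-merging idea (the function name and the word-based matching are assumptions, not screenpipe's actual code): transcribe overlapping chunks, then drop the words at the start of the new chunk that repeat the tail of the previous one.

```rust
/// Hypothetical helper: merge two transcripts from overlapping audio chunks by
/// removing the longest run of words that both ends `prev` and starts `next`.
fn merge_overlapping_transcripts(prev: &str, next: &str, max_overlap_words: usize) -> String {
    let prev_words: Vec<&str> = prev.split_whitespace().collect();
    let next_words: Vec<&str> = next.split_whitespace().collect();
    let limit = max_overlap_words.min(prev_words.len()).min(next_words.len());

    // Find the longest suffix of `prev` that equals a prefix of `next`.
    let mut overlap = 0;
    for k in (1..=limit).rev() {
        if prev_words[prev_words.len() - k..] == next_words[..k] {
            overlap = k;
            break;
        }
    }

    let mut merged = prev_words;
    merged.extend_from_slice(&next_words[overlap..]);
    merged.join(" ")
}
```

With a 2 s overlap, a case-insensitive variant of this would also catch the repeated words mentioned above.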
/attempt #431
I noticed that the existing whisper-large model only works with English. Could we support other languages or different Whisper models as well?
/attempt #431
Any update on this?
I want to add diarization using https://github.com/thewh1teagle/pyannote-rs, but it might overlap with this issue.
Diarization is something I want from screenpipe too, along with speaker verification. I meant to cancel my attempt on this since I haven't had time to work on it, but Algora's cancel button doesn't work.
Oh okay, I'll have a look at this issue plus diarization, etc., then.
@louis030195 Are you actively working on this? I will have some time to get started tonight.
@EzraEllette A bit. I wrote a simple unit test to measure accuracy on short WAV files, but it doesn't fully reflect real screenpipe usage:
https://github.com/mediar-ai/screenpipe/blob/main/screenpipe-audio/tests/accuracy_test.rs
I improved accuracy a bit using audio normalization, from 57% to 62% with Whisper Turbo (Deepgram is at 82%).
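For reference, a minimal sketch of the kind of loudness normalization described here (simple peak normalization of the f32 buffer before inference); the actual change in screenpipe may normalize differently, e.g. by RMS:

```rust
/// Hypothetical helper: scale a mono f32 buffer so its peak sits at `target_peak`
/// (e.g. 0.95) before handing it to the Whisper model.
fn normalize_peak(samples: &mut [f32], target_peak: f32) {
    let peak = samples.iter().fold(0.0_f32, |acc, &s| acc.max(s.abs()));
    if peak > f32::EPSILON {
        let gain = target_peak / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}
```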
I've been thinking of switching from candle to whisper.cpp (via bindings) and adding diarization, etc., largely by adapting the code here: https://github.com/thewh1teagle/vibe/blob/main/core/src/transcribe.rs
Something else we'd like to do in the future is stream transcriptions through a WebSocket in the server for different use cases, which might affect the architecture, but I think the main priority right now is to improve quality.
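As a rough illustration of that streaming idea, here is a minimal sketch assuming axum and a tokio broadcast channel; the route name and the channel wiring are assumptions, not the current server code:

```rust
use axum::{
    extract::ws::{Message, WebSocket, WebSocketUpgrade},
    extract::State,
    response::IntoResponse,
    routing::get,
    Router,
};
use tokio::sync::broadcast;

// Hypothetical route: clients connect and receive each finished transcription chunk.
fn router(tx: broadcast::Sender<String>) -> Router {
    Router::new()
        .route("/ws/transcriptions", get(ws_handler))
        .with_state(tx)
}

async fn ws_handler(
    State(tx): State<broadcast::Sender<String>>,
    ws: WebSocketUpgrade,
) -> impl IntoResponse {
    ws.on_upgrade(move |socket| stream_transcripts(socket, tx.subscribe()))
}

async fn stream_transcripts(mut socket: WebSocket, mut rx: broadcast::Receiver<String>) {
    // Forward transcripts published by the transcription pipeline to this client.
    while let Ok(text) = rx.recv().await {
        if socket.send(Message::Text(text.into())).await.is_err() {
            break; // client disconnected
        }
    }
}
```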
I won't have much time left today, and tomorrow morning I have a bunch of calls, so feel free to try things.
I have been meaning to make a pipe, but I need real-time data.
I'll pull what you have and see what I can do. Also, I'm going to get a baseline on whisper-cpp's accuracy and go from there.
@louis030195 I'm seeing a ~5% improvement with spectral subtraction using the last 100 ms frame of unknown speech status. I might try using the last few hundred ms rather than just one frame, but for now here it is:
I'm implementing dynamic range compression to see if that helps.
Not seeing a difference.
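For anyone who wants to experiment with the spectral-subtraction idea mentioned a few comments up, here is a minimal sketch assuming the rustfft crate; frame sizes, windowing, and the noise-profile selection are simplified, and this is not the code from the PR:

```rust
use rustfft::{num_complex::Complex, FftPlanner};

/// Magnitude spectrum of a frame, usable as a noise profile when the frame is
/// believed to contain no speech (e.g. the trailing ~100 ms of a chunk).
fn magnitude_spectrum(frame: &[f32]) -> Vec<f32> {
    let mut planner = FftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(frame.len());
    let mut buf: Vec<Complex<f32>> = frame.iter().map(|&x| Complex::new(x, 0.0)).collect();
    fft.process(&mut buf);
    buf.iter().map(|c| c.norm()).collect()
}

/// Subtract an estimated noise magnitude spectrum from a frame and resynthesize it.
fn spectral_subtract(frame: &[f32], noise_mag: &[f32]) -> Vec<f32> {
    let n = frame.len();
    let mut planner = FftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(n);
    let ifft = planner.plan_fft_inverse(n);

    let mut buf: Vec<Complex<f32>> = frame.iter().map(|&x| Complex::new(x, 0.0)).collect();
    fft.process(&mut buf);

    for (bin, noise) in buf.iter_mut().zip(noise_mag) {
        let mag = bin.norm();
        let cleaned = (mag - noise).max(0.0); // subtract the noise estimate, floor at zero
        if mag > f32::EPSILON {
            *bin *= cleaned / mag; // keep the phase, rescale the magnitude
        }
    }

    ifft.process(&mut buf);
    // rustfft's inverse transform is unnormalized, so divide by the frame length.
    buf.iter().map(|c| c.re / n as f32).collect()
}
```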
Strange. I usually get very good results with Whisper large, but when working with long files (15 min+). Have you tried exceeding the 30 s time window with a 2 s overlap?
There are errors in the middle of the transcripts, so I'm focusing on those through audio preprocessing.
I should mention that I changed the sinc interpolation to cubic, which is drastically slower than linear. I updated my PR to reflect that.
I'm trying some other sampling changes but I'm doubtful that it will improve anything.
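To illustrate the linear-versus-cubic interpolation trade-off mentioned above, here is a minimal Catmull-Rom resampler sketch; the real resampler in screenpipe (and its sinc settings) lives elsewhere, so treat this as an assumption-laden example:

```rust
/// Hypothetical sketch: resample a mono buffer with Catmull-Rom (cubic) interpolation.
/// Smoother than linear interpolation, but noticeably more work per output sample.
fn resample_cubic(input: &[f32], in_rate: f32, out_rate: f32) -> Vec<f32> {
    if input.is_empty() || in_rate <= 0.0 || out_rate <= 0.0 {
        return Vec::new();
    }
    let ratio = in_rate / out_rate;
    let out_len = (input.len() as f32 / ratio) as usize;
    let sample = |i: isize| input[i.clamp(0, input.len() as isize - 1) as usize];

    (0..out_len)
        .map(|n| {
            let pos = n as f32 * ratio;
            let i = pos.floor() as isize;
            let t = pos - pos.floor();
            let (p0, p1, p2, p3) = (sample(i - 1), sample(i), sample(i + 1), sample(i + 2));
            // Catmull-Rom spline through the four neighbouring input samples.
            0.5 * (2.0 * p1
                + (-p0 + p2) * t
                + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t * t
                + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t * t * t)
        })
        .collect()
}
```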
Deepgram result:
At least we beat Deepgram on the last sample 😆
It's worth mentioning that the Levenshtein distance would be lower if we sanitized the transcription output to remove the hallucinations and timestamps.
I think we can assume that if a transcript has two segments with the same timestamp, the shorter segment should be removed. Other than that, I'm not sure what you want to do with the timestamps.
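A minimal sketch of that sanitization rule, assuming the segments have already been parsed out of the raw output; the `Segment` type and the filler list are illustrative, not screenpipe's types:

```rust
#[derive(Debug)]
struct Segment {
    start_ms: u64,
    text: String,
}

/// Hypothetical cleanup: drop known filler hallucinations, then when two segments
/// share the same start timestamp, keep only the longer one.
fn sanitize_segments(mut segments: Vec<Segment>) -> Vec<Segment> {
    let fillers = ["thank you", "thanks for watching"]; // assumed hallucination phrases
    segments.retain(|s| !fillers.contains(&s.text.trim().to_lowercase().as_str()));
    segments.sort_by_key(|s| s.start_ms);

    let mut out: Vec<Segment> = Vec::new();
    for seg in segments {
        let same_start = out.last().map_or(false, |last| last.start_ms == seg.start_ms);
        if same_start {
            let last = out.last_mut().expect("checked non-empty above");
            if seg.text.len() > last.text.len() {
                *last = seg; // keep the longer of the two duplicate-timestamp segments
            }
        } else {
            out.push(seg);
        }
    }
    out
}
```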
I think one common issue with screenpipe is when someone speaks, stops, and then starts again within a 30 s chunk: Whisper will hallucinate "Thank you" in the silences. That's one thing we should solve somehow, probably through audio-processing hacks.
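One cheap audio-side mitigation (a sketch only; the threshold value is made up): compute the RMS energy of a chunk and skip transcription when it is essentially silence, so Whisper never sees the quiet stretches that trigger "Thank you".

```rust
/// Hypothetical check: treat a chunk as silence if its RMS energy is below a threshold.
fn is_mostly_silence(samples: &[f32], rms_threshold: f32) -> bool {
    if samples.is_empty() {
        return true;
    }
    let energy: f32 = samples.iter().map(|s| s * s).sum();
    let rms = (energy / samples.len() as f32).sqrt();
    rms < rms_threshold
}
```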
Regarding the current accuracy metrics, I think we could add a second unit test that contains audio recordings from screenpipe (say 4 of them) and either write the expected transcripts manually or use some online transcription service to create them (which makes mistakes too). For the current unit test, some of the expected transcripts were produced with Deepgram, which also makes a few mistakes.
Honestly, even as a human I sometimes struggle to transcribe some of the audio recordings when people have strong accents.
Also, something else we could eventually do is fix transcripts with an LLM in real time, but I'd expect it to be hard to do well: it shouldn't take more than 1 GB of memory, shouldn't add hallucinations, shouldn't overload the GPU/CPU, etc.
Another reason I wanted to switch to whisper.cpp is that it has more features, like the initial prompt:
https://github.com/ggerganov/whisper.cpp/discussions/348
which we could expose as a screenpipe CLI arg and in the app UI settings, e.g. "yo my name is louis, sometimes i talk about screenpipe, i have a french accent so make sure to take this into account ..."
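A minimal sketch of what that flag could look like with clap; the argument name is an assumption, not an existing screenpipe option:

```rust
use clap::Parser;

#[derive(Parser)]
struct Cli {
    /// Free-form context forwarded to whisper.cpp as the initial prompt
    /// (names, vocabulary, accent hints, ...). Hypothetical flag name.
    #[arg(long)]
    audio_transcription_prompt: Option<String>,
}

fn main() {
    let cli = Cli::parse();
    if let Some(prompt) = cli.audio_transcription_prompt {
        // In a real pipeline this would be set on the whisper.cpp params.
        println!("initial prompt: {prompt}");
    }
}
```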
Candle, meanwhile, is really barebones; we'd have to reimplement everything ourselves, sadly, and we don't have time to turn into AI researchers at this point.
I guess diarization would also improve accuracy a bit, by running transcription only on the frames that belong to a specific voice.
Some rough thoughts. What do you think the next steps are, @EzraEllette?
@louis030195 I have a couple of meetings tonight, but I'll give you some information afterwards.
Right now it makes more sense to use tools that have more features and are actively maintained by other developers when possible.
I'll contact you once my meetings are finished.
@EzraEllette
Do you want to refactor to always record audio and send chunks for transcription?
Also, I'm interested in whether there could be a way to stream audio and transcriptions through the API, for extensibility reasons.
Also, the #170 use case is important.
> @EzraEllette
> Do you want to refactor to always record audio and send chunks for transcription?
> Also, I'm interested in whether there could be a way to stream audio and transcriptions through the API, for extensibility reasons.
> Also, the #170 use case is important.
Yes. I want to make that refactor and explore streaming.
Adding some context.
Some user feedback:
Some users had issues with language, e.g. #451, but I think #469 would solve it?
Diarization: https://github.com/thewh1teagle/pyannote-rs - it can probably slightly increase accuracy too.
Other issues with audio:
On my side, I want to prioritize high-quality audio data infrastructure that ideally works across OSes (macOS and Windows at least); UI work is a lower priority.
Speaker Identification and Diarization will be a large undertaking.
Chunking the audio and overlapping is working for now.
Here are some of my thoughts about streaming audio data:
- Refactoring audio recording to stream the audio data will enable other parts of screenpipe to ingest audio recording data.
- In addition to streaming: instead of cutting off the audio at exactly X seconds, after X seconds wait for a VAD-detected pause before starting a new chunk, so transcriptions don't end in the middle of a sentence (see the sketch below). Overlap is effective, but it may not be necessary with streaming and VAD-based chunk boundaries.
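A rough sketch of that chunking policy; the frame handling and the VAD integration are assumptions, and a real implementation would plug in silero-vad or similar for the `is_speech` decision:

```rust
/// Hypothetical chunker: after `min_chunk_secs` of audio, close the chunk at the
/// first VAD-detected pause; never exceed `max_chunk_secs`.
struct PauseAwareChunker {
    buffer: Vec<f32>,
    sample_rate: usize,
    min_chunk_secs: usize,
    max_chunk_secs: usize,
}

impl PauseAwareChunker {
    /// `is_speech` would come from a VAD decision on `frame`.
    /// Returns a finished chunk when it is time to transcribe.
    fn push_frame(&mut self, frame: &[f32], is_speech: bool) -> Option<Vec<f32>> {
        self.buffer.extend_from_slice(frame);
        let secs = self.buffer.len() / self.sample_rate;
        let past_min = secs >= self.min_chunk_secs;
        let past_max = secs >= self.max_chunk_secs;
        if (past_min && !is_speech) || past_max {
            // Cut at a pause when possible so transcriptions don't end mid-sentence.
            return Some(std::mem::take(&mut self.buffer));
        }
        None
    }
}
```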
Agree, let's not do speaker identification and diarization for now.
Agree with streaming.
How we record & transcribe now:
Definition of done:
Possible ways to increase accuracy:
Make sure to measure first and optimize second, not the other way around. No "it looks better after my change"; I only trust numbers. Thank you.
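For concreteness, here is a sketch of the kind of word-level accuracy number quoted above (1 minus word error rate, via Levenshtein distance over words); the existing accuracy_test.rs may compute its metric differently:

```rust
/// Hypothetical metric: word-level accuracy = 1 - (word edit distance / reference length).
fn word_accuracy(expected: &str, actual: &str) -> f64 {
    let e: Vec<&str> = expected.split_whitespace().collect();
    let a: Vec<&str> = actual.split_whitespace().collect();
    if e.is_empty() {
        return if a.is_empty() { 1.0 } else { 0.0 };
    }

    // Classic Levenshtein distance computed over words instead of characters.
    let mut dp = vec![vec![0usize; a.len() + 1]; e.len() + 1];
    for i in 0..=e.len() {
        dp[i][0] = i;
    }
    for j in 0..=a.len() {
        dp[0][j] = j;
    }
    for i in 1..=e.len() {
        for j in 1..=a.len() {
            let cost = if e[i - 1].eq_ignore_ascii_case(a[j - 1]) { 0 } else { 1 };
            dp[i][j] = (dp[i - 1][j] + 1)
                .min(dp[i][j - 1] + 1)
                .min(dp[i - 1][j - 1] + cost);
        }
    }

    let errors = dp[e.len()][a.len()] as f64;
    (1.0 - errors / e.len() as f64).max(0.0)
}
```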
/bounty 300