mediar-ai / screenpipe

one API to get all user desktop data (local, cross platform, 24/7, screen, voice, keyboard, mouse, camera recording). sandboxed js plugin system. keyboard and mouse control
https://screenpi.pe
MIT License

$300 - improve local transcription accuracy #431

Closed · louis030195 closed this issue 1 week ago

louis030195 commented 1 month ago

how we record & transcribe now:

  1. record chunk of audio of 30s on each device
  2. use local voice activity detection model to extract speech frames, if not enough, skip transcription
  3. transcribe speech frames
  4. encode audio to mp4
  5. save transcription + mp4 source to db
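
a rough sketch of this loop in rust, where the closures stand in for hypothetical components (recorder, VAD, whisper, mp4 encoder, db), none of which are actual screenpipe APIs:

```rust
use std::time::Duration;

/// hypothetical per-device loop matching the five steps above; the closures
/// stand in for real screenpipe components, which are not shown here
fn run_device_loop(
    record_chunk: impl Fn(Duration) -> Vec<f32>,
    vad_filter: impl Fn(&[f32]) -> Vec<f32>,
    transcribe: impl Fn(&[f32]) -> String,
    encode_mp4: impl Fn(&[f32]) -> String, // returns the saved file path
    save_to_db: impl Fn(&str, &str),
    min_speech_samples: usize,
) {
    loop {
        // 1. record a 30s chunk on this device
        let chunk = record_chunk(Duration::from_secs(30));

        // 2. keep only the frames the VAD classifies as speech
        let speech = vad_filter(&chunk);
        if speech.len() < min_speech_samples {
            continue; // not enough speech: skip transcription
        }

        // 3. transcribe the speech frames
        let transcript = transcribe(&speech);

        // 4. encode the full chunk to mp4
        let mp4_path = encode_mp4(&chunk);

        // 5. save transcription + mp4 source to the db
        save_to_db(&transcript, &mp4_path);
    }
}
```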

definition of done:

possible ways to increase accuracy:

make sure to measure first, then optimise second, not the other way around, no "it looks better after my change", i only trust numbers, thank you

/bounty 300

linear[bot] commented 1 month ago

MED-156 $300 - improve local transcription accuracy

algora-pbc[bot] commented 1 month ago

💎 $300 bounty • Screenpi.pe

Steps to solve:

  1. Start working: Comment /attempt #431 with your implementation plan
  2. Submit work: Create a pull request including /claim #431 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to mediar-ai/screenpipe!


Attempt started (GMT+0): 🔴 @EzraEllette, Oct 7, 2024, 4:49:13 AM (WIP)

TanGentleman commented 1 month ago

I've heard that a well implemented form of the new Whisper Turbo is very well optimized and significantly faster than realtime, definitely so on my hardware (an M1 Mac). https://github.com/openai/whisper/discussions/2363 https://www.markhneedham.com/blog/2024/10/02/insanely-fast-whisper-running-openai-whisper-turbo-mac/

louis030195 commented 1 month ago

> I've heard that a well implemented form of the new Whisper Turbo is very well optimized and significantly faster than realtime, definitely so on my hardware (an M1 Mac). openai/whisper#2363 https://www.markhneedham.com/blog/2024/10/02/insanely-fast-whisper-running-openai-whisper-turbo-mac/

yes, but it's more the quality side i'm thinking of here than speed

NicodemPL commented 1 month ago

Certainly, we should prioritize quality. I noticed a notable difference in output quality when running the same Whisper model on a full recorded file versus the current implementation's small chunks. I've done some digging, and it seems that a 2s overlap between chunks, plus a check for repetitive words before finalizing the transcript, should do the job.
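
A minimal sketch of that idea, assuming raw f32 samples and a word-level repetition check at the chunk seam (names and function shapes are hypothetical, not screenpipe APIs):

```rust
const CHUNK_SECS: usize = 30;
const OVERLAP_SECS: usize = 2;

/// split samples into 30s chunks where each chunk repeats the last
/// 2s of the previous one
fn chunk_with_overlap(samples: &[f32], sample_rate: usize) -> Vec<&[f32]> {
    let chunk_len = CHUNK_SECS * sample_rate;
    let step = (CHUNK_SECS - OVERLAP_SECS) * sample_rate;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < samples.len() {
        let end = (start + chunk_len).min(samples.len());
        chunks.push(&samples[start..end]);
        if end == samples.len() {
            break;
        }
        start += step;
    }
    chunks
}

/// drop words at the start of `next` that repeat the tail of `prev`,
/// so the overlapped audio isn't transcribed twice
fn merge_transcripts(prev: &str, next: &str) -> String {
    let prev_words: Vec<&str> = prev.split_whitespace().collect();
    let next_words: Vec<&str> = next.split_whitespace().collect();
    let max_ovl = prev_words.len().min(next_words.len()).min(10);
    let mut skip = 0;
    // look for the longest run of words shared across the seam
    for k in (1..=max_ovl).rev() {
        if prev_words[prev_words.len() - k..] == next_words[..k] {
            skip = k;
            break;
        }
    }
    format!("{} {}", prev, next_words[skip..].join(" "))
}
```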

EzraEllette commented 1 month ago

/attempt #431

mlasy commented 1 month ago

i noticed that the existing whisper-large setup only works with english. Could we support different languages or different whisper models as well?

louis030195 commented 1 month ago

> /attempt #431


any update on this?

i want to add diarization using https://github.com/thewh1teagle/pyannote-rs but it might overlap with this issue

EzraEllette commented 1 month ago

Diarization is something I want from screenpipe as well, along with speaker verification. I meant to cancel my attempt on this since I haven't had the time to work on it, but algora's cancel button doesn't work.

louis030195 commented 1 month ago

Oh okay, I'll have a look at this issue + diarization, etc then

EzraEllette commented 1 month ago

@louis030195 Are you actively working on this? I will have some time to get started tonight.

louis030195 commented 1 month ago

@EzraEllette a bit, i did a simple unit test to measure accuracy on short wav files, but it doesn't fully reflect real screenpipe usage

https://github.com/mediar-ai/screenpipe/blob/main/screenpipe-audio/tests/accuracy_test.rs
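
for reference, a sketch of the kind of metric such a test can use, a normalized Levenshtein similarity (illustrative, not necessarily what accuracy_test.rs computes):

```rust
/// similarity in [0, 1] between expected and actual transcripts,
/// based on Levenshtein edit distance over characters
fn similarity(expected: &str, actual: &str) -> f64 {
    let a: Vec<char> = expected.to_lowercase().chars().collect();
    let b: Vec<char> = actual.to_lowercase().chars().collect();
    let (m, n) = (a.len(), b.len());
    // dp[i][j] = edits to turn a[..i] into b[..j]
    let mut dp = vec![vec![0usize; n + 1]; m + 1];
    for i in 0..=m {
        dp[i][0] = i;
    }
    for j in 0..=n {
        dp[0][j] = j;
    }
    for i in 1..=m {
        for j in 1..=n {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            dp[i][j] = (dp[i - 1][j] + 1)
                .min(dp[i][j - 1] + 1)
                .min(dp[i - 1][j - 1] + cost);
        }
    }
    1.0 - dp[m][n] as f64 / m.max(n).max(1) as f64
}
```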

i improved accuracy a bit using audio normalization, from 57% to 62% (deepgram is at 82%), using whisper turbo
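
for reference, a peak-normalization sketch, one plausible form of the normalization mentioned above (not necessarily the exact change):

```rust
/// scale samples so the loudest one hits `target_peak` (e.g. 0.95);
/// quiet speech gets boosted before it reaches whisper
fn normalize_peak(samples: &mut [f32], target_peak: f32) {
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    if peak > f32::EPSILON {
        let gain = target_peak / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}
```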

been thinking of switching from candle to whisper.cpp (bindings) and adding diarization etc. by just copy-pasting the code here https://github.com/thewh1teagle/vibe/blob/main/core/src/transcribe.rs

something else we'd like to do in the future is to stream transcriptions through a websocket, for example from the server, for different use cases. that might affect the architecture, but i think the main priority rn is to improve quality
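
for illustration, a hypothetical fan-out endpoint, assuming axum 0.7 and a tokio broadcast channel (none of this exists in screenpipe yet, names are made up):

```rust
use axum::{
    extract::{
        ws::{Message, WebSocket, WebSocketUpgrade},
        State,
    },
    response::IntoResponse,
    routing::get,
    Router,
};
use tokio::sync::broadcast;

// the transcription loop would publish each finished transcript on the
// broadcast channel; every connected websocket client gets a copy
async fn ws_handler(
    ws: WebSocketUpgrade,
    State(tx): State<broadcast::Sender<String>>,
) -> impl IntoResponse {
    ws.on_upgrade(move |socket| stream_transcriptions(socket, tx.subscribe()))
}

async fn stream_transcriptions(mut socket: WebSocket, mut rx: broadcast::Receiver<String>) {
    while let Ok(text) = rx.recv().await {
        if socket.send(Message::Text(text)).await.is_err() {
            break; // client disconnected
        }
    }
}

fn router(tx: broadcast::Sender<String>) -> Router {
    Router::new()
        .route("/ws/transcriptions", get(ws_handler))
        .with_state(tx)
}
```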

i won't have much time left today, and tomorrow morning i've got a bunch of calls, so feel free to try things

EzraEllette commented 1 month ago

I have been meaning to make a pipe, but I need realtime data.

I'll pull what you have and see what I can do. Also, I'm going to get a baseline on whisper-cpp's accuracy and go from there.

EzraEllette commented 1 month ago

@louis030195 I'm seeing ~5% improvement with spectral subtraction, using the last 100ms frame of unknown status as the noise estimate. I might try using the last few hundred ms rather than just one frame, but for now, here it is: [image]
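
A minimal single-frame sketch of the technique, assuming the rustfft crate (the frame handling and zero floor are illustrative, not my exact implementation):

```rust
use rustfft::{num_complex::Complex, FftPlanner};

/// subtract an estimated noise magnitude spectrum (e.g. from the trailing
/// 100ms of the previous chunk) from the signal spectrum, keeping the
/// signal's phase and flooring magnitudes at zero
fn spectral_subtract(signal: &[f32], noise: &[f32]) -> Vec<f32> {
    let n = signal.len().min(noise.len());
    let mut planner = FftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(n);
    let ifft = planner.plan_fft_inverse(n);

    let to_complex = |xs: &[f32]| -> Vec<Complex<f32>> {
        xs[..n].iter().map(|&x| Complex::new(x, 0.0)).collect()
    };
    let mut sig = to_complex(signal);
    let mut noi = to_complex(noise);
    fft.process(&mut sig);
    fft.process(&mut noi);

    // per-bin magnitude subtraction, preserving phase
    for (s, nz) in sig.iter_mut().zip(noi.iter()) {
        let mag = (s.norm() - nz.norm()).max(0.0);
        *s = Complex::from_polar(mag, s.arg());
    }

    ifft.process(&mut sig);
    // rustfft does not normalize the inverse transform, so divide by n
    sig.iter().map(|c| c.re / n as f32).collect()
}
```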

EzraEllette commented 1 month ago

I'm implementing dynamic range compression to see if that helps.

EzraEllette commented 1 month ago

not seeing a difference

NicodemPL commented 1 month ago

Strange. I usually get very good results with whisper-large, but when working with long files (15min+). Have you tried exceeding the 30s time window, with a 2s overlap?

EzraEllette commented 1 month ago

There are errors in the middle of the transcripts so I am focusing on those through audio preprocessing.

EzraEllette commented 1 month ago

I should mention that I changed the sinc interpolation to cubic, which is drastically slower than linear. I updated my PR to reflect that.

I'm trying some other sampling changes, but I'm doubtful they'll improve anything.
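
For context, this is roughly the knob in question, assuming rubato's sinc resampler API (the surrounding parameter values are illustrative):

```rust
use rubato::{
    Resampler, SincFixedIn, SincInterpolationParameters, SincInterpolationType, WindowFunction,
};

/// build a mono sinc resampler; the interpolation field is the
/// linear-vs-cubic trade-off discussed above
fn make_resampler(from_hz: f64, to_hz: f64, chunk_size: usize) -> SincFixedIn<f32> {
    let params = SincInterpolationParameters {
        sinc_len: 256,
        f_cutoff: 0.95,
        oversampling_factor: 256,
        // was Linear; Cubic is drastically slower but more accurate
        interpolation: SincInterpolationType::Cubic,
        window: WindowFunction::BlackmanHarris2,
    };
    SincFixedIn::<f32>::new(to_hz / from_hz, 2.0, params, chunk_size, 1)
        .expect("failed to build resampler")
}
```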

EzraEllette commented 1 month ago

Deepgram result: [images]

At least we beat deepgram on the last sample 😆

EzraEllette commented 1 month ago

It's worth mentioning that the Levenshtein distance would be lower if we sanitized the transcription output to remove the hallucinations and timestamps.

I think we can assume that if a transcript has two segments with the same timestamp, the shorter segment should be removed. Other than that, I'm not sure what you want to do with the timestamps.
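
A sketch of that cleanup (the Segment shape is hypothetical):

```rust
struct Segment {
    start_ms: u64,
    text: String,
}

/// when two segments share a timestamp, keep only the longer one (the
/// shorter is likely a hallucinated duplicate), then join the surviving
/// texts without timestamps
fn sanitize(mut segments: Vec<Segment>) -> String {
    segments.sort_by_key(|s| s.start_ms);
    let mut kept: Vec<Segment> = Vec::new();
    for seg in segments {
        match kept.last_mut() {
            Some(last) if last.start_ms == seg.start_ms => {
                if seg.text.len() > last.text.len() {
                    *last = seg; // replace the shorter duplicate
                }
            }
            _ => kept.push(seg),
        }
    }
    kept.iter()
        .map(|s| s.text.trim())
        .collect::<Vec<_>>()
        .join(" ")
}
```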

louis030195 commented 1 month ago

i think one of the common issues with screenpipe is when someone speaks, stops, then starts again within a 30s chunk: whisper will hallucinate "Thank you" in the silences. that's one thing we should solve somehow, through some audio processing hacks i guess

regarding current accuracy metrics, i think we could have a second unit test that contains audio recordings from screenpipe (like 4 of them) and either write the expected transcripts manually or use some online transcription service to create them (which makes mistakes too). for the current unit test, some of the expected transcripts were produced with deepgram, which also makes a few mistakes

honestly, even as a human i sometimes struggle to transcribe some of the audio recordings when people have strong accents

also, something else we could eventually do is fix transcripts with an LLM in real time, but i'd expect it to be a hard task to do well: it shouldn't take more than 1gb of memory, shouldn't add hallucinations, shouldn't overload the GPU/CPU, etc.

another reason i wanted to switch to whisper.cpp is that it has more features, like the initial prompt:

https://github.com/ggerganov/whisper.cpp/discussions/348

https://github.com/thewh1teagle/vibe/blob/28b17d2dd9f1ffea148731be3e12d7a4efd433f4/core/src/transcribe.rs#L114

which we could expose as a screenpipe cli arg and in the app ui settings, like "yo my name is louis, sometimes i talk about screenpipe, i have a french accent so make sure to take this into account ..."
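
a sketch of wiring that through the bindings, assuming the whisper-rs crate (the cli arg / ui plumbing is omitted, and this isn't screenpipe code):

```rust
use whisper_rs::{FullParams, SamplingStrategy, WhisperContext, WhisperContextParameters};

/// transcribe 16khz mono f32 samples, biasing decoding with a
/// user-provided initial prompt (names, jargon, accent hints)
fn transcribe_with_prompt(
    model_path: &str,
    samples: &[f32],
    user_prompt: &str, // e.g. from a cli arg or app setting
) -> Result<String, Box<dyn std::error::Error>> {
    let ctx = WhisperContext::new_with_params(model_path, WhisperContextParameters::default())?;
    let mut state = ctx.create_state()?;

    let mut params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
    params.set_initial_prompt(user_prompt);

    state.full(params, samples)?;

    let mut text = String::new();
    for i in 0..state.full_n_segments()? {
        text.push_str(&state.full_get_segment_text(i)?);
    }
    Ok(text)
}
```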

candle, meanwhile, is really barebones: we sadly have to reimplement everything ourselves, and we don't have time to turn into AI researchers at this point

i guess diarization would also improve accuracy a little, by running transcription only on frames that belong to a specific voice

some rough thoughts. what do you think the next steps are, @EzraEllette?

EzraEllette commented 1 month ago

@louis030195 I have a couple meetings tonight but I'll give you some information afterwards.

Right now it makes more sense to use tools that have more features and are actively maintained by other developers when possible.

I'll contact you once my meetings are finished.

louis030195 commented 1 month ago

@EzraEllette

do you want to refactor to always record audio + send chunks for transcription?

also interested in whether there could be a way to stream audio + transcription through the API, for extension reasons

also the #170 use case is important

EzraEllette commented 1 month ago

> @EzraEllette
>
> do you want to refactor to always record audio + send chunks for transcription?
>
> also interested in whether there could be a way to stream audio + transcription through the API, for extension reasons
>
> also the #170 use case is important

Yes, I want to make that refactor and explore streaming.

louis030195 commented 1 month ago

adding some context

some user feedback:

[image]

some users had issues with language, e.g. #451, but i think #469 would solve it?

diarization: https://github.com/thewh1teagle/pyannote-rs - can probably slightly increase accuracy too

other issues with audio:

  • deepgram does not work with macos display audio (output) for me (i think it's been broken for weeks)
  • for some windows users, transcription does not work at all #374
  • another windows audio issue, not sure what he meant: [image]

on my side, i want to prioritize having high-quality data infrastructure for audio that ideally works across OSes (macOS, Windows at least); UI things are a lower priority

EzraEllette commented 1 month ago

Speaker Identification and Diarization will be a large undertaking.

Chunking the audio and overlapping is working for now.

Here are some of my thoughts about streaming audio data:

  • Refactoring audio recording to stream the audio data will enable other parts of screenpipe to ingest audio recording data.

  • in addition to streaming, instead of cutting off the audio at exactly X seconds, wait after X seconds for a pause (using VAD) before starting a new chunk. That way transcriptions won't end in the middle of a sentence. Overlap is effective, but it may not be necessary with streaming and using VAD to find a good break in the audio (see the sketch below).
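
A rough sketch of that cut-point search; is_speech is a hypothetical stand-in for whatever per-frame VAD call we end up using:

```rust
/// hypothetical per-frame VAD; a naive energy gate as a placeholder
fn is_speech(frame: &[f32]) -> bool {
    let energy: f32 = frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32;
    energy > 1e-4
}

/// after the target duration, keep scanning frame by frame until a
/// non-speech frame appears, then cut there so sentences aren't split
fn find_cut_point(samples: &[f32], sample_rate: usize, target_secs: usize) -> usize {
    let frame = sample_rate / 100; // 10ms frames
    let mut pos = (target_secs * sample_rate).min(samples.len());
    while pos + frame <= samples.len() && is_speech(&samples[pos..pos + frame]) {
        pos += frame;
    }
    pos // everything after `pos` starts the next chunk
}
```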

louis030195 commented 1 month ago

> Speaker Identification and Diarization will be a large undertaking.
>
> Chunking the audio and overlapping is working for now.
>
> Here are some of my thoughts about streaming audio data:
>
> • Refactoring audio recording to stream the audio data will enable other parts of screenpipe to ingest audio recording data.
> • in addition to streaming, instead of cutting off the audio at exactly X seconds, wait after X seconds for a pause (using VAD) before starting a new chunk. That way transcriptions won't end in the middle of a sentence. Overlap is effective, but it may not be necessary with streaming and using VAD to find a good break in the audio.

agree, let's not do speaker identification and diarization for now

agree with streaming