Closed thewh1teagle closed 1 month ago
Hey, I sent the message on the Rust Audio Discord. I'm very new to digital audio processing and Rust, but happy to help wherever I can.
I came to very similar solutions in my research and even got a test app working using ScreenCaptureKit + whisper.cpp.
To convert the audio for whisper, I use only the left channel and take only every 3rd frame from the buffer, but obviously that's a hack and it needs to be done properly.
Also found this example for handling streaming in the whisper cpp repo: https://github.com/ggerganov/whisper.cpp/blob/master/examples/stream/stream.cpp
Another repo we could check is OBS; they handle it well. I took a look but wasn't able to understand it (I'm a C noob).
Let's work on this together, I really love the idea of Vibe!
@quinn-eschenbach
Sounds great! Currently, the challenges I'm facing are:
I don't have knowledge of the subject, but I've been following Whisper-related Colabs and believe the following link might help: https://github.com/Sourasky-DHLAB/Whisper Especially notebook 4: https://github.com/Sourasky-DHLAB/Whisper/blob/main/Colab/Whisper_Speaker_Diarization.ipynb
I've made some progress; it will be added soon.
As a starting point, it will be possible to simply record the mic, the speakers, or both. When the recording is finished, it is transcribed just like any other audio file in Vibe.
https://github.com/thewh1teagle/vibe/tree/feat/record
https://github.com/zmwangx/rust-ffmpeg/discussions/73 (For merging audio files after recording)
https://github.com/RustAudio/cpal/issues/876 (Probably will be added soon for macOS)
https://discord.com/channels/548404410439696434/1248439946411249695
https://github.com/zmwangx/rust-ffmpeg/discussions/103
https://discord.com/channels/590254806208217089/590257558317695005/1248448876462211123
Merged. We just need to add audio merging support with ffmpeg and, in the future, update cpal once it supports ScreenCaptureKit.
Added in 2.0.2
Goal
Transcribe system audio, the microphone, or both, and preview it in real time
Research
It may be possible to follow the approaches in https://github.com/CapSoftware/Cap
Useful Rust crate: https://github.com/helmerapp/scap
Perhaps on macOS: https://github.com/svtlabs/screencapturekit-rs (ScreenCaptureKit)
Graphics.capture on Windows (https://github.com/NiiightmareXD/windows-capture)
macOS app which provides a way to capture system audio using the ScreenCaptureKit API: https://github.com/Mnpn/Azayaka
Microsoft answer on how Audacity manages to record audio from the speakers (TLDR: Windows WASAPI): https://answers.microsoft.com/en-us/windows/forum/all/how-record-speaker-output-windows-10/251bb695-5170-4a35-a90f-42d9f6f3345a
macOS sample: https://gist.github.com/thewh1teagle/d02415b9768fd816a780f9af6a3f2bdb
Loopback added to cpal: https://github.com/RustAudio/cpal/pull/478 (working on Windows)
Additional questions: How do we get system audio and the microphone into a single stream at the same time? What about Linux?
TLDR
The Rust crate cpal provides a way to get an audio stream from the microphone(s).
On Windows it also provides an audio stream from the default output device (system audio).
On macOS we should use screencapturekit-rs and provide a stream which is equivalent to a cpal stream.
If two streams are used, mix them by adding both (simple addition of the sample values works).
Push them to whisper in a loop.
Mixing can introduce synchronization issues (if it's two different sound cards, etc.); RtAudio handles that better and can be used through rtaudio-rs.
whisper.cpp expects single-channel (mono) audio at a 16 kHz sample rate with 16-bit samples.
Resampling is probably needed, and converting from stereo to mono is done by taking the mean of both channels.
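The conversion steps above (mean of both channels, additive mixing, resampling to 16 kHz, 16-bit output) could look roughly like the following sketch. It uses only the standard library; all function names are made up for illustration, the input is assumed to be interleaved f32 stereo in the range [-1.0, 1.0], and the resampler is a naive linear interpolation without an anti-aliasing filter, so a proper resampling crate would likely be a better choice in practice.

```rust
/// Average interleaved stereo frames (L, R, L, R, ...) into mono.
fn stereo_to_mono(interleaved: &[f32]) -> Vec<f32> {
    interleaved
        .chunks_exact(2)
        .map(|frame| (frame[0] + frame[1]) * 0.5)
        .collect()
}

/// Mix two mono streams by adding samples, clamped to [-1.0, 1.0].
/// The shorter stream is treated as silence past its end.
fn mix(a: &[f32], b: &[f32]) -> Vec<f32> {
    let len = a.len().max(b.len());
    (0..len)
        .map(|i| {
            let x = a.get(i).copied().unwrap_or(0.0);
            let y = b.get(i).copied().unwrap_or(0.0);
            (x + y).clamp(-1.0, 1.0)
        })
        .collect()
}

/// Naive linear-interpolation resampler (assumption: good enough for a
/// prototype; it does no low-pass filtering before downsampling).
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = (pos.floor() as usize).min(input.len() - 1);
            let frac = (pos - idx as f64) as f32;
            let a = input[idx];
            let b = *input.get(idx + 1).unwrap_or(&a);
            a + (b - a) * frac
        })
        .collect()
}

/// Convert f32 samples in [-1.0, 1.0] to 16-bit PCM.
fn to_i16(samples: &[f32]) -> Vec<i16> {
    samples
        .iter()
        .map(|s| (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16)
        .collect()
}

/// Hypothetical glue: stereo mic + stereo system audio -> mono 16 kHz i16.
/// Assumes both inputs share the same sample rate `input_hz`.
fn prepare_for_whisper(mic: &[f32], system: &[f32], input_hz: u32) -> Vec<i16> {
    let mixed = mix(&stereo_to_mono(mic), &stereo_to_mono(system));
    to_i16(&resample_linear(&mixed, input_hz, 16_000))
}
```

With two 48 kHz stereo buffers, `prepare_for_whisper(&mic, &system, 48_000)` would return one mono 16 kHz buffer. Clamping after the addition is one simple way to avoid overflow when both sources are loud; scaling each source down before adding is another option.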
Simple approach
Record from the speakers and the mic concurrently and write to a file every 5-10 seconds, cutting at the quietest position.
Push the resulting paths to a queue (each item will be one or two paths).
Another task iterates over the queue, merges the files if needed, and transcribes them.
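One possible way to pick the cut position is to look for the window with the lowest energy in the buffered audio. The sketch below is an assumption about how that could be done, not what Vibe or whisper.cpp actually do; the function name and the window size parameter are hypothetical, and the input is assumed to be mono f32 samples.

```rust
/// Return the start index of the quietest `window`-sample span in `samples`
/// (lowest sum of squared samples), or None if the buffer is too short.
/// Ties resolve to the earliest span.
fn quietest_window(samples: &[f32], window: usize) -> Option<usize> {
    if window == 0 || samples.len() < window {
        return None;
    }
    // Energy of the first window, accumulated in f64 to limit rounding drift.
    let mut energy: f64 = samples[..window]
        .iter()
        .map(|s| (*s as f64) * (*s as f64))
        .sum();
    let mut best = (0usize, energy);
    // Slide the window one sample at a time, updating the running energy.
    for start in 1..=samples.len() - window {
        let leaving = samples[start - 1] as f64;
        let entering = samples[start + window - 1] as f64;
        energy += entering * entering - leaving * leaving;
        if energy < best.1 {
            best = (start, energy);
        }
    }
    Some(best.0)
}
```

A recorder could call this on the last few seconds of the buffer and split the chunk in the middle of the returned window, so that words are less likely to be cut in half.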
https://github.com/ggerganov/whisper.cpp/tree/master/examples/stream#sliding-window-mode-with-vad
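The queue from the simple approach could be sketched with a standard library channel: the recorder sends one or two paths per chunk, and a worker thread merges and transcribes them in order. Everything here is illustrative. `Chunk`, `merge`, `transcribe`, `run_worker` and `demo` are hypothetical names, and the merge and transcribe bodies are placeholders standing in for ffmpeg and whisper.

```rust
use std::path::{Path, PathBuf};
use std::sync::mpsc;
use std::thread;

/// One queue item: a chunk recorded from one source, or from both at once.
enum Chunk {
    Single(PathBuf),
    Pair { mic: PathBuf, speakers: PathBuf },
}

/// Placeholder for merging two recordings into one file (for example with
/// ffmpeg). Hypothetical: this stub just returns the mic path.
fn merge(mic: PathBuf, _speakers: PathBuf) -> PathBuf {
    mic
}

/// Placeholder for the real whisper call.
fn transcribe(path: &Path) -> String {
    format!("<transcript of {}>", path.display())
}

/// Drain the queue in order, merging pairs before transcribing.
/// The loop ends when every sender has been dropped.
fn run_worker(rx: mpsc::Receiver<Chunk>) -> Vec<String> {
    rx.into_iter()
        .map(|chunk| {
            let path = match chunk {
                Chunk::Single(p) => p,
                Chunk::Pair { mic, speakers } => merge(mic, speakers),
            };
            transcribe(&path)
        })
        .collect()
}

/// Simulate a recorder pushing two chunks while the worker runs concurrently.
fn demo() -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || run_worker(rx));
    tx.send(Chunk::Single(PathBuf::from("chunk-0.wav"))).unwrap();
    tx.send(Chunk::Pair {
        mic: PathBuf::from("chunk-1-mic.wav"),
        speakers: PathBuf::from("chunk-1-speakers.wav"),
    })
    .unwrap();
    drop(tx); // closing the sender lets the worker finish
    worker.join().unwrap()
}
```

Because the channel preserves send order, the transcripts come back in recording order even if transcription is slower than recording; the queue simply grows until the worker catches up.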