Implemented Real-time Audio Transcription with Speaker Diarization

ethanzrd commented 1 year ago

Changelog:

Client Modifications:

Replaced the usage of react-mic with AudioContext for audio processing.
Eliminated the interim data variable and all its references, leading to a more efficient and cleaner codebase.
Implemented WebSocket functionality for continuous audio streaming to the server, significantly reducing latency and improving real-time transcription capabilities.
Removed the variable responsible for checking transcription status and its references, as the server now seamlessly receives audio chunks.
Refactored components to accommodate the new configuration, making the client-side implementation more maintainable.

Server Modifications:

Completely rebuilt the server as an event-driven asynchronous server, replacing the previous synchronous server implementation (app.py file removed). This change brings scalability and performance improvements.
Integrated the Diart library for real-time speaker diarization, allowing for the identification of different speakers during transcription.
Incorporated the faster-whisper and stable-ts libraries, which align transcriptions with the audio to produce accurate timestamps. This improves speaker diarization, keeps memory usage low, and keeps transcription fast even with larger beam sizes.
Enhanced transcription context by maintaining a buffer of past transcriptions for each client, enabling model conditioning through Whisper's "Prompting" feature.
The stream now transcribes data in batches, optimizing the transcription process for better efficiency and speed.
Implemented handling of sudden disconnects and stream stops, ensuring all remaining audio data is transcribed and sent to the client before the connection closes.
Added detailed logging of the server's operations using Python's built-in logging library, facilitating debugging and monitoring.
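
To illustrate the context buffer, batching, and disconnect handling described above, here is a minimal sketch. It is not the code from this PR: the `ClientSession` class, its parameter names, and the byte-count batching rule are assumptions made for illustration. The `transcribe` callable is a stand-in for whatever wraps the model (for example, a function passing the prompt on as faster-whisper's `initial_prompt`).

```python
from collections import deque
from typing import Callable, Deque, List


class ClientSession:
    """Hypothetical per-client state: pending audio plus recent text for prompting."""

    def __init__(self, transcribe: Callable[[bytes, str], str],
                 batch_bytes: int = 4, context_size: int = 3) -> None:
        self.transcribe = transcribe      # stand-in for the real model call
        self.batch_bytes = batch_bytes    # audio is transcribed in batches of this size
        self.pending = bytearray()        # audio received but not yet transcribed
        # Only the last `context_size` transcriptions are kept as context.
        self.history: Deque[str] = deque(maxlen=context_size)

    def prompt(self) -> str:
        """Join past transcriptions into a prompt used to condition the model."""
        return " ".join(self.history)

    def _transcribe_batch(self, audio: bytes) -> str:
        text = self.transcribe(audio, self.prompt())
        if text:
            self.history.append(text)
        return text

    def add_chunk(self, chunk: bytes) -> List[str]:
        """Buffer an incoming chunk and transcribe every complete batch."""
        self.pending.extend(chunk)
        results = []
        while len(self.pending) >= self.batch_bytes:
            batch = bytes(self.pending[:self.batch_bytes])
            del self.pending[:self.batch_bytes]
            results.append(self._transcribe_batch(batch))
        return results

    def flush(self) -> List[str]:
        """On disconnect or stream stop, transcribe whatever audio is left."""
        if not self.pending:
            return []
        batch = bytes(self.pending)
        self.pending.clear()
        return [self._transcribe_batch(batch)]
```

In a server built this way, `add_chunk` would be called for each WebSocket message and `flush` from the disconnect handler, so the client still receives the tail of the audio before the connection closes.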

Future modifications:

Implement VAD to prevent transcription/diarization hallucinations.
Add multi-client support to enable concurrent audio streaming and transcription for multiple users, making the system more versatile and accessible.
Modify the speaker-embedding brain of the diarization pipeline.
Add support for more Whisper implementations.

Credits:

Color Your Captions: Streamlining Live Transcriptions With “diart” and OpenAI’s Whisper - This allowed for the well thought-out implementation of Diart and Whisper I've used here. Thanks to Juanma Coria (the creator of Diart)!

ethanzrd commented 1 year ago

Known bugs:

When tested in a Google Colab environment with the large-v2 model, the socket appears to disconnect and reconnect during the model's initialization. This is a behavior I've found to be associated with blocking operations, so I will try to resolve it by offloading the model's initialization.
The app may occasionally generate a transcription that contains nothing but the same word repeated multiple times. I've yet to understand why. This does not seem to happen in a Google Colab environment.

Edit: The latest commit, which enhances responsiveness, should fix the first problem. If the problem is indeed caused by socket timeouts, this should solve it. It should also pave the way to multi-client support, which will be explored on July 28th.
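
For readers unfamiliar with the offloading approach mentioned above, here is a minimal sketch using only the standard library. It does not reproduce the actual commit: `load_model_blocking` is a hypothetical stand-in that sleeps instead of loading a model, and `heartbeat` merely simulates the event loop staying free to service the socket.

```python
import asyncio
import time


def load_model_blocking(name: str) -> dict:
    """Stand-in for a slow, blocking model load (hypothetical)."""
    time.sleep(0.2)
    return {"name": name}


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Simulates the event loop staying responsive, e.g. answering socket pings."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def init_without_blocking(name: str):
    ticks: list = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    loop = asyncio.get_running_loop()
    # Run the blocking load in the default thread pool so the loop keeps running.
    model = await loop.run_in_executor(None, load_model_blocking, name)
    stop.set()
    await beat
    return model, len(ticks)
```

Calling `load_model_blocking` directly inside the coroutine would stall the loop for the full load time, which is the kind of stall that can make a socket time out. With `run_in_executor`, the heartbeat keeps ticking while the load runs in a worker thread.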

saharmor commented 1 year ago

It doesn't have to be through Conda. You can also install via other means (e.g. pip install where appropriate).

It's somewhat overkill to install Conda for this purpose alone (it weighs 3GB and installs many things).


On Fri, Aug 04, 2023 at 11:12:06, Ethan Zerad < @.*** > wrote:

@.**** commented on this pull request.

In install_playground.sh (https://github.com/saharmor/whisper-playground/pull/23#discussion_r1284715363):

\ No newline at end of file +cd ../backend

That's what Diart requires:

conda install portaudio pysoundfile ffmpeg -c conda-forge


saharmor / whisper-playground

Implemented Real-time Audio Transcription with Speaker Diarization #23