Open · tomsmith8 opened 6 months ago
Hi @tomsmith8, I would like to help out with this!
The first step is to access the media streams from the Jitsi call. This involves tapping into the WebRTC APIs to get at the video, audio, and screen-sharing streams, making sure each one is correctly identified and accessible.
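As a rough illustration, here is a minimal TypeScript sketch of acquiring the three kinds of streams with plain browser APIs (`getUserMedia`, `getDisplayMedia`, and the peer connection's `track` event). In a real Jitsi Meet integration you would more likely obtain these through lib-jitsi-meet's own track objects; the function names below are only illustrative.

```typescript
// Minimal sketch using standard browser APIs (not Jitsi-specific helpers).

interface CapturedStreams {
  camera: MediaStream;
  screen: MediaStream;
}

async function acquireStreams(): Promise<CapturedStreams> {
  // Local camera + microphone.
  const camera = await navigator.mediaDevices.getUserMedia({
    video: true,
    audio: true,
  });

  // Screen-share stream (the user picks a window/tab/monitor).
  const screen = await navigator.mediaDevices.getDisplayMedia({
    video: true,
  });

  return { camera, screen };
}

// Remote participants' media arrives on the peer connection's `track` event.
function watchRemoteTracks(pc: RTCPeerConnection): void {
  pc.addEventListener("track", (event: RTCTrackEvent) => {
    const [stream] = event.streams;
    console.log(`remote ${event.track.kind} track added`, stream?.id);
  });
}
```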
Video frames can be captured via a canvas context: draw the current video frame onto an HTML canvas element, then call getImageData periodically to extract frames for real-time processing. A small buffering step keeps this running smoothly without interfering with the active call.
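A minimal sketch of that loop, assuming a 500 ms capture interval and a caller-supplied `onFrame` callback (both illustrative choices; `requestVideoFrameCallback` or `canvas.toBlob` are reasonable alternatives depending on what the downstream API expects):

```typescript
// Play the MediaStream into a detached <video> element, then periodically
// copy the current frame onto a canvas and read back the pixels.
function captureFrames(
  stream: MediaStream,
  onFrame: (frame: ImageData) => void,
  intervalMs = 500,
): () => void {
  const video = document.createElement("video");
  video.srcObject = stream;
  video.muted = true;
  void video.play();

  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d", { willReadFrequently: true });
  if (!ctx) throw new Error("2D canvas context unavailable");

  const timer = window.setInterval(() => {
    if (video.videoWidth === 0) return; // no frame decoded yet
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    ctx.drawImage(video, 0, 0);
    onFrame(ctx.getImageData(0, 0, canvas.width, canvas.height));
  }, intervalMs);

  // Caller invokes the returned function to stop capturing.
  return () => window.clearInterval(timer);
}
```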
We'll leverage the Web Audio API to capture audio. An AudioWorklet (or the older, now-deprecated ScriptProcessorNode) lets us process audio samples in real time; the captured sample blocks can then be stored in a buffer ready for speech recognition.
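A minimal sketch of the AudioWorklet route. The worklet source is inlined as a Blob URL purely to keep the example self-contained (in a real build it would live in its own module file), and the node name `pcm-tap` and the `onSamples` callback are illustrative:

```typescript
// Worklet processor: forwards each 128-sample block to the main thread.
const workletSource = `
  class PcmTapProcessor extends AudioWorkletProcessor {
    process(inputs) {
      const channel = inputs[0]?.[0];
      if (channel) {
        // Copy, because the underlying buffer is reused by the audio engine.
        this.port.postMessage(channel.slice(0));
      }
      return true; // keep the processor alive
    }
  }
  registerProcessor("pcm-tap", PcmTapProcessor);
`;

async function captureAudio(
  stream: MediaStream,
  onSamples: (samples: Float32Array, sampleRate: number) => void,
): Promise<AudioContext> {
  const ctx = new AudioContext();
  const moduleUrl = URL.createObjectURL(
    new Blob([workletSource], { type: "application/javascript" }),
  );
  await ctx.audioWorklet.addModule(moduleUrl);

  const source = ctx.createMediaStreamSource(stream);
  const tap = new AudioWorkletNode(ctx, "pcm-tap");
  tap.port.onmessage = (e: MessageEvent<Float32Array>) =>
    onSamples(e.data, ctx.sampleRate);

  source.connect(tap);
  // No connection to ctx.destination is needed; we only want the samples.
  return ctx;
}
```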
@JZ1999 Any update on providing a documented solution with Jitsi/Jibri/WebRTC for real-time streaming?
@tomsmith8 Can I work on this? My Sphinx username is asterisk32 https://community.sphinx.chat/p/cmv6tnqtu2rk819pr5mg/assigned
@hkarani sure - we're looking for a proposed solution for this bounty first. Once we have a solution we're happy with, we'll look to break it out into further bounties (implementation).
Description
Provide us with a how-to solution for extracting periodic screen frames and audio for real-time speech recognition from a Jitsi WebRTC call. The extracted frames and audio will be processed for further analysis via another API.
Objectives
Suggested tasks, to be reviewed for anything missing:
Access Media Streams
Capture Video Frames
Capture and Process Audio
Ensure the audio data is correctly buffered and ready for speech recognition (see the buffering sketch after this list).
Provide a detailed explanation of whether the implementation process described above is the correct approach, plus alternative or additional notes on how to process audio, video, and screen recording in real time.
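To make the buffering task concrete, here is a sketch of a stage that accumulates the Float32 blocks from the audio-capture sketch above and emits fixed-size 16 kHz, 16-bit PCM chunks. The 16 kHz rate, the 1-second chunk size, and the naive decimation resampler are all assumptions; the right values and a proper low-pass/resample step depend on which speech-recognition API is eventually chosen.

```typescript
// Accumulates Float32 audio blocks and emits fixed-duration PCM16 chunks.
// Target rate and chunk length are assumed values for a typical STT API.
class SpeechBuffer {
  private samples: number[] = [];

  constructor(
    private inputRate: number,
    private onChunk: (pcm16: Int16Array) => void,
    private targetRate = 16000,
    private chunkSeconds = 1,
  ) {}

  push(block: Float32Array): void {
    // Naive decimation resampler; a real pipeline should low-pass filter first.
    const ratio = this.inputRate / this.targetRate;
    for (let i = 0; i < block.length; i += ratio) {
      this.samples.push(block[Math.floor(i)]);
    }
    const chunkSize = this.targetRate * this.chunkSeconds;
    while (this.samples.length >= chunkSize) {
      const chunk = this.samples.splice(0, chunkSize);
      // Convert float [-1, 1] samples to clamped 16-bit signed integers.
      this.onChunk(
        Int16Array.from(chunk, (s) =>
          Math.max(-32768, Math.min(32767, Math.round(s * 32767))),
        ),
      );
    }
  }
}
```

Wiring it up would look like `captureAudio(stream, (samples, rate) => speechBuffer.push(samples))`, reusing the audio-capture sketch above.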
Acceptance Criteria