livekit / python-sdks

LiveKit real-time and server SDKs for Python
https://docs.livekit.io
Apache License 2.0

Audio frames streaming and OPUS packets #96

Closed JustNello closed 8 months ago

JustNello commented 8 months ago

Hello, thanks for this project :)

I'd like to transcribe an audio track with Deepgram, but I have some issues.

The application: The client is made of LiveKit React Components (i.e. LiveKitRoom and AudioConference), with REDundant encoding disabled when a client joins a room, as described in the docs.

The server uses this Python SDK and was implemented starting from the Whisper example; in my case, the "whisper_task" has been replaced by a "deepgram_task", in this gist.

Issue: I don't think I understand how the AudioFrame (from rtc.AudioFrame) encodes data. I'm new to audio streaming altogether, and that may be the cause of the issue. I know that the audio format is OPUS, but:

In other words, what does bytes(frame.data) return? Is it the OPUS packet? I'm not able to inspect it with a packet inspector.

Thank you in advance for any help you may give, Luca

theomonnom commented 8 months ago

Hey Luca! The frames you receive from the AudioStream are raw signed PCM. Looking at Deepgram's docs, they do support linear16. I've used Deepgram before; I think you can just connect to their websocket and send the frames you receive from LiveKit directly. (Also, don't forget to use the right sample rate.)
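For reference, a minimal sketch of what such a deepgram_task could look like, assuming the AudioStream async-iterator API and Deepgram's live-streaming websocket with linear16 query parameters. DEEPGRAM_URL, DEEPGRAM_API_KEY, and the exact shape of the iterated items are assumptions — check the LiveKit and Deepgram docs for your versions:

```python
import asyncio
import json

import websockets  # pip install websockets

from livekit import rtc

# Hypothetical values: check Deepgram's live-streaming docs for the exact
# query parameters, and make sure sample_rate matches the frames you receive.
DEEPGRAM_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?encoding=linear16&sample_rate=48000&channels=1"
)
DEEPGRAM_API_KEY = "your-deepgram-api-key"


async def deepgram_task(audio_stream: rtc.AudioStream) -> None:
    """Forward raw PCM frames from a LiveKit AudioStream to Deepgram."""
    async with websockets.connect(
        DEEPGRAM_URL,
        # newer `websockets` releases name this argument `additional_headers`
        extra_headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
    ) as ws:

        async def print_transcripts() -> None:
            # Deepgram replies with JSON transcription results.
            async for message in ws:
                print(json.loads(message))

        recv_task = asyncio.create_task(print_transcripts())

        # Depending on the SDK version, iterating the stream yields either
        # AudioFrame objects directly or events carrying a `.frame` attribute.
        async for event in audio_stream:
            frame = getattr(event, "frame", event)
            # frame.data is already raw signed 16-bit PCM: send it as-is.
            await ws.send(bytes(frame.data))

        recv_task.cancel()
```

The sample rate in the URL has to match frame.sample_rate (48 kHz is common for WebRTC audio, but verify it on the frames you actually receive).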

JustNello commented 8 months ago

Awesome, it works 😀

One last question to improve my understanding: rtc.RemoteTrackPublication.mime_type yields audio/opus. When is the audio converted to signed PCM?

theomonnom commented 8 months ago

The mime_type represents the codec used while the media is transmitted to the recipient. Once the media is received, libwebrtc decodes it immediately, so the frames you consume are already PCM.
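To make the distinction concrete, a small sketch assuming the track_subscribed callback signature used in the SDK examples (treat the exact names as assumptions): the publication reports the transport codec, while the frames you iterate are already decoded.

```python
import asyncio

from livekit import rtc


def on_track_subscribed(
    track: rtc.Track,
    publication: rtc.RemoteTrackPublication,
    participant: rtc.RemoteParticipant,
) -> None:
    # The publication's mime_type describes the codec used on the wire.
    print(publication.mime_type)  # e.g. "audio/opus"

    if track.kind == rtc.TrackKind.KIND_AUDIO:
        # The frames coming out of an AudioStream are already decoded PCM.
        asyncio.create_task(inspect_frames(rtc.AudioStream(track)))


async def inspect_frames(audio_stream: rtc.AudioStream) -> None:
    async for event in audio_stream:
        frame = getattr(event, "frame", event)
        print(frame.sample_rate, frame.num_channels, len(frame.data))
        break
```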