openai / openai-python

The official Python library for the OpenAI API
https://pypi.org/project/openai/
Apache License 2.0

Support for real-time audio streaming using chunked transfer encoding for Whisper #1025

Open Shulyaka opened 8 months ago

Shulyaka commented 8 months ago

Confirm this is a feature request for the Python library and not the underlying OpenAI API.

Describe the feature or improvement you're requesting

It would be nice to start transferring audio data as soon as it becomes available, for real-time voice recognition. A similar feature already exists for TTS: https://platform.openai.com/docs/guides/text-to-speech/streaming-real-time-audio

Please note, I am not asking for the transcript to be available before the speech has ended. I only want the data transfer to start earlier.

Additional context

HTTP supports sending a file in chunks without knowing its length in advance. A WAV header does require a length; however, 0xFFFFFFFF (i.e. the maximum length) works fine with Whisper (I checked).
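A minimal sketch of the WAV trick described above, assuming 16-bit PCM input: it builds a standard 44-byte header whose two length fields are set to 0xFFFFFFFF, then yields audio chunks as they arrive. The names `wav_header` and `stream_wav` and the default sample format are illustrative assumptions, not part of the openai library.

```python
import struct

UNKNOWN_LENGTH = 0xFFFFFFFF  # placeholder used when the final size is not known yet


def wav_header(sample_rate=16000, channels=1, bits_per_sample=16):
    """Build a 44-byte PCM WAV header with both length fields set to the maximum."""
    block_align = channels * bits_per_sample // 8   # bytes per sample frame
    byte_rate = sample_rate * block_align           # bytes per second of audio
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", UNKNOWN_LENGTH, b"WAVE",           # RIFF chunk, size unknown
        b"fmt ", 16, 1,                             # fmt chunk: 16 bytes, format 1 = PCM
        channels, sample_rate, byte_rate, block_align, bits_per_sample,
        b"data", UNKNOWN_LENGTH,                    # data chunk, size unknown
    )


def stream_wav(pcm_chunks, **fmt):
    """Yield the header first, then each non-empty PCM chunk as it becomes available."""
    yield wav_header(**fmt)
    for chunk in pcm_chunks:
        if chunk:
            yield chunk
```

A generator like `stream_wav(microphone_chunks)` (where `microphone_chunks` is a hypothetical iterable of raw PCM bytes) is the kind of body an HTTP client could send with chunked transfer encoding. Whether the transcription endpoint accepts it through this library is exactly what this issue asks for.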

rattrayalex commented 8 months ago

Just to clarify, your request pertains to uploading audio files with the openai.audio.transcriptions.create() method, correct?

We do want to support streaming request bodies in this way, but unfortunately I'm not sure that we'll be able to get to it soon.

Shulyaka commented 8 months ago

Yes, correct. Thank you!

Ga68 commented 2 months ago

It's not exactly what you're describing, but it's related, and I figured those on this thread might find it useful. It links a whole chain together so that you can stream the audio response to a prompt. It uses three threads: the first streams the text reply and splits it into phrases, which are enqueued for TTS; the second runs TTS on each phrase as it completes; the third starts playing each phrase aloud as soon as it has been TTS'd. The final effect is much like the ChatGPT app, where you get a "streaming audio response" to your question and don't have to wait for the full text to come back before you can start listening. I'm sure what's here could be improved; it's primarily designed to show it all put together in a terminal.

https://gist.github.com/Ga68/3862688ab55b9d9b41256572b1fedc67
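The three-thread chain described in the comment above can be sketched roughly as follows. This is not the code from the gist: the function names are hypothetical, and the `tts` and `play` callables are stand-ins that a real program would replace with a speech-synthesis call and an audio player.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of a stream


def split_phrases(tokens, phrase_q):
    """Thread 1: group streamed text fragments into phrases ending in . ! or ?"""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((".", "!", "?")):
            phrase_q.put(buf.strip())
            buf = ""
    if buf.strip():
        phrase_q.put(buf.strip())  # flush any trailing partial phrase
    phrase_q.put(_DONE)


def synthesize(phrase_q, audio_q, tts):
    """Thread 2: convert each phrase to audio as soon as it is complete."""
    while (phrase := phrase_q.get()) is not _DONE:
        audio_q.put(tts(phrase))
    audio_q.put(_DONE)


def play_all(audio_q, play):
    """Thread 3: play each audio segment in order as it becomes ready."""
    while (audio := audio_q.get()) is not _DONE:
        play(audio)


def run_pipeline(tokens, tts, play):
    """Wire the three stages together with two queues and wait for completion."""
    phrase_q, audio_q = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=split_phrases, args=(tokens, phrase_q)),
        threading.Thread(target=synthesize, args=(phrase_q, audio_q, tts)),
        threading.Thread(target=play_all, args=(audio_q, play)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

For example, `run_pipeline(text_stream, tts=my_tts, play=my_player)` would start playing the first phrase while later phrases are still being generated; `text_stream`, `my_tts`, and `my_player` are placeholders for whatever text source, TTS call, and playback function are in use.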