sanchit-gandhi / whisper-jax

JAX implementation of OpenAI's Whisper model for up to 70x speed-up on TPU.
Apache License 2.0

realtime transcriptions #3

Open eschmidbauer opened 1 year ago

eschmidbauer commented 1 year ago

Hi, appreciate you sharing this framework; it looks very useful. I'm wondering if it's possible to do real-time transcription using `from transformers.pipelines.audio_utils import ffmpeg_microphone_live`, as detailed in this PR:

https://github.com/huggingface/transformers/pull/21196
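
For context, the usage pattern from that PR looks roughly like the following (untested on my end; the model choice and chunk sizes are just placeholders):

from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live

# a standard transformers ASR pipeline; the model is just an example
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# yields a stream of microphone audio chunks via ffmpeg
mic = ffmpeg_microphone_live(
    sampling_rate=pipe.feature_extractor.sampling_rate,
    chunk_length_s=5.0,   # length of each audio window
    stream_chunk_s=1.0,   # how often a new (partial) chunk is yielded
)

for item in pipe(mic):
    print(item["text"])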

JonathanFly commented 1 year ago

> Hi, appreciate you sharing this framework; it looks very useful. I'm wondering if it's possible to do real-time transcription using `from transformers.pipelines.audio_utils import ffmpeg_microphone_live`, as detailed in this PR:
>
> huggingface/transformers#21196

I'll try to test this today; you can just feed in segments in a loop to benchmark what it would do when integrated into something that takes live audio. You lose the batching benefits of course, which are the main speed-up in whisper-jax. Perhaps you could send overlapping audio segments in a batch, as https://github.com/openai/whisper/discussions/608 does, and batch the audio you are re-running for the updated, corrected transcription?
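
Something like this minimal loop is what I mean, assuming the FlaxWhisperPipline from the README (the dict input format mirrors the transformers ASR pipeline, and the audio array here is just a stand-in for a live feed):

import time
import numpy as np
from whisper_jax import FlaxWhisperPipline

pipeline = FlaxWhisperPipline("openai/whisper-large-v2")

sample_rate = 16000
segment_s = 30  # seconds of audio per segment fed to the model
audio = np.zeros(5 * segment_s * sample_rate, dtype=np.float32)  # stand-in for live audio

# feed fixed-length segments one at a time, timing each call
for start in range(0, len(audio), segment_s * sample_rate):
    segment = audio[start : start + segment_s * sample_rate]
    t0 = time.time()
    # dict input with raw array + sampling rate, as in the transformers ASR pipeline
    out = pipeline({"array": segment, "sampling_rate": sample_rate})
    print(f"{time.time() - t0:.2f}s -> {out['text']}")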

I've never used JAX before. Does anyone know if there are performance differences between the various CUDA/cuDNN wheels? I've already got CUDA 11.8 and cuDNN 8.8; is there any point in testing the CUDA 12.0 wheel, or is it not going to be any faster?

Edit: I'm getting a billion CUDA_ERROR_OUT_OF_MEMORY errors with anything bigger than the small model. I assumed it was broken, but it actually still works with the larger models, even though it looks like everything is blowing up.

Tronic commented 1 year ago

Streaming the audio in and getting low-latency transcription output would be nice, yes. Part of the problem is that you don't really know whether you need to listen longer before outputting text (especially in translate mode). But a way to stream audio in and stream text out continuously would definitely be nice, and would be more correct and faster than chunking manually (e.g. by silence detection).

sanchit-gandhi commented 1 year ago

Hey @eschmidbauer, @JonathanFly, @Tronic,

I've not tried this, but we'd need to re-work the Flax Whisper Pipeline to accept a generator and return a generator for this to work. It could look something like:

def live_transcription(mic, batch_size, task, return_timestamps):
    # chunk the incoming audio stream and group the chunks into batches
    dataloader = pipeline.preprocess_batch(mic, batch_size=batch_size)
    for batch in dataloader:
        # run the Flax Whisper model on each batch of audio chunks
        tokens = pipeline.forward(batch, batch_size=batch_size, task=task, return_timestamps=return_timestamps)
        # decode the predicted tokens back to text (and timestamps if requested)
        post_processed = pipeline.postprocess([tokens], return_timestamps=return_timestamps)
        yield post_processed

And then use the code snippet from the transformers PR, with one change:

- for item in pipe(mic):
+ for item in live_transcription(mic, batch_size=16, task="transcribe", return_timestamps=False):
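
Putting it together with ffmpeg_microphone_live, the whole thing might look roughly like this (untested; whether preprocess_batch can consume the mic generator directly would need checking, and the chunk sizes are placeholders):

from transformers.pipelines.audio_utils import ffmpeg_microphone_live
from whisper_jax import FlaxWhisperPipline

pipeline = FlaxWhisperPipline("openai/whisper-large-v2")

# stream microphone audio via ffmpeg at Whisper's 16 kHz sampling rate
mic = ffmpeg_microphone_live(
    sampling_rate=16000,
    chunk_length_s=30.0,  # audio window per model call
    stream_chunk_s=1.0,   # how often new audio is yielded
)

for item in live_transcription(mic, batch_size=16, task="transcribe", return_timestamps=False):
    print(item["text"])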

creatorrr commented 1 year ago

@sanchit-gandhi I'd be happy to help with this. Any pointers?

torshak-mozyora commented 1 year ago

Is anybody working on this? Or could somebody guide me?

rodrigoGA commented 1 year ago

Perhaps it could be integrated with this https://github.com/ufal/whisper_streaming

Srishti1111 commented 1 month ago

Hi, it would be great if Whisper JAX could be used for live streaming transcription. Is there any work going on for that?

FerLuisxd commented 1 month ago

+1 to this! Since this is the fastest Whisper implementation, it would be good to use it for real-time transcription.