sanchit-gandhi / whisper-jax

JAX implementation of OpenAI's Whisper model for up to 70x speed-up on TPU.

Speaker diarization? #25

Open · RLinnae opened this issue 1 year ago

RLinnae commented 1 year ago

Is there a recommended method to implement speaker diarization with this whisper solution?

PierreVannier commented 1 year ago

Haha, I was asking myself the same thing at the same moment ;-)

Cordo-van-Saviour commented 1 year ago

This is like the #1 feature request for any Whisper implementation, and it seems it's hard to do 😞

MahmoudAshraf97 commented 1 year ago

I have a repo here (https://github.com/MahmoudAshraf97/whisper-diarization) that does this. I explored using whisper-jax instead of faster-whisper, but the results weren't promising on GPU (RTX 3070 Ti).

sanchit-gandhi commented 1 year ago

Hey all! As far as I'm aware, there isn't a dedicated speaker diarization model in JAX. If anyone knows of one, we could get the sentence level timestamps from Whisper JAX, and the speaker turn timestamps from the diarization model, and then segment the transcript based on the two sets of timestamps (we do something similar in Speechbox, where we have both models in PyTorch: https://github.com/huggingface/speechbox/tree/main#asr-with-speaker-diarization)
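
Roughly, that segmentation step could look something like the sketch below (not tested, just to illustrate the idea: it assumes the ASR side gives you (start, end, text) chunks and the diarization side gives you (start, end, speaker) turns, and assigns each chunk to the speaker whose turn overlaps it the most):

# a rough sketch, not tested: assign each transcript chunk to the speaker
# whose diarization turn overlaps it the most
def assign_speakers(asr_chunks, speaker_turns):
    # asr_chunks: list of (start, end, text) from the ASR model
    # speaker_turns: list of (start, end, speaker) from the diarization model
    diarized = []
    for start, end, text in asr_chunks:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in speaker_turns:
            overlap = min(end, turn_end) - max(start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        # chunks that overlap no speaker turn keep speaker=None
        diarized.append((best_speaker, start, end, text))
    return diarized

Speechbox implements a more careful version of this, but that's the core of it.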

Without a speaker diarization model in JAX, we would have to run the Whisper model in JAX, but the speaker diarization model in PyTorch -> I'm not sure how feasible it is to have a JAX and PyTorch model running on the same device. Probably the best thing to do here would be to run them sequentially (one after the other), or have one GPU running speaker diarization in PyTorch, and one TPU running speech transcription in JAX, and then transfer the results between the two (I think this is the current fastest way you could do it).

7k50 commented 1 year ago

> If anyone knows of one, we could get the sentence level timestamps from Whisper JAX, and the speaker turn timestamps from the diarization model, and then segment the transcript based on the two sets of timestamps

I'm just an amateur interested in this topic, but I've had luck so far with this solution, which uses pyannote.audio to perform diarization: https://github.com/m-bain/whisperX

I believe this also aims to achieve the same thing: https://github.com/MahmoudAshraf97/whisper-diarization

Additionally, I've tried this solution and it also does the job, though I don't know how well maintained it is (I was able to adapt the code for Google Colab and run it there). It also uses pyannote: https://huggingface.co/spaces/vumichien/Whisper_speaker_diarization

PierreVannier commented 1 year ago

@7k50, I've played with pyannote.audio, which seems great, although installation is a hassle on an M1 (compatibility issues with Torch and all...), but it seems like one of the best solutions for diarization.
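
For reference, getting the speaker-turn timestamps out of pyannote.audio only takes a few lines (you need a Hugging Face token that has access to the pyannote models; the file name and token here are just placeholders):

from pyannote.audio import Pipeline

# requires a Hugging Face access token that can download the pyannote models
diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="YOUR_HF_TOKEN")

# run diarization and print the speaker turns
diarization = diarization_pipeline("audio.mp3")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")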

text2sql commented 1 year ago

I ran this interview between three speakers in Spanish through Whisper and got the transcript with timecodes. I submitted it to GPT, roughly describing who the speakers are, and asked it to break the transcript down accordingly. I got a perfect result back. One way to deal with longer transcripts is to break the transcript into multiple files, run a loop that prompts GPT for each chunk, and then patch it all together (see the sketch after the excerpt below). I am looking for a local solution, without involving any outside LLMs, for doctors' and lawyers' records.

Here is an excerpt from that interview:

Interviewer: Welcome Hayek and Banderas, it's a pleasure to have you both here today. How are you both feeling?

Hayek: ÂĄHola! Estamos muy emocionados de estar aquĂ­. Gracias por invitarnos. (Hello! We are very excited to be here. Thank you for inviting us.)

Banderas: SĂ­, estamos encantados de compartir nuestro tiempo contigo y hablar sobre nuestra Ășltima pelĂ­cula. (Yes, we are delighted to share our time with you and talk about our latest movie.)

Interviewer: I'm glad to hear that. Let's dive right in. Can you both share your experiences working on your latest project together?

Hayek: Claro, fue increĂ­ble volver a trabajar con Antonio despuĂ©s de tanto tiempo. Siempre es un placer, y creo que nuestra quĂ­mica en la pantalla es aĂșn mĂĄs fuerte que antes. (Of course, it was amazing to work with Antonio again after so long. It's always a pleasure, and I think our on-screen chemistry is even stronger than before.)

Banderas: Estoy de acuerdo. Salma y yo nos llevamos muy bien y siempre nos divertimos mucho en el set. AdemĂĄs, creo que nuestra amistad fuera de la pantalla se traduce en una gran quĂ­mica en la pantalla. (I agree. Salma and I get along very well and we always have a lot of fun on set. Also, I think our off-screen friendship translates into great on-screen chemistry.)
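
The chunk-and-prompt loop I described would look roughly like this (just a sketch: llm_complete is a placeholder for whatever model you end up calling, local or hosted, and the prompt wording and chunk size are purely illustrative):

def diarize_with_llm(transcript_lines, speaker_hint, llm_complete, chunk_size=40):
    # transcript_lines: the timestamped transcript, one line per segment
    # speaker_hint: a rough description of who the speakers are
    # llm_complete: placeholder for any function that takes a prompt string
    #               and returns the model's text response
    labelled_chunks = []
    for i in range(0, len(transcript_lines), chunk_size):
        chunk = "\n".join(transcript_lines[i:i + chunk_size])
        prompt = (
            f"{speaker_hint}\n\n"
            "Label each line of this timestamped transcript with the speaker's name, "
            "keeping the timestamps unchanged:\n\n"
            f"{chunk}"
        )
        labelled_chunks.append(llm_complete(prompt))
    # patch the labelled chunks back together
    return "\n".join(labelled_chunks)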

sanchit-gandhi commented 1 year ago

These are all great speaker diarization models, but they all run in PyTorch only. As mentioned before, if we want to maximise the performance of Whisper JAX, we'll need a speaker diarization model that runs in JAX as well.

Otherwise, if we're happy running the speaker diarization model in PyTorch (slower than JAX), we can try using the SpeechBox implementation of Whisper + Speaker Diarization: https://github.com/huggingface/speechbox/tree/main#asr-with-speaker-diarization

Here, we'd run the Whisper model in JAX (Whisper JAX), and the speaker diarization model in PyTorch (pyannote.audio). We'd then merge the outputs of the two to get our diarised text. A code snippet for this would be:

from pyannote.audio import Pipeline
from whisper_jax import FlaxWhisperPipeline
from speechbox import ASRDiarizationPipeline
import jax.numpy as jnp

# your Hugging Face access token (needed to download the pyannote models)
use_auth_token = "YOUR_HF_TOKEN"

# speaker diarization runs in PyTorch (pyannote.audio)
diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token=use_auth_token)

# speech transcription runs in JAX (Whisper JAX)
asr_pipeline = FlaxWhisperPipeline("openai/whisper-small", dtype=jnp.float16, batch_size=16)

# combine the two with Speechbox
pipeline = ASRDiarizationPipeline(asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline)

# do a compilation step of the FlaxWhisperPipeline to get it out of the way
text = asr_pipeline("audio.mp3", return_timestamps=True)

# now we can use our combined pipeline for ASR + Speaker Diarization
diarized_text = pipeline("audio.mp3", return_timestamps=True)
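
If it helps, the diarized output should be a list of segments that you can then print out with something along these lines (this assumes each segment is a dict with "speaker", "text" and "timestamp" keys; check the Speechbox README for the exact schema):

# assumes each segment is a dict with "speaker", "text" and "timestamp" keys
# (check the Speechbox README for the exact output schema)
for segment in diarized_text:
    start, end = segment["timestamp"]
    print(f"[{start:.1f}s - {end:.1f}s] {segment['speaker']}: {segment['text']}")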

ElmyMaty commented 12 months ago

Hey, thanks for the developments so far. I'm getting an error after implementing the previously mentioned code.

ERROR: TypeError: SpeakerDiarization.apply() got an unexpected keyword argument 'return_timestamps'

Should I not use this parameter?

Is there a way to fix this somehow?
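
One thing that might be worth trying (just a guess on my side: it looks like return_timestamps is being forwarded to pyannote's SpeakerDiarization.apply(), which doesn't accept it) is to only pass return_timestamps to the warm-up call on the Whisper pipeline and drop it from the combined call:

# warm up / compile the Whisper JAX pipeline with timestamps enabled
text = asr_pipeline("audio.mp3", return_timestamps=True)

# call the combined pipeline without return_timestamps, since that kwarg
# appears to be forwarded to pyannote's SpeakerDiarization.apply()
diarized_text = pipeline("audio.mp3")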

ElmyMaty commented 12 months ago

Is it something to do with Speechbox? I see that there has been a merge related to it, but it doesn't seem to be applied in the code? Or am I missing something here.

The Speechbox demo is down as well, with an error: "...got an unexpected keyword argument 'token'"