Open RLinnae opened 1 year ago
Haha, I was asking myself the same thing at the same moment ;-)
This is like the #1 feature request for any Whisper implementation, and it seems it's hard to do.
I have a repo here that does this. I explored using whisper-jax instead of faster-whisper, but the results weren't promising on GPU (RTX 3070 Ti).
Hey all! As far as I'm aware, there isn't a dedicated speaker diarization model in JAX. If anyone knows of one, we could get the sentence-level timestamps from Whisper JAX and the speaker-turn timestamps from the diarization model, and then segment the transcript based on the two sets of timestamps (we do something similar in Speechbox, where we have both models in PyTorch: https://github.com/huggingface/speechbox/tree/main#asr-with-speaker-diarization)
Without a speaker diarization model in JAX, we would have to run the Whisper model in JAX but the speaker diarization model in PyTorch. I'm not sure how feasible it is to have a JAX and a PyTorch model running on the same device. Probably the best thing to do here would be to run them sequentially (one after the other), or have one GPU running speaker diarization in PyTorch and one TPU running speech transcription in JAX, and then transfer the results between the two (I think this is currently the fastest way you could do it).
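To make the timestamp-based segmentation concrete, here is a minimal sketch of the merging step, assuming we already have Whisper chunks as `(start, end, text)` tuples and diarization output as `(start, end, speaker)` turns. The function and variable names are illustrative only, not part of either library:

```python
# Minimal sketch: attribute each ASR chunk to the speaker whose
# diarization turn overlaps it the most in time.
# Assumes asr_chunks = [(start, end, text), ...] from Whisper and
# turns = [(start, end, speaker), ...] from a diarization model.

def assign_speakers(asr_chunks, turns):
    labeled = []
    for start, end, text in asr_chunks:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(end, t_end) - max(start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, start, end, text))
    return labeled

asr_chunks = [(0.0, 3.2, "Welcome, how are you?"), (3.2, 6.0, "Great, thanks!")]
turns = [(0.0, 3.1, "SPEAKER_00"), (3.1, 6.0, "SPEAKER_01")]
print(assign_speakers(asr_chunks, turns))
```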
> If anyone knows of one, we could get the sentence-level timestamps from Whisper JAX and the speaker-turn timestamps from the diarization model, and then segment the transcript based on the two sets of timestamps
I'm just an amateur interested in this topic, but I've had luck so far with this solution which uses Pyannote.audio to perform diarization: https://github.com/m-bain/whisperX
I believe this also aims to achieve the same thing: https://github.com/MahmoudAshraf97/whisper-diarization
Additionally, I've tried this solution and it also does the job, though I don't know how well maintained it is (I was able to adapt the code for Google Colab and run it there); it also uses Pyannote: https://huggingface.co/spaces/vumichien/Whisper_speaker_diarization
@7k50, I've played with Pyannote.audio, which seems great, although installing it is a hassle on an M1 (compatibility issues with Torch and all...), but it seems to be one of the best solutions for diarization.
I ran an interview between three speakers in Spanish through Whisper and got the transcript with timecodes. I submitted it to GPT, roughly described who the speakers are, and asked it to break the transcript down accordingly. I got a perfect result back. One way to deal with longer transcripts is to break the transcript into multiple chunks, run a loop that prompts GPT on each chunk, and then patch it all together (a rough sketch of that loop follows the excerpt below). I am looking for a local solution, without involving any outside LLMs, for doctors' and lawyers' records.
Here is an excerpt from that interview:
`Interviewer: Welcome Hayek and Banderas, it's a pleasure to have you both here today. How are you both feeling?
Hayek: ¡Hola! Estamos muy emocionados de estar aquí. Gracias por invitarnos. (Hello! We are very excited to be here. Thank you for inviting us.)
Banderas: Sí, estamos encantados de compartir nuestro tiempo contigo y hablar sobre nuestra última película. (Yes, we are delighted to share our time with you and talk about our latest movie.)
Interviewer: I'm glad to hear that. Let's dive right in. Can you both share your experiences working on your latest project together?
Hayek: Claro, fue increíble volver a trabajar con Antonio después de tanto tiempo. Siempre es un placer, y creo que nuestra química en la pantalla es aún más fuerte que antes. (Of course, it was amazing to work with Antonio again after so long. It's always a pleasure, and I think our on-screen chemistry is even stronger than before.)
Banderas: Estoy de acuerdo. Salma y yo nos llevamos muy bien y siempre nos divertimos mucho en el set. Además, creo que nuestra amistad fuera de la pantalla se traduce en una gran química en la pantalla. (I agree. Salma and I get along very well and we always have a lot of fun on set. Also, I think our off-screen friendship translates into great on-screen chemistry.)`
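For reference, a minimal sketch of that chunk-and-prompt loop, assuming the openai v1 Python client; the model name, chunk size, and prompt wording are placeholders, not a tested recipe:

```python
# Illustrative sketch only: chunk a long transcript and ask an LLM to
# attribute speakers, then stitch the labeled chunks back together.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def diarize_with_llm(transcript: str, chunk_chars: int = 4000) -> str:
    chunks = [transcript[i:i + chunk_chars] for i in range(0, len(transcript), chunk_chars)]
    labeled_parts = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "There are three speakers: an interviewer, Hayek, and Banderas. Label each line of the transcript with its speaker."},
                {"role": "user", "content": chunk},
            ],
        )
        labeled_parts.append(response.choices[0].message.content)
    # Patch the labeled chunks back together.
    return "\n".join(labeled_parts)
```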
These are all great speaker diarization solutions, but they all run in PyTorch only - as mentioned before, if we want to maximise the performance of Whisper JAX, we'll need a speaker diarization model that runs in JAX as well.
Otherwise, if we're happy running the speaker diarization model in PyTorch (slower than JAX), we can try the Speechbox implementation of Whisper + speaker diarization: https://github.com/huggingface/speechbox/tree/main#asr-with-speaker-diarization
Here, we'd run the Whisper model in JAX (Whisper JAX) and the speaker diarization model in PyTorch (pyannote.audio). We'd then merge the outputs of the two to get our diarised text. A code snippet for this would be:

```python
from pyannote.audio import Pipeline
from whisper_jax import FlaxWhisperPipeline
from speechbox import ASRDiarizationPipeline
import jax.numpy as jnp

# pass your Hugging Face auth token (or True if you are logged in via huggingface-cli)
diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token=True)
asr_pipeline = FlaxWhisperPipeline("openai/whisper-small", dtype=jnp.float16, batch_size=16)
pipeline = ASRDiarizationPipeline(asr_pipeline=asr_pipeline, diarization_pipeline=diarization_pipeline)

# do a compilation step of the FlaxWhisperPipeline to get it out of the way
text = asr_pipeline("audio.mp3", return_timestamps=True)

# now we can use our combined pipeline for ASR + Speaker Diarization
diarized_text = pipeline("audio.mp3", return_timestamps=True)
```
Hey, thanks for the developments so far. I'm getting an error after implementing the previously mentioned code.
ERROR: TypeError: SpeakerDiarization.apply() got an unexpected keyword argument 'return_timestamps'
Should I not use this parameter?
Is there a way to fix this somehow?
Is it something to do with Speechbox? I see that there has been a merge related to it, but it doesn't seem to be applied in the code? Or am I missing something here.
The Speechbox demo is down as well, with an error: "...got an unexpected keyword argument 'token'"
Is there a recommended method to implement speaker diarization with this whisper solution?
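A possible workaround, assuming the TypeError comes from `ASRDiarizationPipeline` forwarding extra keyword arguments on to pyannote's `SpeakerDiarization` pipeline (which doesn't accept `return_timestamps`): drop that argument from the combined call, since the Speechbox pipeline requests timestamps from the ASR model internally. This is a sketch under that assumption, not a confirmed fix:

```python
# Sketch of a possible fix: call the combined pipeline without
# return_timestamps, so nothing unexpected is forwarded to
# pyannote's SpeakerDiarization.apply(). The combined pipeline
# already asks Whisper for timestamps on its own.
diarized_text = pipeline("audio.mp3")
```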