m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Diarization too slow #274

Open MitPitt opened 1 year ago

MitPitt commented 1 year ago

1 hour 30 minutes of audio were processing for over 1 hour in the diarization... stage. I'm using an RTX 3090.

I'm guessing --batch_size doesn't affect pyannote. A setting for pyannote's batch size would be very nice to have.

jzeller2011 commented 1 year ago

I'm having the same issue. From what I'm reading, the pyannote/speaker-diarization model is slow, but word-level segmentation may be slowing it down even more. I assume some factors affect this more than others (I think the number of speakers or the number of segments influences it the most, but that's just a guess). Looking at hardware usage during runtime, it looks like it's batching either one segment at a time or one word at a time, which would make sense, since we're chasing word-level timestamps with whisperx. The pyannote model reports a 2.5% realtime factor, which has definitely NOT been my experience, but may be the case if you run the raw audio through without segmentation. Maybe there's a way to count individual calls to the GPU to verify. I haven't found a workaround yet; let me know if you find something out.
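To put the numbers in this thread on one scale, here is a minimal sketch of the real-time-factor arithmetic (processing time divided by audio duration). The function names are illustrative and not part of whisperX or pyannote; the inputs are the figures quoted above.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time as a fraction of audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def expected_processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Return the processing time implied by a given real-time factor."""
    return audio_seconds * rtf


audio = 90 * 60  # 1 hour 30 minutes of audio, as in the original report

# Over 1 hour of diarization for 90 minutes of audio gives a lower bound
# on the observed factor (the report says "over 1 hour").
observed = real_time_factor(60 * 60, audio)

# Processing time that a 2.5% real-time factor would imply for the same file.
claimed = expected_processing_seconds(audio, 0.025)

print(f"observed RTF >= {observed:.2f}, expected at 2.5%: {claimed:.0f} s")
```

If the 2.5% figure held, the 90-minute file would diarize in a couple of minutes, so the reported run is more than an order of magnitude slower than that.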

moritzbrantner commented 1 year ago

I have the same issue.

DigilConfianz commented 1 year ago

https://github.com/m-bain/whisperX/issues/159#issuecomment-1540035916

m-bain commented 1 year ago

> 1 hour 30 minutes of audio were processing for over 1 hour in the diarization... stage. I'm using an RTX 3090.

That's very strange, it should not be that long, I would expect 5-10mins max. I suspect some bug here.

> I'm guessing --batch_size doesn't affect pyannote. A setting for pyannote's batch size would be very nice to have.

I would assume most of the time is spent in the clustering step, which can be recursive and can take long if it's not finding satisfactory cluster sizes.

> From what I'm reading, the pyannote/speaker-diarization model is slow, but word-level segmentation may be slowing it down even more.

Nah, the ASR and word-level segmentation are run independently of the diarization. The diarization is just running a standard pyannote pipeline, so word-level segmentation / whisperx batching shouldn't affect this.

geoglrb commented 1 year ago

@m-bain I'm also having extremely slow diarization. Using CLI.

Just now, to explore further, I also tried setting the --threads parameter to 50 to see if that would do something (I would prefer GPU!) and it is now making use of a variable number of threads, but well above four, which is what it had seemed to be limited to by default. There is still some GPU memory allocated even in the diarization stage, but not a ton. Very naive question: could things be slow because all of us have pyannote using CPU for some reason? Is there a way to specify that whisperx's pyannote must use GPU?

For reference, in case it helps:

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
2
>>> torch.version.cuda
'11.7'

sorgfresser commented 1 year ago

There is an issue regarding pyannote not using GPU, but it should not occur with whisperx. To read more on this, see pyannote/pyannote-audio#1354. It might have something to do with the device index though. Are both of your GPUs the same size? We're currently not passing device_index to the diarization, so we will simply do to('cuda') on loading the diarization model. This might be a problem when multiple GPUs are available.
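One way to rule out the multi-GPU question raised here is to expose only a single GPU to the process, so that a plain to('cuda') can only land on that device. The sketch below builds such an environment with the standard CUDA_VISIBLE_DEVICES variable; env_for_single_gpu is a hypothetical helper, not something whisperX provides, and whether this fixes the slowdown is untested.

```python
import os
from typing import Dict, Mapping, Optional


def env_for_single_gpu(
    gpu_index: int, base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    env = dict(os.environ if base_env is None else base_env)
    # CUDA renumbers the visible devices from 0, so inside the child process
    # 'cuda' (or 'cuda:0') refers to the GPU selected here.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Illustrative base environment; by default the current one is copied.
env = env_for_single_gpu(1, base_env={"PATH": "/usr/bin"})
print(env["CUDA_VISIBLE_DEVICES"])
```

The resulting dict can be passed as env= to subprocess.run when launching the CLI, or the variable can simply be exported in the shell before starting it. It has to be set before CUDA is initialized in the process.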

goneill commented 1 year ago

I am also having an extremely long, i.e. overnight, diarization on the command line. The transcription occurs, I get two failures in the align segment, and then diarization starts and I get the following messages:

Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.2. To apply the upgrade to your files permanently, run python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1. Bad things might happen unless you revert torch to 1.x.

I then left it running overnight and it was still in the same state.

davidas1 commented 1 year ago

Please try my suggestion in https://github.com/m-bain/whisperX/issues/399 and see if it helps you too. I'm getting around 30sec for diarization of 30 minute video using the standard model in the pyannote/speaker-diarization pipeline (speechbrain/spkrec-ecapa-voxceleb), and around 15sec if I change the embedding model to pyannote/embedding

DigilConfianz commented 1 year ago

@davidas1 There is a speed improvement when passing the whisper-loaded audio instead of the raw audio file, as you suggested. Thanks for that. How do I change the embedding model in code?

davidas1 commented 1 year ago

Changing the pyannote pipeline is a bit more involved. I'm using an offline pipeline as described in https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/applying_a_pipeline.ipynb. I had to patch whisperx a bit to allow working with a custom local pipeline. Using this method you can customize the pipeline by editing the config.yaml (change the "embedding" configuration to the desired model).
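As a rough illustration of the config edit described above, the sketch below swaps the "embedding" entry in a config held as a Python dict. The dict layout is an assumption modeled on a pyannote speaker-diarization config.yaml and is trimmed to the relevant keys, so compare it with your own file; with_embedding is a hypothetical helper, and reading or writing the actual YAML (for example with PyYAML) is not shown.

```python
import copy

# ASSUMED layout: mirrors the "pipeline -> params -> embedding" nesting of a
# pyannote speaker-diarization config.yaml. Check your local file's keys.
EXAMPLE_CONFIG = {
    "pipeline": {
        "name": "pyannote.audio.pipelines.SpeakerDiarization",
        "params": {
            "segmentation": "pyannote/segmentation",
            "embedding": "speechbrain/spkrec-ecapa-voxceleb",
        },
    },
}


def with_embedding(config: dict, embedding: str) -> dict:
    """Return a copy of the config with a different embedding model."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    updated["pipeline"]["params"]["embedding"] = embedding
    return updated


faster = with_embedding(EXAMPLE_CONFIG, "pyannote/embedding")
print(faster["pipeline"]["params"]["embedding"])
```

Other parameters in the config may have been tuned for the original embedding model, so diarization quality is worth re-checking after a swap.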

datacurse commented 1 year ago

> Please try my suggestion in #399 and see if it helps you too. I'm getting around 30sec for diarization of 30 minute video using the standard model in the pyannote/speaker-diarization pipeline (speechbrain/spkrec-ecapa-voxceleb), and around 15sec if I change the embedding model to pyannote/embedding

what??? that's crazy! here are my timings for a 30 minute long mp3:
transcribe time: 69 seconds
align time: 10 seconds
diarization: 24 seconds
around 90 seconds in total, like 3 times longer than yours, and that's excluding the initial model loadings.

could you please suggest something like a checklist for speeding things up? i also updated to get your recent patch and it did speed up my diarization exponentially

davidas1 commented 1 year ago

I wrote that diarization takes 30sec, not the entire pipeline - before the change the diarization took almost 2 minutes. Your timing looks great, other than the transcribe step that is faster on my setup, but that's probably due to the GPU you're using.
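Since the confusion here came from comparing a single stage against the whole pipeline, a small per-stage timer makes such comparisons unambiguous. This is a generic sketch using only the standard library; the stage names and the sleep calls are placeholders for the real transcribe, align, and diarize calls, which are not shown.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, store: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the stage raises, so partial runs are still measured.
        store[stage] = time.perf_counter() - start


timings: Dict[str, float] = {}

with timed("transcribe", timings):
    time.sleep(0.01)  # placeholder for the transcription call

with timed("diarize", timings):
    time.sleep(0.01)  # placeholder for the diarization call

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.2f} s")
print(f"total: {sum(timings.values()):.2f} s")
```

Model loading can be wrapped in its own stage so that one-off startup cost is reported separately from per-file processing time.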

datacurse commented 1 year ago

oooh i see that clears things. i got 4090 tho

dantheman0207 commented 1 year ago

I'm looking for some help or insight into why diarization is so slow for me.

I have a recording that is 1 minute and 14 seconds with two native English speakers and diarization takes 11 minutes and 49 seconds (transcription took 6 seconds). I'm running on a Mac mini with an M2 chip and 8GB of RAM. I assume in this case it's running on CPU although I'm not sure with the Apple silicon. I'm basically using the default example on the README for transcribing and diarizing a file.

With a longer file (27 minutes and 39 seconds), with multiple speakers, it takes 2 minutes and 47 seconds to transcribe, 1 minute and 6 seconds to align but 12 hours, 48 minutes to diarize!

awhillas commented 9 months ago

Same here. I'm getting 2-3% GPU utilization and 0.9 GB of GPU memory used?
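To log figures like these while diarization is running, rather than eyeballing them, the sketch below polls nvidia-smi through its CSV query interface and parses the result. The helper names are illustrative and the sample string is made up; it assumes an NVIDIA GPU with nvidia-smi on the PATH.

```python
import subprocess
from typing import List, Tuple

# Per-GPU utilization (%) and used memory (MiB), without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(output: str) -> List[Tuple[int, int]]:
    """Parse 'util, mem' lines into (utilization_percent, memory_mib) pairs."""
    stats = []
    for line in output.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        # Some devices report "[N/A]" for a field; int() would raise on those.
        stats.append((int(util), int(mem)))
    return stats


def read_gpu_stats() -> List[Tuple[int, int]]:
    """Run nvidia-smi once and return one (util, mem) pair per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Illustrative sample output for a two-GPU machine (not a real measurement).
sample = "3, 921\n0, 12\n"
print(parse_gpu_stats(sample))
```

Calling read_gpu_stats() in a loop from a second terminal during the diarization stage shows whether utilization stays near zero for the whole stage or only in parts of it.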

SergeiKarulin commented 5 months ago

Same issue. Almost no GPU utilization, and 1.5 hours of diarization per 60 minutes of audio.

eplinux commented 5 months ago

> same issue. Almost no GPU utilization and 1.5 hour of diarization per 60 minutes audio.

same here

eplinux commented 4 months ago

I also noticed that there seems to be some throttling affecting GPU utilization on Windows 11. As soon as the terminal window is in the background, GPU utilization drops dramatically.

prkumar112451 commented 4 months ago

@m-bain Diarization is key whenever multiple speakers are having a conversation. I've been exploring different ways to speed up the transcription & diarization pipeline.

I can see lots of options for speeding up transcription, like CTranslate2, batching, Flash Attention, Distil-Whisper, and compute type (float32, float16),

but I'm finding very limited options for speeding up diarization.

For a 20-minute audio file, with optimizations we are able to get transcriptions in around 35 seconds. But diarizing the same 20 minutes of audio takes roughly 1 minute via NeMo and around 45 seconds via pyannote.

Could you please share any direction we could follow to speed up the diarization process?