pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Very low GPU usage (5%) and slow diarization #1403

Closed Meowzz95 closed 1 week ago

Meowzz95 commented 1 year ago

I noticed that the diarization process is VERY slow: it took about an hour for a 2-hour audio file, and the time taken seems to grow much faster than linearly with the audio length.

I wanted to find out where the bottleneck is, so I observed the CPU and GPU usage.

[screenshots: CPU and GPU usage during diarization]

It seems that the GPU is under a very light load while CPU usage is almost maxed out (100% on one core).

Does anyone know how we can make better use of the GPU during diarization? Thanks!

github-actions[bot] commented 1 year ago

We found the following entry in the FAQ which you may find helpful:

Feel free to close this issue if you found an answer in the FAQ. Otherwise, please give us a little time to review.

This is an automated reply, generated by FAQtory

NikitaKononov commented 1 year ago

+1

RTX 4090 + Ryzen 5900X + 64 GB RAM. GPU utilization is <4%, CPU load ~80%.

3 hours of speech took 1 hour and 14 minutes to process.

TheNiggler commented 1 year ago

I looked into this a bit just now and noticed that not only is CPU usage high, but disk IO is high as well. Interrupting the running process a few times as a crude form of profiling reveals that the bottleneck is mostly in pyannote/audio/core/io.py, in the crop function, where it calls torchaudio.load to read tiny little slices of the file from disk, one by one.

Oddly, if you look at the except block below that call, it's written such that if loading a slice fails, it loads the entire file into memory and caches it instead. That seems like it should obviously be the default behavior: it would be way faster, and do you really usually process audio files so huge they won't fit into memory? Unfortunately, there also seems to be some other bug in this fallback path that causes it to spit out an incorrectly sized tensor and fail further up the stack, so just forcing execution into the except branch does not work. I didn't chase this bug down completely, but I'd guess it's due to the extra call to downmix_and_resample that happens inside self.__call__, or a side effect of having the "waveform" key present in the file object from then on, as that's referenced in a few other places too.

However, doing the same thing but calling torchaudio.load directly and caching the result under a key other than "waveform" (see image) works as expected and gives a speedup of several orders of magnitude. Note that even after this, CPU usage is still much higher than GPU usage, but processing is quick enough (about a minute per hour of audio on my machine) that it doesn't really matter.

I'm too lazy to go through the GitHub ritual of making a PR to fix this, but it seems like low-hanging fruit. I'd suggest someone with more knowledge of the codebase figure out why the fallback processing in the except block is failing, fix that, and then invert the logic so that the fallback (i.e. "load the whole file into memory") becomes the default, falling back to loading slice by slice only if the whole file won't fit in memory or is larger than some threshold. Or, maybe better yet, change it so it loads much larger slices every now and then and serves smaller slices out of those, only loading another large slice when the previous one has been depleted.

[screenshot: the hack applied to pyannote/audio/core/io.py]
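In code, the idea looks roughly like this (an illustrative sketch, not the exact patch; the cache key name is arbitrary, as long as it isn't "waveform"):

```python
# Inside crop() in pyannote/audio/core/io.py: instead of calling
# torchaudio.load() on a tiny slice of the file for every chunk,
# decode the whole file once and cache it on the file dict under a
# key other than "waveform" (which would trigger other code paths).
if "temp_hack_cache" not in file:
    file["temp_hack_cache"], _ = torchaudio.load(file["audio"])
data = file["temp_hack_cache"][:, start_frame:end_frame]
```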

tl;dr applying the hack shown above to pyannote/audio/core/io.py allowed me to process a 1-hour file in about 1 minute on a 4090, where before it was taking something like half an hour to an hour.

Meowzz95 commented 1 year ago

@TheNiggler thanks a lot for your investigation and the hack you shared, I’ll definitely give it a try! Hope someone from the author team will take a look!

mllife commented 1 year ago

Recently, SageMaker updated the default PyTorch kernel to py3.10 with CUDA 11.8, so pyannote is no longer working properly there.

NikitaKononov commented 1 year ago

> Recently, SageMaker updated the default PyTorch kernel to py3.10 with CUDA 11.8, so pyannote is no longer working properly there.

So what should we do?

mllife commented 1 year ago

> > Recently, SageMaker updated the default PyTorch kernel to py3.10 with CUDA 11.8, so pyannote is no longer working properly there.
>
> So what should we do?

I could not push to the current repo for some reason, so I forked the develop branch and added the changes suggested by @TheNiggler:

pip install git+https://github.com/mllife/pyannote-audio-118.git@822db88f573d7923d921dac11486f713c1729a09

@NikitaKononov this seems to be working for now. thanks @TheNiggler

NikitaKononov commented 1 year ago

> > > Recently, SageMaker updated the default PyTorch kernel to py3.10 with CUDA 11.8, so pyannote is no longer working properly there.
> >
> > So what should we do?
>
> I could not push to the current repo for some reason, so I forked the develop branch and added the changes suggested by @TheNiggler:
>
> pip install git+https://github.com/mllife/pyannote-audio-118.git@822db88f573d7923d921dac11486f713c1729a09
>
> @NikitaKononov this seems to be working for now. thanks @TheNiggler

Thank you very much!

g588928812 commented 1 year ago

> tl;dr applying the hack shown above to pyannote/audio/core/io.py allowed me to process a 1-hour file in about 1 minute on a 4090, where before it was taking something like half an hour to an hour.

Can confirm. THANK YOU!

Filimoa commented 1 year ago

Also seeing massive performance regressions. Unfortunately, @mllife's fork doesn't seem to improve the situation for me. I'm attaching a notebook that reproduces the issue: the 3-minute sample file included used to take 30 seconds and now takes upwards of 15 minutes on a GPU Colab instance.

I've spent some time downgrading dependencies but am still unable to improve the situation.

Note: the notebook uses the forked version: https://colab.research.google.com/drive/1-uizEpRXoiDlxcPjUlLe7H95h50iTUao?usp=sharing

hbredin commented 1 year ago

@Filimoa you should send the pipeline to the GPU. It runs on the CPU by default.

```python
import torch
diarization_pipeline.to(torch.device("cuda"))
```

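Something like this, for instance ("pyannote/speaker-diarization-3.1" and HF_TOKEN are stand-ins for your own pipeline and Hugging Face token):

```python
import torch
from pyannote.audio import Pipeline

# Load a pretrained pipeline; its models run on the CPU by default.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN,
)

# Move all of the pipeline's models to the GPU before inference.
pipeline.to(torch.device("cuda"))

diarization = pipeline("audio.wav")
```
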
Filimoa commented 1 year ago

@hbredin

I'm so stupid, thanks. Could have sworn I had this enabled.

Prakash2403 commented 1 year ago

I was getting:

```
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 80000 but got size 76898 for tensor number 11 in the list.
```

after applying the fix @TheNiggler suggested. I figured out it was due to fewer frames in the last chunk. Line 433 attempts to fix the issue by padding extra zeros, but it wasn't working for me.

I had to explicitly add the following code after the hack. Here's the complete fix:

```python
import torchaudio
import torch.nn.functional as F

# Decode the whole file once and cache it (the hack from above).
if "temp_hack_cache" not in file:
    file["temp_hack_cache"], _ = torchaudio.load(file["audio"])
data = file["temp_hack_cache"][:, start_frame:end_frame]

# The last chunk may come up short; zero-pad it to the expected length.
curr_frames = data.shape[1]
if curr_frames != num_frames:
    data = F.pad(data, pad=(0, num_frames - curr_frames))
```

miranda1000 commented 1 year ago

For me, the problem was running inference on a .mp3 instead of a .wav (from >1h down to ~1min).
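If mp3 decoding is the bottleneck, one option is converting ahead of time; a quick sketch using torchaudio (filenames are placeholders, and any converter works):

```python
import torchaudio

# Decode the mp3 once and write it back out as an uncompressed wav,
# so the pipeline doesn't pay the mp3 decoding cost during diarization.
waveform, sample_rate = torchaudio.load("input.mp3")
torchaudio.save("input.wav", waveform, sample_rate)
```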

geronimi73 commented 1 year ago

> It still uses nearly no GPU and a lot of CPU in my project.

Try using it through whisperX; that worked for me.

zuverschenken commented 9 months ago

The fixes mentioned in this thread didn't noticeably increase my CUDA utilisation. What did help was loading my files into memory before handing them to the pipeline, instead of giving the pipeline a file path. The guide is here, under the heading "Processing a file from memory":

https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/applying_a_pipeline.ipynb

jhmejia commented 8 months ago

> The fixes mentioned in this thread didn't noticeably increase my CUDA utilisation. What did help was loading my files into memory before handing them to the pipeline, instead of giving the pipeline a file path. The guide is here, under the heading "Processing a file from memory":
>
> https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/applying_a_pipeline.ipynb

This is what worked for me, cutting my time from approximately 18 minutes to around 10 seconds!

manish-kumar-iisc commented 8 months ago

```python
import torchaudio
from pyannote.audio import Pipeline

waveform, sample_rate = torchaudio.load(AUDIO_FILE)
audio_in_memory = {"waveform": waveform, "sample_rate": sample_rate}

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=access_token
)
diarization, speaker_embeddings = pipeline(audio_in_memory, return_embeddings=True)
```

This helped me a lot: good GPU utilization and much reduced inference time. Thanks @jhmejia!

stale[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.