pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Pyannote is 10 times slower than WhisperX with GPU utilization 10%: expected behavior or misconfiguration? #1652

Open chubin opened 5 months ago

chubin commented 5 months ago

Tested versions

pyannote.audio==3.1.1
pyannote.core==5.0.0
pyannote.database==5.0.1
pyannote.metrics==3.2.1
pyannote.pipeline==3.0.1

System information

Ubuntu 22.04, NVIDIA RTX A6000

Issue description

I am not sure if it is a bug, so please feel free to close it if it is expected behavior.

I am trying to diarize a large recording (approximately 60 minutes), and the diarization process takes 8.5 minutes:

real    8m40,982s
user    8m12,687s
sys     1m21,703s

Here is my code:

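import torch
from pyannote.audio import Pipeline

# hf_token must hold a valid Hugging Face access token (its definition is not shown in this post)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
)

# Run the pipeline on the GPU
pipeline.to(torch.device("cuda"))

# Diarize directly from the file path
diarization = pipeline("audio.wav")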

It does use the GPU during diarization, but at a low utilization level (~10%), while one CPU core stays at 100% the whole time.

When doing the diarization with whisperx, though, it takes just a minute, and GPU utilization is at full capacity.

However, the quality of diarization is slightly worse in this case (approximately 5% of text is attributed to wrong/non-existent speakers).

           duration   GPU-usage
pyannote   520.5s     10%
whisperx    75.0s     100%
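
For reference, the figures in the table can be turned into a real-time factor (processing time divided by audio duration) and a relative slowdown. This is only a sketch: it assumes the recording is exactly 60 minutes (the post says "approximately"), and the variable names are made up for illustration.

```python
# Durations taken from the table above, in seconds.
pyannote_seconds = 520.5
whisperx_seconds = 75.0

# Assumption: the recording is exactly 60 minutes long.
audio_seconds = 60 * 60

# Real-time factor: processing time per second of audio (lower is faster).
pyannote_rtf = pyannote_seconds / audio_seconds
whisperx_rtf = whisperx_seconds / audio_seconds

# How many times longer pyannote took than whisperx on this file.
slowdown = pyannote_seconds / whisperx_seconds

print(f"pyannote RTF: {pyannote_rtf:.3f}")
print(f"whisperx RTF: {whisperx_rtf:.3f}")
print(f"slowdown:     {slowdown:.2f}x")
```

With these numbers the slowdown comes out a little under 7x, so "an order of magnitude" is a rounded description of the gap.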

Pyannote diarization quality is just brilliant, but it takes an order of magnitude more time.

I suppose that I am doing something wrong, but I don't know what exactly.

Could you please point me in the right direction, or confirm that this behavior is expected?

[Screenshot: GPU utilization while using pure pyannote]

[Screenshot: GPU utilization when using whisperX]

Minimal reproduction example (MRE)

(not applicable)

hbredin commented 5 months ago

Would you mind sharing a link to a Google Colab that one can just click and run to reproduce the issue?

chubin commented 5 months ago

Unfortunately, I have no access to Google Colab from my Google Account (I can create a new account if needed), but as you can see the code is trivial.

I noticed that the problem disappears when I load the audio file using Audio:

from pyannote.audio import Audio
io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io("audio.mp3")

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

instead of loading audio.wav directly. The wav file (audio.wav) has the same sample rate (16000) though.
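
To compare the two input paths on the same file, each call could be timed separately. The helper below is a minimal sketch using only the standard library; `time_call` is a hypothetical name and not part of pyannote. The commented usage assumes `pipeline`, `waveform`, and `sample_rate` are defined as in the snippets above.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # CPU time of this process (user + sys)
    result = fn(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu


# Hypothetical usage with the objects from this thread:
#
#   _, wall_path, cpu_path = time_call(pipeline, "audio.wav")
#   _, wall_mem, cpu_mem = time_call(
#       pipeline, {"waveform": waveform, "sample_rate": sample_rate}
#   )
#   print(f"from path:   wall={wall_path:.1f}s cpu={cpu_path:.1f}s")
#   print(f"from memory: wall={wall_mem:.1f}s cpu={cpu_mem:.1f}s")
```

Reporting both wall-clock and CPU time for the same audio file would show whether the extra time in the file-path case is spent on the CPU side, which is what the single core pinned at 100% suggests but does not prove.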

hbredin commented 5 months ago

The code might be "trivial" but the whole point of sharing a Google Colab is for pyannote maintainers to avoid wasting time on problems that are not reproducible.

For instance, two files with two different extensions (.wav and .mp3) are mentioned here. It is not clear which one works and which one fails.

Preparing a Google Colab will definitely increase your chances of having someone look at your issue. It might also happen that the mere preparation of the Google Colab makes you realize that the problem is on your side (I am not saying that this is the case here but it happened in the past).

DerEchteFeuerpfeil commented 5 months ago

+1 for this issue

Thanks for the note, @chubin. I used your solution with

io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io("audio.mp3")

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

and got much faster inference 👍

ahmetkipkip commented 2 months ago

> Unfortunately, I have no access to Google Colab from my Google Account (I can create a new account if needed), but as you can see the code is trivial.
>
> I noticed that the problem disappears, when I load the audio file using Audio:
>
> from pyannote.audio import Audio
> io = Audio(mono='downmix', sample_rate=16000)
> waveform, sample_rate = io("audio.mp3")
>
> diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
>
> instead of loading audio.wav directly. The wav file (audio.wav) has the same sample rate (16000) though.

Wow, after updating from 2.x to 3.x I had performance issues. Now it's better than the old code. I really didn't get what caused that, but...

Thanks