+1
RTX 4090 + Ryzen 5900X + 64 GB RAM. GPU utilization is <4%, CPU load ~80%.
3 hours of speech took 1 hour and 14 minutes to process.
I looked into this a bit just now and noticed that not only is CPU usage high, but disk IO is high as well. Interrupting the running process a few times as a crude form of profiling reveals that the bottleneck is mostly in pyannote/audio/core/io.py, in the crop function, where it calls torchaudio.load to read tiny little slices of the file from disk, one by one. Oddly, if you look at the except block just below that, it's written such that if loading a slice fails, it loads the entire file into memory and caches it instead, which seems to me like it should obviously be the default behavior: it would be way faster, and do you really usually process audio files so huge they won't fit into memory?

Unfortunately, there also seems to be some other bug in this fallback path that causes it to spit out an incorrectly sized tensor and fail further up the stack, so just forcing it into the except branch does not work. I didn't track this bug down completely, but I'd guess it's due to the extra call to downmix_and_resample that happens inside self.__call__, or a side effect of having the "waveform" key inside the "file" object from there on out, since that's referenced in a few places too. However, doing the same thing but calling torchaudio.load directly and caching the result under a key other than "waveform" (see image) works as expected and gives a speedup of several orders of magnitude. Note that even after this, CPU usage is still much higher than GPU usage, but processing is quick enough (about a minute per hour of audio on my machine) that it doesn't really matter.

I'm too lazy to go through the GitHub ritual of making a PR or whatever to fix this, but it seems like low-hanging fruit. I'd suggest that someone with more knowledge of the codebase figure out why the fallback processing in the except block is failing, fix that, and then reverse the logic so that the fallback (i.e. "load the whole file into memory") is the default, and it only falls back to loading slice by slice if the whole file won't fit or is larger than some threshold or something like that. Or maybe better yet, change it to load much larger slices every now and then and divvy out smaller slices from those, only loading another large slice when the previous one has been depleted.

tl;dr: applying the hack shown above to pyannote/audio/core/io.py allowed me to process a 1-hour file in about 1 minute on a 4090, where before it was taking, idk, half an hour or an hour or something.
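Since the screenshot doesn't come through here, a rough sketch of what the hack looks like inside the crop function of pyannote/audio/core/io.py (the cache key name is just a placeholder, not the exact patch from the image):

import torchaudio

# cache the decoded waveform once under a key other than "waveform",
# then slice it in memory instead of calling torchaudio.load() for every chunk
if "temp_hack_cache" not in file:
    file["temp_hack_cache"], _ = torchaudio.load(file["audio"])
data = file["temp_hack_cache"][:, start_frame:end_frame]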
@TheNiggler thanks a lot for your investigation and the hack you shared, I’ll definitely give it a try! Hope someone from the author team will take a look!
Recently, SageMaker updated the default PyTorch kernel to py3.10 with CUDA 11.8, so pyannote is no longer working properly there.
So what should we do?
I could not push to the current repo for some reason, so I forked the develop branch and added the changes suggested by @TheNiggler:
pip install git+https://github.com/mllife/pyannote-audio-118.git@822db88f573d7923d921dac11486f713c1729a09
@NikitaKononov this seems to be working for now. thanks @TheNiggler
thank you very much
Can confirm the hack above. THANK YOU!
Also seeing massive performance regressions. Unfortunately, @mllife's fork doesn't seem to improve the situation for me. I'm adding a notebook that reproduces the issue: the 3-minute sample file included used to take 30 seconds and now takes upwards of 15 minutes on a GPU Colab instance.
I've spent some time downgrading dependencies but still haven't been able to improve the situation.
Note: using the forked version. https://colab.research.google.com/drive/1-uizEpRXoiDlxcPjUlLe7H95h50iTUao?usp=sharing
@Filimoa you should send the pipeline to the GPU. It runs on the CPU by default:
import torch
diarization_pipeline.to(torch.device("cuda"))
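In full, that looks something like this (the checkpoint name and access_token variable are placeholders, adjust to your setup):

import torch
from pyannote.audio import Pipeline

# load the pretrained pipeline, then explicitly move it to the GPU
diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=access_token)
diarization_pipeline.to(torch.device("cuda"))
diarization = diarization_pipeline("audio.wav")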
@hbredin
I'm so stupid, thanks. Could have sworn I had this enabled.
I was getting

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 80000 but got size 76898 for tensor number 11 in the list.

after applying the fix @TheNiggler suggested. I figured out it was because the last chunk has fewer frames. Line 433 attempts to fix the issue by padding extra zeros, but it wasn't working for me.
I had to explicitly add the following code after the hack. Here's the complete fix.
import torchaudio
import torch.nn.functional as F  # for F.pad

# cache the whole file once instead of loading a slice from disk for every chunk
if 'temp_hack_cache' not in file:
    file['temp_hack_cache'], _ = torchaudio.load(file["audio"])
data = file['temp_hack_cache'][:, start_frame:end_frame]
# the last chunk can be shorter than num_frames, so zero-pad it
curr_frames = len(data[0])
if curr_frames != num_frames:
    data = F.pad(data, pad=(0, num_frames - curr_frames))
For me, the problem was running inference on an .mp3 instead of a .wav (from >1 h to ~1 min after converting).
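In case anyone wants to try the same, a minimal conversion sketch with torchaudio (file names are placeholders):

import torchaudio

# decode the mp3 once and write it back out as a wav before running the pipeline
waveform, sample_rate = torchaudio.load("meeting.mp3")
torchaudio.save("meeting.wav", waveform, sample_rate)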
It still uses nearly no GPU and a lot of CPU in my project.
Try using it through whisperX; that worked for me.
The fixes mentioned in this thread didn't noticeably increase my cuda utilisation. What did help is loading my files into memory before giving them to the pipeline instead of providing the pipeline with a file path. The guide is here under the heading "Processing a file from memory":
https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/applying_a_pipeline.ipynb
This is what worked for me to reduce my time from approx 18 minutes to around 10 seconds!
import torchaudio

waveform, sample_rate = torchaudio.load(AUDIO_FILE)
audio_in_memory = {"waveform": waveform, "sample_rate": sample_rate}
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=access_token)
diarization, speaker_embeddings = pipeline(audio_in_memory, return_embeddings=True)
This helped me too, with good GPU utilization and reduced inference time. Thanks @jhmejia!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I noticed that the diarization process is VERY slow: it took about an hour for a 2-hour audio file, and the time taken seems to increase exponentially with the audio length.
I wanted to find out where the bottleneck is, so I observed the CPU and GPU usage.
It seems that the GPU is under a very light load while CPU usage is almost maxed out (100% on one core).
Does anyone know how we can make better use of the GPU while doing diarization? Thanks!