pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

speaker-diarization-3.1 using 0% gpu #1557

Closed. MEPO29 closed this issue 1 month ago.

MEPO29 commented 10 months ago

I'm using Google Colab and it is utilizing 0% GPU. Sometimes it uses 100% for a second and then goes back to 0%. The audio is about 1.5 hours long. Is this normal behaviour? Keep in mind the CPU is at 100% almost all of the time. [screenshots: Colab resource monitor showing GPU near 0% and CPU near 100%]

github-actions[bot] commented 10 months ago

Thank you for your issue. You might want to check the FAQ if you haven't done so already.

Feel free to close this issue if you found an answer in the FAQ.

If your issue is a feature request, please read this first and update your request accordingly, if needed.

If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:

Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).

Companies relying on pyannote.audio in production may contact me via email regarding:

This is an automated reply, generated by FAQtory

petros94 commented 10 months ago

Same here

MohammedAlhajji commented 10 months ago

I somehow fixed this by using

pip uninstall pyannote.audio
conda install -c conda-forge pyannote.core
pip install pyannote.audio

on my M3
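
For what it's worth, on Apple silicon the GPU is only reachable through PyTorch's MPS backend, so a quick sanity check (a minimal sketch; assumes a torch build with MPS support) is:

import torch

# On an M-series Mac, pyannote can only use the GPU via the MPS backend
print(torch.backends.mps.is_available())  # True if the Metal backend is usable
print(torch.backends.mps.is_built())      # True if this torch build includes MPS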

KennethTrinh commented 10 months ago

This worked for me: after uninstalling onnxruntime and onnxruntime-gpu, run pip install optimum[onnxruntime-gpu]. See this link: https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/gpu#installation
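
For context, the 3.0.x releases ran the speaker embedding model through onnxruntime, so a CPU-only onnxruntime install would silently keep that stage off the GPU (3.1 dropped that dependency, as noted below). A quick way to check which execution providers are available (a sketch, assuming onnxruntime is importable):

import onnxruntime

# After installing optimum[onnxruntime-gpu], this list should include
# 'CUDAExecutionProvider'; a CPU-only install lists only
# 'CPUExecutionProvider', which would explain 0% GPU utilization.
print(onnxruntime.get_available_providers())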

arnavmehta7 commented 10 months ago

@KennethTrinh I don't understand how that'll work, because the new version doesn't depend on onnx.

pourmand1376 commented 10 months ago

@KennethTrinh I tested your solution and it didn't work for me.

KennethTrinh commented 10 months ago

Oops, I was running in a different environment than you guys (was in an ec2 instance) - apologies! I tried to reproduce in a Colab notebook with a plain old T4 GPU.

Non-blocking code to poll the GPU for usage and memory (I don't have a cloud shell since I'm poor!); run this first if you don't have a cloud shell:

import threading
import subprocess
import time

def run_nvidia_smi():
    # Poll nvidia-smi once per second and append the readings to output.log
    while True:
        try:
            output = subprocess.check_output(
                ['nvidia-smi', '--query-gpu=timestamp,utilization.gpu,utilization.memory', '--format=csv']
            )
            output_str = output.decode('utf-8').strip()
            with open('output.log', 'a') as f:
                f.write(output_str + '\n')
        except subprocess.CalledProcessError as e:
            print(f"Error running nvidia-smi: {e}")
        time.sleep(1)

# daemon=True so the polling thread doesn't keep the notebook process alive
thread = threading.Thread(target=run_nvidia_smi, daemon=True)
thread.start()
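
The thread keeps appending readings in the background, so after (or between) the diarization cells you can inspect them with !tail output.log.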

Code to run the diarization - don't forget to define your TOKEN beforehand

!pip install -q --upgrade pyannote.audio
!pip install -q transformers==4.35.2
!pip install -q datasets

import torch

from pyannote.audio import Pipeline
from datasets import load_dataset

DIARIZATION_MODEL = "pyannote/speaker-diarization-3.1"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stream a single sample from a small test dataset
concatenated_librispeech = load_dataset("sanchit-gandhi/concatenated_librispeech", split="train", streaming=True)
sample = next(iter(concatenated_librispeech))

# TOKEN is your Hugging Face access token, defined beforehand
diarization_pipeline = Pipeline.from_pretrained(
    DIARIZATION_MODEL,
    use_auth_token=TOKEN,
).to(device)

# Pass the audio as an in-memory tensor of shape (channel, time)
input_tensor = torch.from_numpy(sample['audio']['array']).float().unsqueeze(0)
output = diarization_pipeline({"waveform": input_tensor, "sample_rate": sample['audio']['sampling_rate']})
output

My logs show that the GPU was indeed used (albeit very little, but my audio was only (347360,) samples, so this may change with your audio). The key difference is that I'm passing in a torch tensor of shape (channel, time), but you guys are just passing the .wav file?

timestamp, utilization.gpu [%], utilization.memory [%]
2023/11/27 19:34:24.171, 1 %, 0 %
timestamp, utilization.gpu [%], utilization.memory [%]
2023/11/27 19:34:25.188, 6 %, 1 %
timestamp, utilization.gpu [%], utilization.memory [%]
2023/11/27 19:34:26.208, 5 %, 0 %

jaffee commented 8 months ago

I found that when passing a filename to the speaker diarization pipeline, I got very poor performance. Profiling showed this was due to many, many calls to get_torchaudio_info and torchaudio._backend.ffmpeg._load_audio_fileobj, which indicated that the file was being reprocessed many times unnecessarily (all coming from the "crop" method). I noticed that the codepaths are very different if the incoming file object already has a "waveform" computed, so I did the following:

import torchaudio

# Decode the file once, up front, instead of letting the pipeline re-read it on every crop
waveform, sample_rate = torchaudio.load("segment_0.wav")
audio_file = {"waveform": waveform, "sample_rate": sample_rate}

and then I passed audio_file to my pipeline. This took my runtime from 5m to 13s.

I suspect it would be straightforward to change the code to perform this step internally up front and save the user the trouble... and it would probably also simplify a lot of the downstream code, since it could always assume it has a waveform.
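
As a rough illustration of the same idea from the user side, a minimal wrapper might look like this (diarize_file is a hypothetical helper, not part of pyannote.audio; assumes torchaudio is installed):

import torchaudio
from pyannote.audio import Pipeline

def diarize_file(pipeline, path):
    # Decode the audio exactly once; every subsequent "crop" then slices
    # the in-memory tensor instead of re-opening the file via ffmpeg.
    waveform, sample_rate = torchaudio.load(path)
    return pipeline({"waveform": waveform, "sample_rate": sample_rate})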

rmeissn commented 8 months ago

https://github.com/pyannote/pyannote-audio/issues/1557#issuecomment-1922466847 (the comment above) also solved it for me, using an eGPU and CUDA.

stale[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.