m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Diarization high memory usage not using dedicated gpu #594

Open Khaztaroth opened 11 months ago

Khaztaroth commented 11 months ago

Diarization runs very slowly, uses almost 12 GB of memory, and does not seem to run on the GPU (GPU-Z and Windows Task Manager show conflicting information).

image

When I interrupt the diarization step, the last call in the traceback is the code shown below. It points to something happening on the CPU, but I'm not sure whether that is the main process. Admittedly, I don't understand Python code very well.

(whisperx) PS DIRECTORY> whisperx "Return to The Obra Dinn Ep1.opus" --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4 --task transcribe --lang en --diarize --hf_token TOKEN
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
torchvision is not available - cannot save figures
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\Victo\.cache\torch\whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0. Bad things might happen unless you revert torch to 1.x.
>>Performing transcription...
>>Performing alignment...
>>Performing diarization...
Traceback (most recent call last):
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\USER\anaconda3\envs\whisperx\Scripts\whisperx.exe\__main__.py", line 7, in <module>
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\whisperx\transcribe.py", line 220, in cli
    diarize_segments = diarize_model(input_audio_path, min_speakers=min_speakers, max_speakers=max_speakers)
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\whisperx\diarize.py", line 28, in __call__
    segments = self.model(audio_data, min_speakers=min_speakers, max_speakers=max_speakers)
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\pyannote\audio\core\pipeline.py", line 325, in __call__
    return self.apply(file, **kwargs)
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\pyannote\audio\pipelines\speaker_diarization.py", line 514, in apply
    embeddings = self.get_embeddings(
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\pyannote\audio\pipelines\speaker_diarization.py", line 349, in get_embeddings
    embedding_batch: np.ndarray = self._embedding(
  File "C:\Users\USER\anaconda3\envs\whisperx\lib\site-packages\pyannote\audio\pipelines\speaker_verification.py", line 709, in __call__
    return embeddings.cpu().numpy()
KeyboardInterrupt

Extra testing:

It seems that, at least in my particular setup, the diarization model could not select the dedicated GPU over the integrated one. Setting my system to use only the dedicated GPU for everything ensured that diarization ran on it.

image

Memory usage is still high, and it takes much longer than it did previously. However, those could very well be issues with the diarization model and not WhisperX's implementation.
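
For anyone trying to reproduce this, a quick way to see which devices PyTorch itself can reach (independent of what GPU-Z or Task Manager report) is a small check like the one below. This is only a sketch: the function name `list_cuda_devices` is made up for illustration, and it only reports CUDA-capable devices, so an integrated GPU will normally not appear in the list at all.

```python
def list_cuda_devices():
    """Return a list describing each CUDA device PyTorch can see.

    Returns an empty list if PyTorch is missing, fails to load,
    or reports that CUDA is unavailable.
    """
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is not installed (or its libraries failed to load)
        # in the interpreter running this script.
        return []

    if not torch.cuda.is_available():
        # Could be a CPU-only build, a driver problem, or no NVIDIA GPU.
        return []

    devices = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        devices.append(
            {
                "index": index,
                "name": props.name,
                "total_memory_gb": round(props.total_memory / 1024**3, 1),
            }
        )
    return devices


if __name__ == "__main__":
    found = list_cuda_devices()
    if not found:
        print("PyTorch sees no CUDA device in this environment.")
    for device in found:
        print(device)
```

Running it with the same interpreter that runs `whisperx` matters, because a different Python installation may have a different PyTorch build.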

Khaztaroth commented 11 months ago

Extra extra testing:

Naively, I had updated to the latest version of PyTorch through pip rather than conda. I'm not sure what the difference is under the hood, since it doesn't throw any errors or warnings when running Whisper. However, it causes diarization to take several hours longer and use 3x the memory.

Creating the environment from scratch, making sure to use conda for PyTorch, yielded the expected results.
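
The thread does not establish why the pip install behaved differently. One possibility, and it is only an assumption, is that pip pulled a PyTorch build without CUDA support. A sketch like the following compares what the two environments actually contain; `describe_torch_build` is a hypothetical helper name, while `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()` are standard PyTorch attributes.

```python
def describe_torch_build():
    """Summarise the PyTorch build in the current environment."""
    try:
        import torch
    except (ImportError, OSError):
        # No usable PyTorch in this interpreter.
        return {"installed": False}

    return {
        "installed": True,
        # The version string may carry a build suffix such as "+cpu".
        "version": torch.__version__,
        # None when the build was compiled without CUDA support.
        "built_with_cuda": torch.version.cuda,
        # False if the build lacks CUDA or no usable GPU/driver is found.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None` in the pip environment but set in the conda one, that would point to the build rather than to WhisperX. If both report CUDA, the cause lies elsewhere.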

Khaztaroth commented 11 months ago

A side effect of this seems to be that WhisperX can't be used outside of a conda environment, which prevents it from being comfortably integrated into tools like Subtitle Edit, which can now use Whisper and its variants to automatically create subtitles.

Every way I have tried of installing and running WhisperX from a default Windows prompt has the same problem: it does not correctly use the GPU for transcription or diarization. In Subtitle Edit, it returns a single period character instead of a proper transcription like vanilla Whisper or even Faster-Whisper.

I know this repo is more of a proof of concept than a tool intended for mass use. However, it consistently yields results that I'm happier with than those of other Whisper forks. It would be useful for it to work outside of a conda environment.
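
As a possible stopgap, an external tool could start WhisperX inside the working conda environment through `conda run` instead of calling it from the default prompt. The wrapper below is an untested sketch, not a confirmed fix for the Subtitle Edit case: the helper names are hypothetical, the environment name `whisperx` is taken from the prompt in the log above, and it assumes `conda` is on `PATH`.

```python
import shutil
import subprocess


def build_whisperx_command(audio_path, env_name="whisperx",
                           extra_args=(), conda_exe="conda"):
    """Build the argument list that runs whisperx inside a conda env."""
    return [
        conda_exe, "run",
        "--no-capture-output",  # stream whisperx output instead of buffering it
        "-n", env_name,         # name of the conda environment (assumed)
        "whisperx", audio_path,
        *extra_args,
    ]


def run_whisperx(audio_path, env_name="whisperx", extra_args=()):
    """Run whisperx through `conda run`, raising if conda is not found."""
    conda_exe = shutil.which("conda")
    if conda_exe is None:
        raise RuntimeError("conda was not found on PATH")
    command = build_whisperx_command(audio_path, env_name, extra_args, conda_exe)
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Flags copied from the original command line; replace TOKEN with a real token.
    print(build_whisperx_command(
        "Return to The Obra Dinn Ep1.opus",
        extra_args=["--model", "large-v2", "--diarize", "--hf_token", "TOKEN"],
    ))
```

Whether a given front end can be pointed at such a wrapper depends on how it invokes its transcription engine, which is not covered here.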