victor-upmeet / whisperx-replicate

Diarization breaks at replicate. Not diarizing, works. #7

Closed xmontero closed 1 week ago

xmontero commented 1 week ago

Hi! I'm using Replicate to try out diarization.

I'm testing with a 1'15" file: an MP3 converted from an AMR recording of a phone voice call.

When I run it without a language (to force detection), with all the default parameters and with alignment and diarization set to false, it works.

The file is in Catalan ("ca"), and the language is detected properly.

When I run it with alignment and diarization set to true, it breaks. I get this output:

Logs:

No language specified, language will be first be detected for each audio file (increases inference time).
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.4.0. To apply the upgrade to your files permanently, run python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/torch/whisperx-vad-segmentation.bin
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.0+cu121. Bad things might happen unless you revert torch to 1.x.
Detected language: ca (0.92) in first 30s of audio...
Could not download 'pyannote/speaker-diarization-3.1' pipeline.
It might be because the pipeline is private or gated so make sure to authenticate.
Visit https://hf.co/settings/tokens to create your access token and retry with:

Pipeline.from_pretrained('pyannote/speaker-diarization-3.1', ... use_auth_token=YOUR_AUTH_TOKEN)

If this still does not work, it might be because the pipeline is gated: visit https://hf.co/pyannote/speaker-diarization-3.1 to accept the user conditions.
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/cog/server/worker.py", line 354, in _predict
    result = predict(**payload)
             ^^^^^^^^^^^^^^^^^^
  File "/src/predict.py", line 167, in predict
    result = diarize(audio, result, debug, huggingface_access_token, min_speakers, max_speakers)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/predict.py", line 291, in diarize
    diarize_model = whisperx.DiarizationPipeline(use_auth_token=huggingface_access_token, device=device)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/whisperx/diarize.py", line 19, in __init__
    self.model = Pipeline.from_pretrained(model_name, use_auth_token=use_auth_token).to(device)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'to'

It breaks on both victor-upmeet/whisperx and victor-upmeet/whisperx-a40-large.

What do I have to do to use diarization?

victor-upmeet commented 1 week ago

You need to pass a Hugging Face access token in the prediction parameters. The Hugging Face account that the token belongs to must have access to both https://huggingface.co/pyannote/segmentation-3.0 and https://huggingface.co/pyannote/speaker-diarization-3.1
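
For anyone calling the model from Python, here is a minimal sketch of what passing the token could look like. The parameter name `huggingface_access_token` comes from the traceback above. The other input names (`audio_file`, `align_output`, `diarization`), the helper names, the `HF_TOKEN` environment variable and the example URL are assumptions made for illustration, so check them against the model's input schema on Replicate. The logs show that when the gated pipeline cannot be downloaded, `Pipeline.from_pretrained` returns `None`, which is where the `AttributeError` comes from. The sketch therefore refuses to request diarization without a token.

```python
import os


def build_input(audio_url, hf_token, diarize=True):
    """Build the prediction input; refuse diarization without a token."""
    if diarize and not hf_token:
        raise ValueError("diarization needs a Hugging Face access token")
    payload = {
        "audio_file": audio_url,   # assumed input name
        "align_output": diarize,   # assumed input name
        "diarization": diarize,    # assumed input name
    }
    if diarize:
        # Parameter name as it appears in predict.py in the traceback.
        payload["huggingface_access_token"] = hf_token
    return payload


def run_prediction(payload, model="victor-upmeet/whisperx"):
    """Send the payload to Replicate (needs REPLICATE_API_TOKEN to be set)."""
    import replicate  # imported lazily; install with `pip install replicate`

    # Depending on the client version, the model reference may need
    # a ":<version>" suffix.
    return replicate.run(model, input=payload)


# Example call (not executed here):
# payload = build_input("https://example.com/call.mp3", os.environ.get("HF_TOKEN"))
# print(run_prediction(payload))
```

The token alone is not enough: as noted above, the account behind it also has to have accepted the user conditions on both pyannote model pages.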