m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Output files aren't being created on (long) 3-speaker diarization transcripts #272

Closed: 7k50 closed this issue 8 months ago

7k50 commented 1 year ago

When transcribing and diarizing podcasts with WhisperX, across several different podcasts, I've found that WhisperX sometimes won't create any output files (.srt, .vtt, etc.).

In these cases, the following factors have been true: the audio is long (a full podcast episode), and the recording has three speakers. Note: I'm not claiming this is the full scope of the problem, since I haven't tested all permutations; I'm simply noting that there may be a problem when one or more of these factors are in play.

Here is my standard command:

initial_prompt_argument = f'--initial_prompt "{initial_prompt}"' if initial_prompt is not None else ""

!whisperx "{audio_path_wav}" \
--task "transcribe" \
--model "{model_name}" \
--language "{language}" \
--output_dir "{dir_whisper}" \
--device "cuda" \
--align_model "WAV2VEC2_ASR_LARGE_LV60K_960H" \
--diarize \
--min_speakers "{num_speakers}" \
--max_speakers "{num_speakers}" \
--output_format "srt" \
--highlight_words True \
--hf_token "{hf_token}" \
--verbose True \
{initial_prompt_argument}

I have also tried the command without output_dir, align_model, or output_format (all omitted together). The same issue occurs: no files of any kind (all, srt, vtt, txt, tsv, json) are created in the default output_dir (./)

Does anyone have an idea of what the issue might be?

I can provide more background details if needed, including the accompanying code, but I'm not sure it's relevant: I use the exact same Google Colab notebook for all cases and only switch the value of num_speakers (passed to min_speakers and max_speakers) and the input audio file. The notebook works fine with <=2 speakers and creates the .srt file as intended.

I will keep trying to narrow down the problem, but I'd deeply appreciate any troubleshooting ideas.

Here's what the console log can look like:

>>Performing transcription...
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ls960.pth
100% 360M/360M [00:01<00:00, 246MB/s]
>>Performing alignment...
>>Performing diarization...
Downloading (…)olve/2.1/config.yaml: 100% 500/500 [00:00<00:00, 2.75MB/s]
Downloading pytorch_model.bin: 100% 17.7M/17.7M [00:00<00:00, 61.8MB/s]
Downloading (…)/2022.07/config.yaml: 100% 318/318 [00:00<00:00, 1.94MB/s]
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118. Bad things might happen unless you revert torch to 1.x.
Downloading (…)ain/hyperparams.yaml: 100% 1.92k/1.92k [00:00<00:00, 12.8MB/s]
Downloading embedding_model.ckpt: 100% 83.3M/83.3M [00:00<00:00, 188MB/s]
Downloading (…)an_var_norm_emb.ckpt: 100% 1.92k/1.92k [00:00<00:00, 14.7MB/s]
Downloading classifier.ckpt: 100% 5.53M/5.53M [00:00<00:00, 229MB/s]
Downloading (…)in/label_encoder.txt: 100% 129k/129k [00:00<00:00, 5.23MB/s]
^C

# < Script continues after !whisperx command, and attempts to find the .srt file for further processing >

WhisperX Transcription DONE!
File not found. Looking for the file...
File not found. Looking for the file...
File not found. Looking for the file...
File not found. Looking for the file...

# < Loop continues >
sorgfresser commented 1 year ago

The ^C is a bit odd. Do you always use a keyboard interrupt or does the script finish without it?

7k50 commented 1 year ago

The ^C is a bit odd. Do you always use a keyboard interrupt or does the script finish without it?

You are correct, very odd. Neither my script nor I issues a keyboard interrupt manually. Is this a "feature" of WhisperX or of Google Colab, for some reason?

When I run the exact same script, but with 2 speakers on another podcast, the log excerpt from running WhisperX is:

Running WhisperX Command...
2023-05-25 11:06:29.355568: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading model.bin:   1% 21.0M/3.09G [00:00<00:15, 194MB/s]
Downloading (…)37e8b/vocabulary.txt:   0% 0.00/460k [00:00<?, ?B/s]

Downloading (…)08837e8b/config.json: 100% 2.80k/2.80k [00:00<00:00, 14.4MB/s]
Downloading model.bin:   1% 41.9M/3.09G [00:00<00:15, 196MB/s]

Downloading model.bin:   4% 126M/3.09G [00:00<00:10, 285MB/s] 
Downloading (…)37e8b/vocabulary.txt: 100% 460k/460k [00:00<00:00, 1.08MB/s]
Downloading model.bin:  12% 367M/3.09G [00:01<00:09, 280MB/s]

Downloading (…)37e8b/tokenizer.json: 100% 2.20M/2.20M [00:01<00:00, 1.99MB/s]
Downloading model.bin: 100% 3.09G/3.09G [00:20<00:00, 147MB/s] 
100%|█████████████████████████████████████| 16.9M/16.9M [00:02<00:00, 6.53MiB/s]
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118. Bad things might happen unless you revert torch to 1.x.
>>Performing transcription...
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ls960.pth
100% 360M/360M [00:02<00:00, 148MB/s]
>>Performing alignment...
>>Performing diarization...
Downloading (…)olve/2.1/config.yaml: 100% 500/500 [00:00<00:00, 2.79MB/s]
Downloading pytorch_model.bin: 100% 17.7M/17.7M [00:00<00:00, 261MB/s]
Downloading (…)/2022.07/config.yaml: 100% 318/318 [00:00<00:00, 1.81MB/s]
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118. Bad things might happen unless you revert torch to 1.x.
Downloading (…)ain/hyperparams.yaml: 100% 1.92k/1.92k [00:00<00:00, 10.9MB/s]
Downloading embedding_model.ckpt: 100% 83.3M/83.3M [00:00<00:00, 308MB/s]
Downloading (…)an_var_norm_emb.ckpt: 100% 1.92k/1.92k [00:00<00:00, 11.3MB/s]
Downloading classifier.ckpt: 100% 5.53M/5.53M [00:00<00:00, 410MB/s]
Downloading (…)in/label_encoder.txt: 100% 129k/129k [00:00<00:00, 29.3MB/s]

WhisperX Transcription DONE!

# < Resulting .SRT file is found by my script and printed below >

SPEAKER_00: [00:00:00] Like it or not, to avoid this problem……

Notice that the ^C is not present when the 2-speaker transcript finishes. The script completes successfully as intended.

Here is a failed 3-speaker transcription process, for comparison of console outputs:

Running WhisperX Command...
2023-05-25 11:47:48.163299: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading model.bin:   0% 0.00/3.09G [00:00<?, ?B/s]
Downloading (…)08837e8b/config.json: 100% 2.80k/2.80k [00:00<00:00, 11.5MB/s]

Downloading (…)37e8b/vocabulary.txt: 100% 460k/460k [00:00<00:00, 29.3MB/s]
Downloading model.bin:   2% 73.4M/3.09G [00:00<00:08, 355MB/s]
Downloading model.bin:   6% 199M/3.09G [00:00<00:08, 328MB/s]
Downloading (…)37e8b/tokenizer.json: 100% 2.20M/2.20M [00:00<00:00, 5.25MB/s]
Downloading model.bin: 100% 3.09G/3.09G [00:19<00:00, 160MB/s] 
100%|██████████████████████████████████████| 16.9M/16.9M [00:00<00:00, 115MiB/s]
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118. Bad things might happen unless you revert torch to 1.x.
>>Performing transcription...
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ls960.pth
100% 360M/360M [00:06<00:00, 54.1MB/s]
>>Performing alignment...
>>Performing diarization...
Downloading (…)olve/2.1/config.yaml: 100% 500/500 [00:00<00:00, 2.70MB/s]
Downloading pytorch_model.bin: 100% 17.7M/17.7M [00:00<00:00, 282MB/s]
Downloading (…)/2022.07/config.yaml: 100% 318/318 [00:00<00:00, 1.88MB/s]
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0+cu118. Bad things might happen unless you revert torch to 1.x.
Downloading (…)ain/hyperparams.yaml: 100% 1.92k/1.92k [00:00<00:00, 9.07MB/s]
Downloading embedding_model.ckpt: 100% 83.3M/83.3M [00:00<00:00, 320MB/s]
Downloading (…)an_var_norm_emb.ckpt: 100% 1.92k/1.92k [00:00<00:00, 10.4MB/s]
Downloading classifier.ckpt: 100% 5.53M/5.53M [00:00<00:00, 396MB/s]
Downloading (…)in/label_encoder.txt: 100% 129k/129k [00:00<00:00, 1.51MB/s]
^C

WhisperX Transcription DONE!

File not found. Looking for the file...
File not found. Looking for the file...
File not found. Looking for the file...
File not found. Looking for the file...

NB: The difference in line count between the two outputs seems to be caused by how the downloads of model.bin are variably printed to the console. Otherwise the contents appear to be the same, except for the ^C in the failed output.

If WhisperX isn't causing the ostensible keyboard interrupt (^C), it might be Google Colab crashing or running out of GPU or other resources, but I don't think that's the case at the moment. Will continue testing.

7k50 commented 8 months ago

I am not sure, but I think this issue may occur when Google Colab runs out of RAM. It would then appear as if the program finished, when it actually crashed during transcription.

I will close this issue until I can verify.