m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License
11.33k stars 1.19k forks

Use whisperx and pyannote in Colab without HuggingFace token #841

Open biagioscalingipsy opened 1 month ago

biagioscalingipsy commented 1 month ago

Hello! I would like to use WhisperX and Pyannote together for automatic transcription with diarization. This works on Colab with a HuggingFace (HF) token, but I would like to avoid entering the HF token every time, so I was thinking of downloading the models locally and loading them when needed. I can do this for WhisperX but not for Pyannote: the transcription runs, but I cannot proceed with the diarization step. I followed these tutorials on how to use Pyannote and the offline pipelines and downloaded everything, but it still doesn't work. Can you help me?

I downloaded pytorch_model.bin and configuration.yaml for voice_activity_detection, then the YAML files for segmentation and speaker_diarization, and put them in the working directory. Then I used this code:

!pip install whisperx
import whisperx
import gc

device = "cuda"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)
model_name = "large-v2"

audio_file = "audio.wav"
model_dir = "/content/drive/MyDrive/whisperx_models/large-v2"
model = whisperx.load_model(model_name, device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

!pip install pyannote.audio
from pyannote.audio import Pipeline

# Load the diarization pipeline from a local config
# (the YAML must point to locally downloaded model files)
pipeline = Pipeline.from_pretrained("/content/drive/MyDrive/speaker_diarization.yaml")

# Apply diarization
diarization_result = pipeline(audio_file)
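For context, once diarization output is available, whisperx ships an assign_word_speakers helper to merge the speaker turns back into the aligned transcript. The core idea, giving each transcript segment the speaker whose turn overlaps it most, can be sketched in plain Python (hypothetical helper, not the library code):

```python
def assign_speakers(turns, segments):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    turns:    list of (start, end, speaker) diarization turns, in seconds
    segments: list of dicts with "start" and "end" keys (aligned transcript)
    A sketch of the overlap logic only; whisperx.assign_word_speakers also
    handles word-level timestamps and pandas DataFrames.
    """
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for start, end, speaker in turns:
            # Duration of the intersection between the turn and the segment
            overlap = min(end, seg["end"]) - max(start, seg["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        seg["speaker"] = best_speaker
    return segments

turns = [(0.0, 5.0, "SPEAKER_00"), (5.0, 9.0, "SPEAKER_01")]
segments = [{"start": 0.5, "end": 4.0}, {"start": 5.2, "end": 8.0}]
print(assign_speakers(turns, segments))
```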
Manamama commented 1 month ago

The models seem to be downloaded once and then stay put, so they work offline afterwards. Download them once via a simple whisperx --diarize ... run, as per the instructions here, then check where they are (on Linux: find ~/.cache -type f -size +1M -mmin -60); they usually land here:

ls ~/.cache/torch/pyannote/
models--pyannote--segmentation-3.0  models--pyannote--speaker-diarization-3.1  models--pyannote--wespeaker-voxceleb-resnet34-LM

And that is it: the HuggingFace (HF) token is no longer needed, since everything runs offline from then on.
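That cache check can also be scripted before going fully offline. A minimal sketch, assuming the directory names from the listing above (the exact cache layout may vary by whisperx/pyannote version):

```python
import os

# Snapshot directories the one-time `whisperx --diarize` run is expected to
# create, per the listing above (assumed names; may differ between versions)
EXPECTED = [
    "models--pyannote--segmentation-3.0",
    "models--pyannote--speaker-diarization-3.1",
    "models--pyannote--wespeaker-voxceleb-resnet34-LM",
]

def missing_pyannote_models(cache_root):
    """Return the expected model directories not yet present under cache_root."""
    return [name for name in EXPECTED
            if not os.path.isdir(os.path.join(cache_root, name))]

cache = os.path.expanduser("~/.cache/torch/pyannote")
print(missing_pyannote_models(cache))  # prints whichever snapshots are absent
```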