m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

min_speakers=2, max_speakers=2 --> Getting 4 speakers? I am new to this #677

Open APISeeker opened 7 months ago

APISeeker commented 7 months ago

Hello, I am new to this. How do you make sure it only finds 2 speakers? I am trying to run WhisperX on a video where two people speak, but I am getting results with more than two speakers. I used this code: diarize_model(audio, min_speakers=2, max_speakers=2)

But I get results like this: .....[{'word': 'Welcome', 'start': 0.663, 'end': 0.903, 'score': 0.906, 'speaker': 'SPEAKER_00'}, {'word': 'back.', 'start': 0.943, 'end': 1.143, 'score': 0.928, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 2.184, 'end': 2.745, 'text': 'Here we go again.', 'words': [{'word': 'Here', 'start': 2.184, 'end': 2.324, 'score': 0.766, 'speaker': 'SPEAKER_03'}

Notice how I already have SPEAKER_00 and SPEAKER_03! Elsewhere, I also have SPEAKER_01 and SPEAKER_02, so that's 4 speakers, whereas my video has only 2!

What am I doing wrong? Full code:

import whisperx
import gc 

device = "cuda" 
audio_file = "audio...xxx.. .mp3"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import gc; import torch; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc; import torch; gc.collect(); torch.cuda.empty_cache(); del model_a

#EXTRA FROM ME: https://huggingface.co/pyannote/speaker-diarization-3.1

# 3. Assign speaker labels
# diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

diarize_model = whisperx.DiarizationPipeline(use_auth_token="..........", device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)
diarize_model(audio, min_speakers=2, max_speakers=2)
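# note: the return value of the constrained call above is discarded;
# diarize_segments (used below) still comes from the unconstrained diarize_model(audio) call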

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

fileResult="Result1.txt"
with open(fileResult, "w", encoding='utf-8') as r_file:
    r_file.write(str(diarize_segments))
fileResult="Result2.txt"
with open(fileResult, "w", encoding='utf-8') as r_file:
    r_file.write(str(result))
fileResult="Result3.txt"    
with open(fileResult, "w", encoding='utf-8') as r_file:
    r_file.write(str(result["segments"]))
abhi2596 commented 7 months ago

I am not sure why you are getting 4 speakers, but if you know the exact number of speakers you can pass num_speakers=2 instead of min_speakers and max_speakers, i.e. diarize_model(audio, num_speakers=2). See https://github.com/m-bain/whisperX/blob/main/whisperx/diarize.py (line 28), and also the "Controlling the number of speakers" section at https://huggingface.co/pyannote/speaker-diarization-3.1
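
A minimal sketch of that change applied to step 3 of the code above (the token placeholder is from the original post); note that diarize_segments must receive the return value for the constraint to take effect:

# 3. Assign speaker labels, forcing exactly two speakers
diarize_model = whisperx.DiarizationPipeline(use_auth_token="..........", device=device)

# assign the return value; this is what assign_word_speakers consumes
diarize_segments = diarize_model(audio, num_speakers=2)

result = whisperx.assign_word_speakers(diarize_segments, result)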

APISeeker commented 7 months ago

Thanks a lot, I will try it.

ChristianSch commented 1 month ago

Note that the CLI does not offer num_speakers.
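
If the CLI exposes --min_speakers and --max_speakers alongside --diarize (mirroring the Python API above), pinning both bounds to the same value should act like num_speakers; a hedged sketch:

# possible CLI workaround: pin both speaker bounds to 2
whisperx audio.mp3 --diarize --min_speakers 2 --max_speakers 2 --hf_token YOUR_HF_TOKEN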