m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Speaker labels not appearing in sentence level srt when using diarize option #197

Open shanky100 opened 1 year ago

shanky100 commented 1 year ago

I am using the below command to generate output

whisperx MY_AUDIO_FILE --model base.en --diarize --hf_token MY_TOKEN --output_format all

In the word-level SRT I can see speaker labels, but the sentence-level SRT contains only the start time, end time, and transcription.

Can someone help me with this? How can I get speaker labels in the sentence-level SRT?

wesgould commented 1 year ago

I am also interested in this. I haven't found a way to do it in the --help options yet. I wonder if it would also be possible to group speaker segments together, so that it captures everything speaker 1 says in one segment/paragraph until another speaker starts.

I'm sure I could do some post-processing sed/awk witchcraft to bundle the sentences together if I had to.

rhambus commented 1 year ago

Agreed, this would be fantastic for me as well. Making it human-readable, at least a little bit, would be a huge help.

shanky100 commented 1 year ago

Just found a workaround for this. You can convert the word-level SRT and the sentence-level SRT into their respective CSVs, then match rows on their timestamps to look up each sentence's speaker label in the word-level CSV and attach it to the sentence. Please let me know if this helps.
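
The time-based matching described above could look roughly like the sketch below. It is not the poster's actual script: it assumes both SRT/CSV files have already been parsed into lists of dicts with "start" and "end" in seconds, and the helper names (overlap, label_sentences) are made up for illustration. Each sentence gets the speaker whose words overlap it for the longest total time.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_sentences(sentences, words):
    """Attach a "speaker" key to each sentence based on overlapping word rows.

    sentences: dicts with "start", "end", "text" (hypothetical parsed sentence rows)
    words:     dicts with "start", "end", "speaker" (hypothetical parsed word rows)
    """
    labeled = []
    for sent in sentences:
        totals = defaultdict(float)
        for word in words:
            shared = overlap(sent["start"], sent["end"], word["start"], word["end"])
            if shared > 0 and word.get("speaker"):
                totals[word["speaker"]] += shared
        # Pick the speaker with the most overlapping time; None if no word matched.
        speaker = max(totals, key=totals.get) if totals else None
        labeled.append({**sent, "speaker": speaker})
    return labeled


sentences = [
    {"start": 0.0, "end": 2.0, "text": "Hello there."},
    {"start": 2.5, "end": 4.0, "text": "Hi."},
]
words = [
    {"start": 0.0, "end": 0.8, "speaker": "SPEAKER_00"},
    {"start": 1.0, "end": 1.9, "speaker": "SPEAKER_00"},
    {"start": 2.6, "end": 3.0, "speaker": "SPEAKER_01"},
]
for row in label_sentences(sentences, words):
    print(row["speaker"], row["text"])
```

Voting by overlap duration rather than taking the first matching word keeps a single mislabeled word from flipping the label of a whole sentence.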

rhambus commented 1 year ago

That does help, thanks! If you have an example of how you did this, I'd greatly appreciate it. I am miserable with that kind of thing.

Leekao commented 1 year ago

@cybersandwich I was looking for the same output format as you, so I followed this blog post (https://towardsdatascience.com/unlock-the-power-of-audio-data-advanced-transcription-and-diarization-with-whisper-whisperx-and-ed9424307281). After copying and pasting the code (and adding the imports), I got good results. Here is my code:

from typing import Any

from whisper import load_model
from whisperx import load_align_model, align
from whisperx.transcribe import DiarizationPipeline, assign_word_speakers


def transcribe(audio_file: str, model_name: str, device: str = "cpu"):
    # Run plain Whisper to get segment-level text and the detected language.
    model = load_model(model_name, device)
    result = model.transcribe(audio_file)

    language_code = result["language"]
    return {
        "segments": result["segments"],
        "language_code": language_code,
    }


def align_segments(
    segments: list[dict[str, Any]],
    language_code: str,
    audio_file: str,
    device: str = "cpu",
) -> dict[str, Any]:
    # Refine the segment timestamps with the language-specific alignment model.
    model_a, metadata = load_align_model(language_code=language_code, device=device)
    result_aligned = align(segments, model_a, metadata, audio_file, device)
    return result_aligned


def diarize(audio_file: str, hf_token: str) -> dict[str, Any]:
    # Find out who speaks when; needs a Hugging Face token.
    diarization_pipeline = DiarizationPipeline(use_auth_token=hf_token)
    diarization_result = diarization_pipeline(audio_file)
    return diarization_result


def assign_speakers(
    diarization_result: dict[str, Any], aligned_segments: dict[str, Any]
) -> list[dict[str, Any]]:
    # Combine the diarization output with the aligned segments.
    result_segments, word_seg = assign_word_speakers(
        diarization_result, aligned_segments["segments"]
    )
    results_segments_w_speakers: list[dict[str, Any]] = []
    for result_segment in result_segments:
        results_segments_w_speakers.append(
            {
                "start": result_segment["start"],
                "end": result_segment["end"],
                "text": result_segment["text"],
                "speaker": result_segment["speaker"],
            }
        )
    return results_segments_w_speakers


def merge_segments(results_segments_w_speakers):
    merged_segments = []
    for segment in results_segments_w_speakers:
        if merged_segments and merged_segments[-1]['speaker'] == segment['speaker']:
            # If the speaker of this segment is the same as the last one,
            # we extend the last segment to include this one.
            merged_segments[-1]['end'] = segment['end']
            merged_segments[-1]['text'] += ' ' + segment['text']
        else:
            # Otherwise, we start a new segment.
            merged_segments.append(segment)
    return merged_segments


def transcribe_and_diarize(
    audio_file: str,
    hf_token: str,
    model_name: str,
    device: str = "cpu",
):
    transcript = transcribe(audio_file, model_name, device)
    aligned_segments = align_segments(
        transcript["segments"], transcript["language_code"], audio_file, device
    )
    diarization_result = diarize(audio_file, hf_token)
    results_segments_w_speakers = assign_speakers(diarization_result, aligned_segments)
    merged_segments = merge_segments(results_segments_w_speakers)

    # Print the results in a user-friendly way
    for i, segment in enumerate(merged_segments):
        print(f"Segment {i + 1}:")
        print(f"Start time: {segment['start']:.2f}")
        print(f"End time: {segment['end']:.2f}")
        print(f"Speaker: {segment['speaker']}")
        print(f"Transcript: {segment['text']}")

    return merged_segments


hf_token = "MY_TOKEN"  # placeholder: replace with your Hugging Face access token
transcribe_and_diarize("test.wav", hf_token, 'small', 'cpu')

And here is a sample of the output:

Start time: 54.44
End time: 62.00
Speaker: SPEAKER_01
Transcript:  I don't know what you had to tell him that for.  You put me in a very difficult position, marine biologist.  I'm very uncomfortable with this whole thing.

Segment 12:
Start time: 62.89
End time: 65.00
Speaker: SPEAKER_00
Transcript:  Yeah, with all due respect,  I would think it's right up your alley.

Segment 13:
Start time: 65.69
End time: 76.96
Speaker: SPEAKER_01
Transcript:  Well, it's not up my alley.  It's one thing if I make it up.  I know what I'm doing.  I know my alleys.  You've got me in the Galapagos Islands  living with the turtles.  I don't know where the hell I am.

Segment 14:
Start time: 78.19
End time: 82.00
Speaker: SPEAKER_00
Transcript:  Well, you came in the other day with all that whale stuff,  the squeaking and the squealing and...

Segment 15:
Start time: 82.69
End time: 104.80
Speaker: SPEAKER_01
Transcript:  Why couldn't you make me an architect?  You know I always wanted to pretend that I was an architect.  Well, I'm supposed to see it tomorrow.  I'm going to tell her what's going on.  I mean, maybe she just likes me for me.  Your parents must be so proud of you, George.  Oh, they're busting.  What are those people doing over there?

Segment 16:
Start time: 105.85
End time: 110.88
Speaker: SPEAKER_00
Transcript:  What's going on over here?  There's a beached whale. She's dying.  Does anyone hear a marine biologist?

Hope it helps somehow.
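
Since the original question was about getting speaker labels into a sentence-level SRT, here is a minimal sketch of turning segments shaped like the ones printed above (dicts with "start", "end", "speaker", "text") into SRT text. The function names and the "[SPEAKER]: text" prefix are my own choices for illustration, not something whisperX is confirmed to produce.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text with one numbered cue per segment, prefixed by its speaker."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        # Fall back to a placeholder label if a segment has no speaker assigned.
        speaker = seg.get("speaker", "UNKNOWN")
        # Collapse the doubled spaces left over from joining sentences.
        text = " ".join(seg["text"].split())
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"[{speaker}]: {text}\n"
        )
    return "\n".join(blocks)


example = [
    {"start": 62.89, "end": 65.0, "speaker": "SPEAKER_00",
     "text": " Yeah, with all due respect,  I would think it's right up your alley."},
]
with open("with_speakers.srt", "w", encoding="utf-8") as handle:  # hypothetical file name
    handle.write(segments_to_srt(example))
```

Passing the list returned by merge_segments gives one cue per speaker turn; passing the unmerged list keeps one cue per sentence.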

wesgould commented 1 year ago

It's running now. I assume if I change "cpu" to "gpu" across the board it will offload to my gfx card?

Leekao commented 1 year ago

If you want to offload to the GPU, you'll need something like CUDA installed. If it is, you can pass "cuda" as the device. Be mindful about offloading the bigger models to the GPU, as it may run out of memory.
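
A small sketch of choosing the device string at runtime instead of hard-coding it. It assumes PyTorch (which Whisper runs on) is the installed backend and uses torch.cuda.is_available() to check for a usable CUDA GPU; the helper name pick_device is made up for this example.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when a CUDA-capable GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to offload to, so stay on the CPU.
        return "cpu"
    if prefer_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The result can be passed wherever the script above takes its device argument, for example transcribe_and_diarize("test.wav", hf_token, 'small', device).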

wesgould commented 1 year ago

This is fantastic Leekao. I got it working with cuda and it's doing exactly what I need it to do. Thanks for sharing.

JiangN6 commented 10 months ago

@Leekao I tried the code you provided, but it gave an error. I think it is caused by a version mismatch. Could you please tell me which version you used, or update the code? Thank you very much.

Traceback (most recent call last):
  File "/home/lli/whisperpython39/leekao.py", line 89, in <module>
    sub = transcribe_and_diarize("test1.wav", "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", 'small', 'cpu')
  File "/home/lli/whisperpython39/leekao.py", line 60, in transcribe_and_diarize
    results_segments_w_speakers = assign_speakers(diarization_result, aligned_segments)
  File "/home/lli/whisperpython39/leekao.py", line 34, in assign_speakers
    result_segments, word_seg = assign_word_speakers(
  File "/home/lli/whisperpython39/venv/lib/python3.9/site-packages/whisperx/diarize.py", line 36, in assign_word_speakers
    transcript_segments = transcript_result["segments"]
TypeError: list indices must be integers or slices, not str

alexauvray commented 10 months ago

I get this error when executing the script provided by @Leekao:

    transcript_segments = transcript_result["segments"]
TypeError: list indices must be integers or slices, not str

omarsiddiqi224 commented 9 months ago

I get the same error. Were you able to resolve it?
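
A possible explanation, judging only from the traceback: the installed assign_word_speakers indexes its second argument with ["segments"], so it seems to expect the whole aligned result dict, while the script passes aligned_segments["segments"], which is a list. The sketch below passes the dict instead. It is untested against any specific whisperX release. It takes the function as a parameter so the shape handling is visible, and it assumes (not confirmed here) that the newer function returns a single dict with a "segments" key rather than a tuple. The name assign_speakers_from_result is hypothetical.

```python
from typing import Any, Callable


def assign_speakers_from_result(
    assign_fn: Callable[[Any, dict], dict],
    diarization_result: Any,
    aligned_result: dict,
) -> list:
    """Call assign_fn with the full aligned result and flatten its segments.

    assign_fn is meant to be whisperX's assign_word_speakers; the return shape
    (a dict with a "segments" key) is an assumption, not a documented guarantee.
    """
    # Pass the whole dict, not aligned_result["segments"].
    output = assign_fn(diarization_result, aligned_result)
    return [
        {
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
            # A segment may come back without a speaker; use a placeholder label.
            "speaker": seg.get("speaker", "UNKNOWN"),
        }
        for seg in output["segments"]
    ]
```

In the script this would replace the body of assign_speakers, called as assign_speakers_from_result(assign_word_speakers, diarization_result, aligned_segments). If the installed release still returns a tuple, the original unpacking is the one to keep.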