shanky100 opened this issue 1 year ago
I am also interested in this. I haven't found a way to do it in the --help options yet. I wonder if it would also be possible to group speaker segments together, so it captures everything speaker 1 says in a segment/paragraph until another speaker starts.
I'm sure there is a way for me to do some post-processing sed/awk witchcraft if I had to, to bundle the sentences together.
Agreed, this would be fantastic for me as well - making it human readable, at least a little bit, would be a huge help.
Just found a workaround for this. You can convert the word-level SRT and sentence-level SRT into their respective CSVs, then use time-based logic on both CSVs to pull the speaker label from the word-level CSV and attach it to the sentences. Please let me know if this helps.
That does help, thanks! If you have an example of how you did this, I'd greatly appreciate it. I am miserable with that kind of thing.
@cybersandwich I was looking for the same output format as you, so I followed this blog post (https://towardsdatascience.com/unlock-the-power-of-audio-data-advanced-transcription-and-diarization-with-whisper-whisperx-and-ed9424307281) and, after copying and pasting the code (and adding the imports), I got good results. Here is my code:
from whisper import load_model
from whisperx import load_align_model, align
from whisperx.transcribe import DiarizationPipeline, assign_word_speakers
from typing import Any


def transcribe(audio_file: str, model_name: str, device: str = "cpu"):
    # Transcribe the audio with vanilla Whisper and keep the detected language.
    model = load_model(model_name, device)
    result = model.transcribe(audio_file)
    language_code = result["language"]
    return {
        "segments": result["segments"],
        "language_code": language_code,
    }


def align_segments(
    segments: list[dict[str, Any]],
    language_code: str,
    audio_file: str,
    device: str = "cpu",
):
    # Align the segments to get accurate word-level timestamps.
    model_a, metadata = load_align_model(language_code=language_code, device=device)
    result_aligned = align(segments, model_a, metadata, audio_file, device)
    return result_aligned


def diarize(audio_file: str, hf_token: str) -> dict[str, Any]:
    # Run speaker diarization (needs a Hugging Face token).
    diarization_pipeline = DiarizationPipeline(use_auth_token=hf_token)
    diarization_result = diarization_pipeline(audio_file)
    return diarization_result


def assign_speakers(
    diarization_result: dict[str, Any], aligned_segments: dict[str, Any]
):
    # Attach a speaker label to every aligned segment.
    result_segments, word_seg = assign_word_speakers(
        diarization_result, aligned_segments["segments"]
    )
    results_segments_w_speakers: list[dict[str, Any]] = []
    for result_segment in result_segments:
        results_segments_w_speakers.append(
            {
                "start": result_segment["start"],
                "end": result_segment["end"],
                "text": result_segment["text"],
                "speaker": result_segment["speaker"],
            }
        )
    return results_segments_w_speakers


def transcribe_and_diarize(
    audio_file: str,
    hf_token: str,
    model_name: str,
    device: str = "cpu",
):
    transcript = transcribe(audio_file, model_name, device)
    aligned_segments = align_segments(
        transcript["segments"], transcript["language_code"], audio_file, device
    )
    diarization_result = diarize(audio_file, hf_token)
    results_segments_w_speakers = assign_speakers(diarization_result, aligned_segments)
    merged_segments = merge_segments(results_segments_w_speakers)
    # Print the results in a user-friendly way
    for i, segment in enumerate(merged_segments):
        print(f"Segment {i + 1}:")
        print(f"Start time: {segment['start']:.2f}")
        print(f"End time: {segment['end']:.2f}")
        print(f"Speaker: {segment['speaker']}")
        print(f"Transcript: {segment['text']}")
        print("")
    return merged_segments


def merge_segments(results_segments_w_speakers):
    # Group consecutive segments from the same speaker into one paragraph.
    merged_segments = []
    for segment in results_segments_w_speakers:
        if merged_segments and merged_segments[-1]['speaker'] == segment['speaker']:
            # If the speaker of this segment is the same as the last one,
            # we extend the last segment to include this one.
            merged_segments[-1]['end'] = segment['end']
            merged_segments[-1]['text'] += ' ' + segment['text']
        else:
            # Otherwise, we start a new segment.
            merged_segments.append(segment)
    return merged_segments


# Run the full pipeline on a sample file.
hf_token = "YOUR_HF_TOKEN"  # replace with your Hugging Face access token
transcribe_and_diarize("test.wav", hf_token, 'small', 'cpu')
And here is a sample of the output:
Start time: 54.44
End time: 62.00
Speaker: SPEAKER_01
Transcript: I don't know what you had to tell him that for. You put me in a very difficult position, marine biologist. I'm very uncomfortable with this whole thing.
Segment 12:
Start time: 62.89
End time: 65.00
Speaker: SPEAKER_00
Transcript: Yeah, with all due respect, I would think it's right up your alley.
Segment 13:
Start time: 65.69
End time: 76.96
Speaker: SPEAKER_01
Transcript: Well, it's not up my alley. It's one thing if I make it up. I know what I'm doing. I know my alleys. You've got me in the Galapagos Islands living with the turtles. I don't know where the hell I am.
Segment 14:
Start time: 78.19
End time: 82.00
Speaker: SPEAKER_00
Transcript: Well, you came in the other day with all that whale stuff, the squeaking and the squealing and...
Segment 15:
Start time: 82.69
End time: 104.80
Speaker: SPEAKER_01
Transcript: Why couldn't you make me an architect? You know I always wanted to pretend that I was an architect. Well, I'm supposed to see it tomorrow. I'm going to tell her what's going on. I mean, maybe she just likes me for me. Your parents must be so proud of you, George. Oh, they're busting. What are those people doing over there?
Segment 16:
Start time: 105.85
End time: 110.88
Speaker: SPEAKER_00
Transcript: What's going on over here? There's a beached whale. She's dying. Does anyone hear a marine biologist?
Hope it helps somehow.
It's running now. I assume if I change "cpu" to "gpu" across the board it will offload to my gfx card?
If you want to offload to the GPU you'll need something like CUDA installed; if it is, you can pass "cuda" as the device. Be mindful about offloading the bigger models to the GPU, as it can run out of memory.
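For what it's worth, here is a minimal sketch of picking the device automatically. It assumes PyTorch is installed (whisperx pulls it in) and reuses the transcribe_and_diarize function and hf_token from the snippet above:

import torch

# Use the GPU only when a CUDA-enabled PyTorch actually sees one;
# otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
transcribe_and_diarize("test.wav", hf_token, "small", device)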
This is fantastic, Leekao. I got it working with cuda and it's doing exactly what I need it to do. Thanks for sharing.
I tried the code you provided, but it gave an error. I think it is caused by a version mismatch. Could you please share the versions you used, or modify the code? Thank you very much.
Traceback (most recent call last):
File "/home/lli/whisperpython39/leekao.py", line 89, in
I get this error when executing the script provided by @Leekao:
transcript_segments = transcript_result["segments"]
TypeError: list indices must be integers or slices, not str
I get the same error. Were you able to resolve it?
I am using the command below to generate the output:
whisperx MY_AUDIO_FILE --model base.en --diarize --hf_token MY_TOKEN --output_format all
In the word-level SRT I am able to see speaker labels, but the sentence-level SRT only contains start time, end time, and transcription.
Can someone help me with this? How can I get speaker labels in the sentence-level SRT?
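In case it helps, here is a rough sketch of the time-matching idea mentioned earlier in this thread: copy the speaker tags from the word-level output onto the sentence-level SRT cues by timestamp overlap. It works on the SRT files directly instead of going through CSV, uses the third-party srt package, and assumes the word-level cues carry a "[SPEAKER_XX]:" prefix in their text; the file names are placeholders, so adjust them (and the regex) to whatever your whisperx version actually writes.

import re
import srt  # third-party package: pip install srt

def load_cues(path):
    # Parse an SRT file into a list of cue objects (start/end are timedeltas).
    with open(path, encoding="utf-8") as f:
        return list(srt.parse(f.read()))

def dominant_speaker(sentence_cue, word_cues):
    # Pick the speaker of the word cue that overlaps this sentence cue the most.
    best_speaker, best_overlap = "UNKNOWN", 0.0
    for w in word_cues:
        m = re.match(r"\[(SPEAKER_\d+)\]", w.content)
        if not m:
            continue
        overlap = (min(sentence_cue.end, w.end) - max(sentence_cue.start, w.start)).total_seconds()
        if overlap > best_overlap:
            best_speaker, best_overlap = m.group(1), overlap
    return best_speaker

word_cues = load_cues("audio.word.srt")       # placeholder file names
sentence_cues = load_cues("audio.srt")
for cue in sentence_cues:
    cue.content = f"[{dominant_speaker(cue, word_cues)}]: {cue.content}"

with open("audio.with_speakers.srt", "w", encoding="utf-8") as f:
    f.write(srt.compose(sentence_cues))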