m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

How to "read" the outputs? How to turn everything into an SRT file? #692

Open APISeeker opened 7 months ago

APISeeker commented 7 months ago

Hello, I tried whisperX and I see that the outputs are quite crowded: you get a JSON, I think, with dictionaries nested inside other dictionaries? There is a lot in there; I think there is a timestamp for each sentence, then a timestamp for each word, right? How do I turn that into a format that can be understood by video editing software (for instance SRT files)? Is there any algorithm or method that can help, please?

Thx

Khaztaroth commented 7 months ago

I don't know if you are implementing it in Python or using the command-line interface. However, if you are running it from the CLI, you can pass the argument --output_format srt to get a subtitle file.
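
For example, something along these lines (just a sketch, assuming your input file is audio.mp3) should produce an SRT file alongside the other outputs:

whisperx audio.mp3 --output_format srt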

APISeeker commented 7 months ago

Hey @Khaztaroth, I am using the "inference" example from here:

import whisperx
import gc 

device = "cuda" 
audio_file = "audio.mp3"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model_a

# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

" Then do py inference.py Will your argument work?

APISeeker commented 7 months ago

Also, where can I find the list of arguments? Where did you find out and learn about this argument, actually? Thanks.

APISeeker commented 7 months ago

Hello, it worked by using other code directly, but it DID NOT KEEP THE SPEAKER IDs, @Khaztaroth. I want both speakers and SRT at the same time!!

Khaztaroth commented 7 months ago

Sadly, I don't know much about how it works when used from a Python file; it's not necessary for the work I do. The list of arguments can be found from the console if you run whisperx --help.

You could just use the command prompt and run whisperx FILE --diarize --hf_token TOKEN --output_format srt

There are instructions for how to get the token in the repo's README file: https://github.com/m-bain/whisperX?tab=readme-ov-file#speaker-diarization

APISeeker commented 7 months ago

Hello @Khaztaroth, yes, but will the SRT output file contain the names of the speakers or not?

meera commented 7 months ago

Does anyone know how to write the output to VTT?

This is what I tried

with open("output-x.vtt", "w", encoding="utf-8") as vtt_file: vtt_writer = whisperx.utils.WriteVTT(output_dir=".") vtt_writer.write_result(result, file=vtt_file, options={"max_line_width":None,"max_line_count":None,"highlight_words":False})

It gives a key error:

KeyError: 'language'

lukaskellerstein commented 6 months ago

@meera

I was able to make it work. The important line is diarize_result["language"] = result["language"].

import whisperx
import gc
from dotenv import load_dotenv, find_dotenv
import os
from whisperx.utils import get_writer
import torch

_ = load_dotenv(find_dotenv())  # read local .env file

API_TOKEN = os.environ.get("HF_TOKEN")
print(API_TOKEN)

# -----------------------------------------
# Parameters
# -----------------------------------------

file = "my_file"

device = "cuda"
audio_file = f"mp3/{file}.mp3"
batch_size = 1  # reduce if low on GPU mem
compute_type = "int8"  # change to "int8" if low on GPU mem (may reduce accuracy)
whisper_size = "base"

# -----------------------------------------
# Model
# -----------------------------------------

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model(whisper_size, device, compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

# -----------------------------------------
# Transcription
# -----------------------------------------

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# # Save as a TXT file
# srt_writer = get_writer("txt", "captions/")
# srt_writer(result, audio_file, {})

# # Save as an SRT file
# srt_writer = get_writer("srt", "captions/")
# srt_writer(
#     result,
#     audio_file,
#     {"max_line_width": None, "max_line_count": None, "highlight_words": False},
# )

# # Save as a VTT file
# vtt_writer = get_writer("vtt", "captions/")
# vtt_writer(
#     result,
#     audio_file,
#     {"max_line_width": None, "max_line_count": None, "highlight_words": False},
# )

# # Save as a TSV file
# tsv_writer = get_writer("tsv", "captions/")
# tsv_writer(result, audio_file, {})

# # Save as a JSON file
# json_writer = get_writer("json", "captions/")
# json_writer(result, audio_file, {})

print("############### before alignment ###############")
print(result["segments"])  # before alignment

# delete model if low on GPU resources
import gc

gc.collect()
torch.cuda.empty_cache()
del model

# -----------------------------------------
# Alignment
# -----------------------------------------

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned_result = whisperx.align(
    result["segments"],
    model_a,
    metadata,
    audio,
    device,
    return_char_alignments=False,
)

print("############### after alignment ###############")
print(aligned_result)  # after alignment

# delete model if low on GPU resources
import gc

gc.collect()
torch.cuda.empty_cache()
del model_a

# -----------------------------------------
# Diarization
# -----------------------------------------

# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=API_TOKEN, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

diarize_result = whisperx.assign_word_speakers(diarize_segments, aligned_result)
# print(diarize_segments)

print("############### with speaker ID ###############")
print(diarize_result)  # segments are now assigned speaker IDs

# -----------------------------------------
# SAVE INTO FILES
# -----------------------------------------
# with open(f"captions/{file}_DIARIZED.json", "w") as f:
#     json.dump(result, f, indent=4)

# Save as a VTT file
diarize_result["language"] = result["language"]
vtt_writer = get_writer("vtt", "captions2/")
vtt_writer(
    diarize_result,
    audio_file,
    {"max_line_width": None, "max_line_count": None, "highlight_words": False},
)

I took inspiration from this issue.
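
If you also want the speaker labels to appear in the subtitle text itself (what @APISeeker asked for), a rough, untested sketch is to prefix each segment's text with its speaker before handing the result to the writer; the speaker key comes from assign_word_speakers, and segments with no matching speaker fall back to a placeholder label here:

# Sketch: put the speaker label into each subtitle line before writing SRT
diarize_result["language"] = result["language"]  # the writers expect this key
for segment in diarize_result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")  # assigned by assign_word_speakers
    segment["text"] = f"[{speaker}] {segment['text'].strip()}"

srt_writer = get_writer("srt", "captions2/")
srt_writer(
    diarize_result,
    audio_file,
    {"max_line_width": None, "max_line_count": None, "highlight_words": False},
)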