APISeeker opened this issue 7 months ago
I don't know if you are implementing it in Python or using the command-line interface. However, if you are running it from the CLI you can pass the argument --output_format srt to get a subtitle file.
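For example, something like whisperx audio.mp3 --output_format srt (audio.mp3 is just a placeholder file name here; whisperx --help lists the rest of the options).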
Hey @Khaztaroth, I am using the "inference" example from here:
import whisperx
import gc
device = "cuda"
audio_file = "audio.mp3"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)
# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)
# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment
# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model
# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
print(result["segments"]) # after alignment
# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model_a
# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)
# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)
result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs
" Then do py inference.py Will your argument work?
Alos where can I find list of arguments? Where did you find it and learn about this argument actually? Thanks
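(For what it's worth, --output_format is a CLI flag, not a transcribe() parameter; from a Python script the equivalent is the writer utilities, which the working example further down in this thread also uses. A minimal sketch, assuming the result and audio_file variables from the snippet above:)

from whisperx.utils import get_writer

srt_writer = get_writer("srt", ".")  # SRT writer targeting the current directory
srt_writer(result, audio_file, {"max_line_width": None, "max_line_count": None, "highlight_words": False})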
Hello @Khaztaroth, it worked by using another snippet directly, but it did not keep the speaker IDs. I want both speaker labels and SRT at the same time!
Sadly, I don't know much about how it works when used from a Python file; it's not necessary for the work I do. The list of arguments can be found from the console if you use whisperx --help.
You could just use the command prompt by inputting whisperx FILE --diarize --hf_token TOKEN --output_format srt
There are instructions for how to get the token in the repo's README file: https://github.com/m-bain/whisperX?tab=readme-ov-file#speaker-diarization
Hello @Khaztaroth, yes, but will the SRT output file contain the names of the speakers or not?
Does anyone know how to write the output to VTT?
This is what I tried:

with open("output-x.vtt", "w", encoding="utf-8") as vtt_file:
    vtt_writer = whisperx.utils.WriteVTT(output_dir=".")
    vtt_writer.write_result(result, file=vtt_file, options={"max_line_width": None, "max_line_count": None, "highlight_words": False})
It gives a key error:
KeyError: 'language'
@meera
I was able to make it work. The important line is diarize_result["language"] = result["language"], since the aligned/diarized result no longer carries the "language" key that the writer expects, which is what the KeyError was about.
import whisperx
import gc
from dotenv import load_dotenv, find_dotenv
import os
from whisperx.utils import get_writer
import torch
_ = load_dotenv(find_dotenv()) # read local .env file
API_TOKEN = os.environ.get("HF_TOKEN")
print(API_TOKEN)  # sanity check; avoid leaving this in shared code
# -----------------------------------------
# Parameters
# -----------------------------------------
file = "my_file"
device = "cuda"
audio_file = f"mp3/{file}.mp3"
batch_size = 1 # reduce if low on GPU mem
compute_type = "int8" # change to "int8" if low on GPU mem (may reduce accuracy)
whisper_size = "base"
# -----------------------------------------
# Model
# -----------------------------------------
# 1. Transcribe with original whisper (batched)
model = whisperx.load_model(whisper_size, device, compute_type=compute_type)
# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)
# -----------------------------------------
# Transcription
# -----------------------------------------
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
# # Save as a TXT file
# srt_writer = get_writer("txt", "captions/")
# srt_writer(result, audio_file, {})
# # Save as an SRT file
# srt_writer = get_writer("srt", "captions/")
# srt_writer(
# result,
# audio_file,
# {"max_line_width": None, "max_line_count": None, "highlight_words": False},
# )
# # Save as a VTT file
# vtt_writer = get_writer("vtt", "captions/")
# vtt_writer(
# result,
# audio_file,
# {"max_line_width": None, "max_line_count": None, "highlight_words": False},
# )
# # Save as a TSV file
# tsv_writer = get_writer("tsv", "captions/")
# tsv_writer(result, audio_file, {})
# # Save as a JSON file
# json_writer = get_writer("json", "captions/")
# json_writer(result, audio_file, {})
print("############### before alignment ###############")
print(result["segments"]) # before alignment
# delete model if low on GPU resources (gc is already imported above)
gc.collect()
torch.cuda.empty_cache()
del model
# -----------------------------------------
# Alignment
# -----------------------------------------
# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
language_code=result["language"], device=device
)
aligned_result = whisperx.align(
result["segments"],
model_a,
metadata,
audio,
device,
return_char_alignments=False,
)
print("############### after alignment ###############")
print(aligned_result) # after alignment
# delete alignment model if low on GPU resources
gc.collect()
torch.cuda.empty_cache()
del model_a
# -----------------------------------------
# Diarization
# -----------------------------------------
# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=API_TOKEN, device=device)
# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)
diarize_result = whisperx.assign_word_speakers(diarize_segments, aligned_result)
# print(diarize_segments)
print("############### with speaker ID ###############")
print(diarize_result) # segments are now assigned speaker IDs
# -----------------------------------------
# SAVE INTO FILES
# -----------------------------------------
# import json
# with open(f"captions/{file}_DIARIZED.json", "w") as f:
#     json.dump(diarize_result, f, indent=4)
# Save as a VTT file
diarize_result["language"] = result["language"]
vtt_writer = get_writer("vtt", "captions2/")
vtt_writer(
diarize_result,
audio_file,
{"max_line_width": None, "max_line_count": None, "highlight_words": False},
)
I took inspiration from this issue.
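A note on getting the speaker labels into the subtitle text itself: if the writer in your whisperx version doesn't already prefix them, you can prepend the speaker to each segment before calling the writer. A minimal sketch, assuming assign_word_speakers added a "speaker" key to each segment it could label (unlabeled segments are left as plain text):

for seg in diarize_result["segments"]:
    speaker = seg.get("speaker")  # e.g. "SPEAKER_00"; may be absent
    if speaker:
        seg["text"] = f"[{speaker}]: {seg['text'].strip()}"
# ...then call the vtt_writer as above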
Hello, I tried whisperX and I see that the output is quite crowded: it is JSON-like, with dictionaries nested inside other dictionaries. I think there is a timestamp for each sentence, then a timestamp for each word, right? How do I turn that into a format that can be understood by video editing software (for instance SRT files)? Is there any algorithm or method that can help, please?
Thanks
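In case it helps: the crowded structure is just a list of segment dicts (each with start/end/text plus a nested word list), and for SRT you only need the segment level. A minimal sketch of a hand-rolled converter, assuming the usual WhisperX keys "start", "end", and "text" (the get_writer("srt", ...) route shown earlier in the thread does the same thing for you):

def srt_timestamp(seconds):
    # SRT timestamps look like HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

with open("output.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")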