m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 4-Clause "Original" or "Old" License

Convert srt with --highlight_word format to normal srt (or vtt) #701

Open aindilis opened 4 months ago

aindilis commented 4 months ago

Hi,

I have SRT files generated with --highlight_word, with output like that shown in:

https://github.com/m-bain/whisperX/issues/539

I would like to use the CLI to convert files in this format into a format where individual words are not highlighted, giving normal (but speaker-diarized and properly timestamped) output, which might look like:

3
00:00:04,578 --> 00:00:08,040
[SPEAKER_00]: So, first, is there anything you want to know about me first?

...

First, I was wondering if such a script already existed, or whether I should write one.

Second, does the individually word-tagged diarized .srt format have an actual name?

Third, I'd like to be able to go in and identify the speakers after the fact, and put their names in place of [SPEAKER_00]. Is there a tool for that, or should I write one as well?

Thanks,

Andrew
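
One possible way to do the conversion asked about above, as a minimal sketch rather than an existing whisperX feature. It assumes that the --highlight_word output repeats the full segment text in every cue and marks the current word with `<u>...</u>` tags. Under that assumption, stripping the tags and merging consecutive cues with identical text recovers one cue per segment. The function names `parse_srt` and `collapse_highlighted` are hypothetical.

```python
import re


def parse_srt(text):
    """Parse SRT text into a list of (start, end, body) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        # A cue needs an index line, a timing line, and at least one text line.
        if len(lines) < 3 or "-->" not in lines[1]:
            continue
        start, end = (part.strip() for part in lines[1].split("-->"))
        cues.append((start, end, "\n".join(lines[2:])))
    return cues


def collapse_highlighted(text):
    """Strip <u> tags and merge consecutive cues that carry the same text."""
    merged = []
    for start, end, body in parse_srt(text):
        plain = re.sub(r"</?u>", "", body)
        if merged and merged[-1][2] == plain:
            # Same segment as the previous cue: keep its start, extend its end.
            merged[-1] = (merged[-1][0], end, plain)
        else:
            merged.append((start, end, plain))
    # Renumber the cues from 1 and rebuild the SRT text.
    return "".join(
        f"{i}\n{start} --> {end}\n{body}\n\n"
        for i, (start, end, body) in enumerate(merged, 1)
    )
```

A known limitation of this approach: two genuinely separate segments that happen to have identical text and follow each other directly would also be merged.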

abhi2596 commented 2 months ago
def convert_time_format(start_time, end_time):
    """Format start and end times (in seconds) as a WebVTT cue timing line."""

    def fmt(seconds):
        # Round to whole milliseconds first; truncating the fractional part
        # can lose a millisecond to floating-point error.
        total_ms = int(round(seconds * 1000))
        hours, rest = divmod(total_ms, 3_600_000)
        minutes, rest = divmod(rest, 60_000)
        secs, millis = divmod(rest, 1000)
        # Include hours so recordings longer than 59:59 stay valid WebVTT.
        return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

    return f"{fmt(start_time)} --> {fmt(end_time)}"


def write_to_result(result, file_path):
    """Write result["segments"] to file_path as WebVTT, one cue per segment."""
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for segment in result["segments"]:
            f.write(convert_time_format(segment["start"], segment["end"]) + "\n")
            # Prefix the speaker label only when diarization assigned one.
            speaker = f'[[{segment["speaker"]}]] ' if "speaker" in segment else ""
            f.write(speaker + segment["text"].strip().replace("\t", " ") + "\n\n")
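
For the third question in the original post (replacing labels such as [SPEAKER_00] with real names after the fact), a minimal sketch that works on the text of an already written SRT or VTT file. The function name `rename_speakers` and the mapping are hypothetical, and the pattern assumes labels of the form SPEAKER_ followed by digits inside square brackets, as in the example output above.

```python
import re


def rename_speakers(text, names):
    """Replace bracketed SPEAKER_NN labels using the names mapping."""

    def swap(match):
        label = match.group(1)
        # Labels missing from the mapping are left unchanged.
        return "[" + names.get(label, label) + "]"

    return re.sub(r"\[(SPEAKER_\d+)\]", swap, text)


# Example with a hypothetical mapping; unmapped speakers keep their label.
print(rename_speakers("[SPEAKER_00]: Hi\n[SPEAKER_01]: Hello", {"SPEAKER_00": "Andrew"}))
```

Because only the inner bracket pair is matched, the same function also handles the double-bracket style written by write_to_result, turning [[SPEAKER_00]] into [[Andrew]].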