jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper

shift or expand end timestamp? #279

Closed: drohack closed this issue 10 months ago

drohack commented 10 months ago

I have an issue where, no matter what I do, the subtitle timing is always early: the text shows up before the speaker starts talking and disappears before the person is done talking. I've tried all sorts of different models, normal Whisper vs. faster_whisper, different transcribe settings, and using align() and refine(). Nothing seems to work. This happens both on stable-ts and when using faster_whisper directly.

I know there are other options to post-process this. But I'm wondering if others are having this issue, and if there could be an option in stable-ts to shift the times as a whole. Ideally it would shift just the end times of segments, and only adjust the start times of segments that would otherwise overlap (as I don't mind the subtitle showing up a little early). I was really hoping refine() would do this.

My default settings:

    import stable_whisper

    model = stable_whisper.load_faster_whisper("large-v3", device="cuda", compute_type="auto", num_workers=5)
    result = model.transcribe_stable(audio=audio_file_path, beam_size=1, language='ja', temperature=0,
                                     word_timestamps=True, condition_on_previous_text=False, no_speech_threshold=0.1)

VAD and Demucs really throw off the transcription, so I don't normally use them, though I'm not familiar enough with their settings to adjust their sensitivity well. I've had some luck using DeepFilterNet to pre-process the audio and reduce background noise, but that's because I could listen to the audio it produced and fine-tune the settings. (I probably need to do something similar for VAD or Demucs.)

align() typically makes the matter even worse, but only slightly: it usually moves the end time of each segment just a bit earlier (the opposite of what I want). And I can't get refine() to do anything. Every time I try to use it, it produces the same result as the initial input... (It also doesn't work with transcribe_stable(), but that's a different issue.)

I've triple-checked that the audio I'm stripping from the video file is the same duration and sample rate as the original.

drohack commented 10 months ago

OK, after a little bit of testing it seems like my subs are exactly 1 second off? I still don't know why, but I was able to fix it with pysrt:

    import pysrt

    # Write the transcription result to an SRT file
    result.to_srt_vtt(output_srt_path + ".stable_whisper.stable-jp-faster.srt")

    # Load the SRT file
    subtitle_file_path = output_srt_path + ".stable_whisper.stable-jp-faster.srt"
    subs = pysrt.open(subtitle_file_path, encoding='utf-8')

    # Example: delay all subtitle timings by 0.8 seconds
    subs.shift(seconds=0.8)

    # Save the modified subtitles to a new file
    modified_subtitle_file_path = output_srt_path + ".stable_whisper.stable-jp-faster-shift8.srt"
    subs.save(modified_subtitle_file_path, encoding='utf-8')

It would be great to update stable-ts with pysrt's functionality, mainly the shift() function. I think I could shift the times by looping through the result, but because the result object isn't very well documented I'm not exactly sure how.
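
For reference, here is a minimal sketch of the end-time-only behavior described above, done at the SRT level with pysrt (which the snippet above already uses). The 0.8 second value and the clamping against the next entry's start are assumptions for illustration, not part of the original post:

    import pysrt

    subs = pysrt.open(subtitle_file_path, encoding='utf-8')

    # Extend only the end time of each subtitle, leaving start times alone,
    # and clamp so an entry never runs past the start of the next one.
    for i, cur in enumerate(subs):
        cur.end.shift(milliseconds=800)
        if i + 1 < len(subs) and cur.end > subs[i + 1].start:
            cur.end = pysrt.SubRipTime.from_ordinal(subs[i + 1].start.ordinal)

    subs.save(modified_subtitle_file_path, encoding='utf-8')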

jianfch commented 10 months ago

You can shift all the timestamps of the result with result.offset_time() (e.g. to increase them by 0.5 seconds: result.offset_time(0.5)).
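
For example, a minimal sketch (assuming result is the WhisperResult from the transcribe_stable() call above, and reusing the 0.8 second shift from the earlier post):

    # shift every segment and word timestamp in the result forward by 0.8 seconds
    result.offset_time(0.8)
    result.to_srt_vtt(output_srt_path + ".stable_whisper.stable-jp-faster-shift8.srt")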

> But that's because I could listen to the audio it produced and fine-tune the settings. (I probably need to do something similar for VAD or Demucs.)

You can listen to the audio output of Demucs by passing demucs_options=dict(save_path="demucs_output.mp3") to the transcribe functions. Note that Demucs by default will produce slightly different outputs each time you run it, but you can make it deterministic by setting the same seed before each run, e.g. random.seed(0). While an audio output based on the VAD is not currently supported, you can visualize its suppression with https://github.com/jianfch/stable-ts?#visualizing-suppression.
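
Put together, that could look something like the following (a hedged sketch: demucs_options and the seeding come from the comment above, while the demucs=True flag and the model choice are assumptions that may vary by stable-ts version):

    import random

    import stable_whisper

    random.seed(0)  # same seed each run so Demucs output is deterministic

    model = stable_whisper.load_model("base")
    # enable Demucs denoising and save its output so it can be checked by ear
    result = model.transcribe(audio_file_path, demucs=True,
                              demucs_options=dict(save_path="demucs_output.mp3"))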

Note that align() does not work as well with the larger models, especially large-v3; try the base model instead. refine() does not work well with results that have low probabilities to begin with (e.g. results from align()).
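
For example, a minimal sketch (it assumes align() accepts the audio path plus an existing result and a language code; the output path is illustrative):

    # re-align the existing transcript with the base model, which tends to
    # give better timings for align() than large-v3
    align_model = stable_whisper.load_model("base")
    aligned = align_model.align(audio_file_path, result, language='ja')
    aligned.to_srt_vtt(output_srt_path + ".aligned.srt")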