jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

How do I get timestamps by parsing an audio file? #329

Closed david-95 closed 7 months ago

david-95 commented 7 months ago

Thank you for your efforts to look into my issue.

I am trying to parse a wav file to find the clip that matches my text. First I call transcribe to get a WhisperResult, then call result.segments to get all segments, and traverse the segments to find the one whose text matches my text. But I cannot understand segment.start and segment.end. I want the start/end timestamps so I can cut the wav by calling "ffmpeg -i src.wav -ss start_timestamp -to end_timestamp -c copy tar.wav", but this failed. The root cause seems to be that segment.start and segment.end are not the timestamps. Can you please tell me how to get a segment's timestamp pair?
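
For context, here is a minimal sketch of that matching step (the model size, file name, and target string are placeholders; it only relies on the Segment start/end/text attributes discussed below):

    import stable_whisper

    model = stable_whisper.load_model('base')
    result = model.transcribe('src.wav')      # WhisperResult

    target = 'wuthering heights'
    for segment in result.segments:
        # segment.start / segment.end are floats in seconds
        if target in segment.text.lower():
            print(segment.start, segment.end, segment.text)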

jianfch commented 7 months ago

Segment.start and Segment.end are the timestamps in seconds. The problem is that trimming with -c copy is not accurate. You need to specify a codec for re-encoding the audio (e.g. -c:a aac tar.aac).
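
For example, a sketch of the cut from Python (file names are placeholders; ffmpeg must be on the PATH):

    import subprocess

    start, end = segment.start, segment.end   # seconds, from the matched segment
    subprocess.run([
        'ffmpeg', '-i', 'src.wav',
        '-ss', str(start), '-to', str(end),
        '-c:a', 'aac',                        # re-encode instead of -c copy for an accurate cut
        'tar.aac',
    ], check=True)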

david-95 commented 7 months ago

What confuses me is that segment.start is not identical to the timestamps in the .srt file. Please see:

----- in the .srt ----- 00:00:01,300 --> 00:00:01,720 Wuthering Heights

----- in the segment ----- (5.22, 5.32)

I guess the segment timestamps lag behind the timestamps in the .srt because they reflect the real time when the model finished processing. If I am right, how can I get the correct timestamps? How can I know the offset?
It seems it takes more effort to get matching timestamps from the .srt file.

jianfch commented 7 months ago

The timestamps in the result are mostly finalized and should generally remain identical to the timestamps in the output file, except for parts with a duration shorter than min_dur, which is 0.02 seconds by default for all of the result-to-output methods. But if this is an edge-case bug, it would be easier to figure out the cause if you can save the result as JSON and share it.
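
For instance (a minimal sketch, assuming the save_as_json and to_srt_vtt output methods from the stable-ts README; file names are placeholders):

    result = model.transcribe('src.wav')
    result.to_srt_vtt('src.srt')       # the output the .srt timestamps come from
    result.save_as_json('src.json')    # full result, easiest form to share for debugging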

david-95 commented 7 months ago

I think I found the reason. I called model.transcribe in two different ways and got different results:

-- result = model.transcribe(audio_path); result.segments >>>
   Segment(start=1.3, end=2.04, text=" Wuthering Heights"), Segment(start=2.68, end=3.96, text=" by Emily Bronte"), ...

-- result = model.transcribe(audio_path, word_timestamps=False); result.segments >>>
   Segment(start=5.22, end=5.32, text=" Wuthering Heights by Emily Bronte"), Segment(start=5.32, end=5.7, text=" CHAPTER I")

I don't know why the word_timestamps=False parameter makes such a difference, but obviously it doesn't make sense.

[image]

jianfch commented 7 months ago

Generally, I'd advise against using word_timestamps=False because its timestamps are predicted via a less reliable method than the one used by word_timestamps=True (the default). word_timestamps=False also severely limits the adjustments that can be made to correct the timestamps after the fact.
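
A minimal sketch with the default settings, assuming the word-level attributes (Segment.words with .start/.end/.word) that stable-ts attaches when word_timestamps=True:

    result = model.transcribe(audio_path)   # word_timestamps=True is the default
    for segment in result.segments:
        print(segment.start, segment.end, segment.text)
        for word in segment.words:
            print('   ', word.start, word.end, word.word)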

david-95 commented 7 months ago

Thanks for your help! I am changing my code accordingly.