Get true head & tail timestamps from segment first/last words

p4-k4 commented 1 year ago

Would it be possible to use the timestamps of the first and last words in each segment to get true start and end times when generating subtitles?

Currently, subtitles are gapless (they don't start and end with the dialogue), although I may have seen that this was being worked on over in the whisper repo.
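
A minimal sketch of the request, assuming each segment already carries word-level timestamps. The `Word` and `Segment` classes and the `trim_to_words` helper are hypothetical stand-ins for illustration, not stable-ts or whisper types:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    # Hypothetical word record with its own start/end time in seconds.
    text: str
    start: float
    end: float


@dataclass
class Segment:
    # Hypothetical segment: model-predicted bounds plus the words inside it.
    start: float
    end: float
    words: List[Word] = field(default_factory=list)


def trim_to_words(segment: Segment) -> Segment:
    """Return a copy whose bounds are the first word's start and last word's end."""
    if not segment.words:
        # Nothing to anchor on, so keep the predicted bounds.
        return segment
    return Segment(
        start=segment.words[0].start,
        end=segment.words[-1].end,
        words=segment.words,
    )


# Two "gapless" segments: each one ends exactly where the next begins.
segments = [
    Segment(0.0, 4.0, [Word("Hello", 0.6, 1.0), Word("there.", 1.1, 1.5)]),
    Segment(4.0, 9.0, [Word("Goodbye.", 6.2, 6.9)]),
]
trimmed = [trim_to_words(s) for s in segments]
```

The result is only as good as the word timestamps themselves, so this moves the accuracy problem from segment level to word level rather than solving it.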

jianfch commented 1 year ago

By "true timestamps", I'm assuming you mean accurate segment timestamps that match how a human would time the dialogue. The start and end of each segment are dictated by the model's prediction, so that is not entirely within our control. Let's say we force it to always end at a period or a specific word. Then that decoded ending timestamp is less likely to be accurate than what is produced by the current heuristics (which let the model decide for itself when to end the segment). The "gapless" results are what the suppressing silence (or ignore silence) feature of stable-ts tries to reduce, but it doesn't always work.

p4-k4 commented 1 year ago

Yeah, in that case the start/end of each segment is outside of our control, at least for now.

The inverse approach would be to detect everything other than speech, which would then give us the correct start/end times of segments, although it would have to be a post-process, at least for now.

> Let's say we force it to always end at a period or a specific word. Then that decoded ending timestamp is less likely to be accurate than what is produced by the current heuristics (that lets the model decide for itself when to end the segment).

Correct, I just checked and it's not accurate or reliable this way. For now, I'll be using speechbrain as a post-process to get the start/end of segments, but that's inaccurate too.
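
A rough sketch of that kind of post-process, assuming some voice activity detector has already produced a list of speech intervals in seconds. The `snap_to_speech` helper and the sample intervals are hypothetical; no speechbrain or stable-ts API is used here:

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def snap_to_speech(segment: Interval, speech: List[Interval]) -> Interval:
    """Shrink a segment to the span of detected speech that it overlaps."""
    seg_start, seg_end = segment
    # Clip every overlapping speech interval to the segment's own bounds.
    overlaps = [
        (max(s, seg_start), min(e, seg_end))
        for s, e in speech
        if s < seg_end and e > seg_start
    ]
    if not overlaps:
        # No speech detected inside the segment, so leave it unchanged.
        return segment
    return (min(s for s, _ in overlaps), max(e for _, e in overlaps))


# Assumed detector output and two gapless subtitle segments.
speech_intervals = [(0.5, 1.6), (6.0, 7.0)]
subtitle_segments = [(0.0, 4.0), (4.0, 9.0)]
snapped = [snap_to_speech(seg, speech_intervals) for seg in subtitle_segments]
```

The segments inherit whatever errors the detector makes, which matches the inaccuracy noted above.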

Ah well, we'll see soon how things develop with this. Cheers

jianfch commented 1 year ago

Version 2.0.0 enables this now.

jianfch / stable-ts

Get true head & tail timestamps from segment first/last words #87