jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

use transcribe_any to adjust whisper.app timestamps #269

Closed YeDaxia closed 9 months ago

YeDaxia commented 11 months ago

The whisper.app timestamps are really bad.

I tried using whisper.cpp with transcribe_any() and vad=True.

The parameter format I passed in:

[{
        'start': 4.0, 
        'end': 8.64, 
        'text': ' or sky could contain us, our gaze hungered starward.', 
}...]

But the timestamps are still not good enough. Is there any way to further improve the whisper.app timestamps?

jianfch commented 11 months ago

Having only segment timestamps severely limits the adjustments stable-ts can make. Try enabling word timestamps in whisper.cpp and use this format:

essential_mapping = [
    [   # 1st Segment
        {'word': ' And', 'start': 0.0, 'end': 1.28}, 
        {'word': ' when', 'start': 1.28, 'end': 1.52}, 
        {'word': ' no', 'start': 1.52, 'end': 2.26}, 
        {'word': ' ocean,', 'start': 2.26, 'end': 2.68},
        {'word': ' mountain,', 'start': 3.28, 'end': 3.58}
    ], 
    [   # 2nd Segment
        {'word': ' or', 'start': 4.0, 'end': 4.08}, 
        {'word': ' sky', 'start': 4.08, 'end': 4.56}, 
        {'word': ' could', 'start': 4.56, 'end': 4.84}, 
        {'word': ' contain', 'start': 4.84, 'end': 5.26}, 
        {'word': ' us,', 'start': 5.26, 'end': 6.27},
        {'word': ' our', 'start': 6.27, 'end': 6.58}, 
        {'word': ' gaze', 'start': 6.58, 'end': 6.98}, 
        {'word': ' hungered', 'start': 6.98, 'end': 7.88}, 
        {'word': ' starward.', 'start': 7.88, 'end': 8.64}
    ],
    ...
]

The updated non-speech suppression in https://github.com/jianfch/stable-ts/commit/191674beefdddbce026732d5fd93026f85c26772 should make it much more effective, so try updating stable-ts to 2.14.0+. See https://github.com/jianfch/stable-ts?#silence-suppression.

YeDaxia commented 10 months ago

The whisper.app only supports token-level output. Anyway, I tried using tokens as words to adjust the timestamps with transcribe_any.
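Treating tokens as words could look roughly like the sketch below. It is only an illustration under assumptions: the top-level "transcription" key, the per-segment "tokens" list, and the bracketed special tokens such as "[_BEG_]" are assumed from whisper.cpp-style full JSON and may differ in your file, the helper name tokens_to_word_mapping is hypothetical, and the token offsets in the sample are made up.

```python
def tokens_to_word_mapping(full_json):
    """Build the nested word-level list from whisper.cpp-style full JSON.

    Assumed layout (not confirmed in this thread): a top-level
    "transcription" list of segments, each with a "tokens" list whose
    items carry "text" and millisecond "offsets".
    """
    mapping = []
    for segment in full_json.get("transcription", []):
        words = []
        for token in segment.get("tokens", []):
            text = token["text"]
            # Assumption: special tokens look like "[_BEG_]" and carry no speech.
            if text.startswith("[_") and text.endswith("]"):
                continue
            # Skip tokens that contain only whitespace.
            if not text.strip():
                continue
            words.append({
                "word": text,
                "start": token["offsets"]["from"] / 1000.0,  # ms -> s
                "end": token["offsets"]["to"] / 1000.0,
            })
        if words:
            mapping.append(words)
    return mapping


# Illustrative input; the token offsets are invented for the example.
sample = {
    "transcription": [
        {
            "offsets": {"from": 17000, "to": 21000},
            "text": " And I'm Princess Holly.",
            "tokens": [
                {"text": "[_BEG_]", "offsets": {"from": 17000, "to": 17000}},
                {"text": " And", "offsets": {"from": 17000, "to": 17500}},
                {"text": " I'm", "offsets": {"from": 17500, "to": 18000}},
            ],
        }
    ]
}
mapping = tokens_to_word_mapping(sample)
```

The returned list has one inner list per segment, which matches the word-level format suggested earlier in this thread, so it could be returned from the inference function passed to transcribe_any.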

I also ran into a new problem; see this Google notebook:

Timestamps are not in ascending order. If data is produced by Stable-ts, please submit an issue.

whisper.app full json : BenAndHollySE01.json

audio file : BenAndHollySE01.wav

I tried setting check_sorted=False:

stable_whisper.transcribe_any(inference, get_drive_path('BenAndHollySE01.wav'), vad=True, check_sorted = False)

With that, some sentences were split into two parts. For example:

image

The original sentences in the whisper.app full JSON are:

{
  "offsets": {
    "from": 10000,
    "to": 15000
  },
  "text": " Everyone who lives here is very, very small."
},
{
  "offsets": {
    "from": 17000,
    "to": 21000
  },
  "text": " And I'm Princess Holly."
}

jianfch commented 10 months ago

The splitting is expected behavior because of the default regrouping. If you want to preserve the original grouping, use regroup=False. You can also customize the regrouping algorithm; see https://github.com/jianfch/stable-ts?#regrouping-words.

The out-of-order timestamps are likely caused by a bug in whisper.cpp. The start time jumps backward at the 15th line: 52.00 -> 55.74 -> 54.00. stable-ts expects the timestamps to be in ascending order. You can use force_order=True to force them into order. See https://github.com/jianfch/stable-ts/blob/ef0a87e036bacaa1b9d1261b82e74ba3118ac075/stable_whisper/non_whisper.py#L110-L112
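To locate such jumps in your own data before passing it to stable-ts, a small check like the following may help. It is a minimal sketch, not part of stable-ts: the helper name find_backward_jumps is hypothetical, and it assumes each item is a dict with a "start" time in seconds.

```python
def find_backward_jumps(segments):
    """Return (index, previous_start, start) wherever a start time decreases."""
    jumps = []
    for i in range(1, len(segments)):
        prev_start = segments[i - 1]["start"]
        start = segments[i]["start"]
        if start < prev_start:
            jumps.append((i, prev_start, start))
    return jumps


# The start times quoted above: 52.00 -> 55.74 -> 54.00
segments = [{"start": 52.00}, {"start": 55.74}, {"start": 54.00}]
jumps = find_backward_jumps(segments)
print(jumps)  # [(2, 55.74, 54.0)]
```

Each reported index points at the item whose start is earlier than the one before it, which is useful detail to include in a whisper.cpp bug report.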

YeDaxia commented 10 months ago

I tried force_order=True, but the same problem appears:

stable_whisper.transcribe_any(inference, get_drive_path('BenAndHollySE01.wav'), vad=True, force_order=True, regroup=False)
/usr/local/lib/python3.10/dist-packages/stable_whisper/result.py in raise_for_unsorted(self, check_sorted)
    695             data = self.to_dict()
    696             if check_sorted is True:
--> 697                 raise UnsortedException(data=data)
    698             warnings.warn('Timestamps are not in ascending order. '

But regroup=False works!

jianfch commented 10 months ago

The data has multiple consecutive token timestamps that are out of order. force_order=True should be able to handle it after 738fd98490584c492cf2f7873bdddaf7a0ec9d40. But keep in mind that force_order=True is a band-aid fix: some of the tokens will be set to zero duration. I'd suggest submitting an issue to the whisper.cpp repo to address the bug at its source.
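To see why forcing order can leave zero-duration tokens, here is a simplified sketch of the general idea. It is not the actual stable-ts implementation; the function name force_ascending and the sample values are hypothetical. Every start is clamped to the latest end seen so far, and every end is clamped to its own start.

```python
def force_ascending(words):
    """Clamp timestamps so they never move backward (illustration only)."""
    fixed = []
    latest = 0.0  # latest end time seen so far
    for word in words:
        start = max(word["start"], latest)
        end = max(word["end"], start)
        fixed.append({**word, "start": start, "end": end})
        latest = end
    return fixed


# Made-up tokens; the third one jumps backward in time.
words = [
    {"word": " a", "start": 52.0, "end": 55.74},
    {"word": " b", "start": 55.74, "end": 56.5},
    {"word": " c", "start": 54.0, "end": 55.0},
    {"word": " d", "start": 56.5, "end": 57.0},
]
fixed = force_ascending(words)
```

In this example the third token ends up with start and end both equal to 56.5, so it has zero duration. That is the cost of repairing the order after the fact rather than fixing the timestamps at their source.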