Closed YeDaxia closed 9 months ago
Having only segment timestamps severely limits the adjustments stable-ts can make. Try enabling word timestamps in whisper.cpp and use this format:
essential_mapping = [
[ # 1st Segment
{'word': ' And', 'start': 0.0, 'end': 1.28},
{'word': ' when', 'start': 1.28, 'end': 1.52},
{'word': ' no', 'start': 1.52, 'end': 2.26},
{'word': ' ocean,', 'start': 2.26, 'end': 2.68},
{'word': ' mountain,', 'start': 3.28, 'end': 3.58}
],
[ # 2nd Segment
{'word': ' or', 'start': 4.0, 'end': 4.08},
{'word': ' sky', 'start': 4.08, 'end': 4.56},
{'word': ' could', 'start': 4.56, 'end': 4.84},
{'word': ' contain', 'start': 4.84, 'end': 5.26},
{'word': ' us,', 'start': 5.26, 'end': 6.27},
{'word': ' our', 'start': 6.27, 'end': 6.58},
{'word': ' gaze', 'start': 6.58, 'end': 6.98},
{'word': ' hungered', 'start': 6.98, 'end': 7.88},
{'word': ' starward.', 'start': 7.88, 'end': 8.64}
],
...
]
The updated non-speech suppression in https://github.com/jianfch/stable-ts/commit/191674beefdddbce026732d5fd93026f85c26772 should make it much more effective, so try updating stable-ts to 2.14.0+. See https://github.com/jianfch/stable-ts?#silence-suppression.
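If the backend only exposes per-token output, a mapping in the format above could be built roughly as follows. This is a minimal sketch, not part of stable-ts: the `tokens` key, the per-token `offsets` in milliseconds, and the `[_..._]` pattern for special tokens are assumptions about the JSON layout and should be checked against the actual file.

```python
import re

# Assumption: special/control tokens look like "[_BEG_]" or "[_TT_150]".
SPECIAL_TOKEN = re.compile(r"^\[_.*\]$")


def segments_to_mapping(segments):
    """Turn segments with a (hypothetical) 'tokens' list into word dicts.

    Each token is assumed to carry 'text' and 'offsets' {'from', 'to'}
    in milliseconds; times are converted to seconds.
    """
    mapping = []
    for segment in segments:
        words = []
        for token in segment.get("tokens", []):
            text = token["text"]
            # Skip empty tokens and special tokens such as "[_BEG_]".
            if not text.strip() or SPECIAL_TOKEN.match(text.strip()):
                continue
            words.append({
                "word": text,
                "start": token["offsets"]["from"] / 1000.0,
                "end": token["offsets"]["to"] / 1000.0,
            })
        if words:
            mapping.append(words)
    return mapping


# Illustrative input only; the values are made up for the example.
example = [
    {
        "offsets": {"from": 0, "to": 1520},
        "text": " And when",
        "tokens": [
            {"text": "[_BEG_]", "offsets": {"from": 0, "to": 0}},
            {"text": " And", "offsets": {"from": 0, "to": 1280}},
            {"text": " when", "offsets": {"from": 1280, "to": 1520}},
        ],
    },
]
mapping = segments_to_mapping(example)
```

The returned list is what the inference function passed to transcribe_any would return. Treating each token as a word is a simplification, since a single word may span several tokens.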
whisper.app only supports token-level output. Anyway, I tried using tokens as words to adjust the timestamps with transcribe_any,
and I ran into a new problem (see this Google notebook):
Timestamps are not in ascending order. If data is produced by Stable-ts, please submit an issue.
whisper.app full json : BenAndHollySE01.json
audio file : BenAndHollySE01.wav
I tried setting check_sorted=False:
stable_whisper.transcribe_any(inference, get_drive_path('BenAndHollySE01.wav'), vad=True, check_sorted = False)
With that, some sentences were split into two parts. For example,
the original sentences in the whisper.app full JSON are:
{
"offsets": {
"from": 10000,
"to": 15000
},
"text": " Everyone who lives here is very, very small."
},
{
"offsets": {
"from": 17000,
"to": 21000
},
"text": " And I'm Princess Holly.",
}
The splitting is expected behavior because of the default regrouping. If you want to preserve the original grouping, use regroup=False. You can also customize the regrouping algorithm, see https://github.com/jianfch/stable-ts?#regrouping-words.
The out-of-order timestamps are likely caused by a bug in whisper.cpp: the start time jumps backward at the 15th line (52.00 -> 55.74 -> 54.00), but stable-ts expects the timestamps to be in ascending order. You can use force_order=True to force them into order.
https://github.com/jianfch/stable-ts/blob/ef0a87e036bacaa1b9d1261b82e74ba3118ac075/stable_whisper/non_whisper.py#L110-L112
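To illustrate what forcing timestamps into order can mean, here is a minimal sketch that clamps each start to the previous end. It is not the code at the link above and may differ from what stable-ts does internally. The start times follow the 52.00 -> 55.74 -> 54.00 jump described above; the end times and words are made up for the example.

```python
def force_ascending(words):
    """Return a copy of `words` whose start/end times never go backward.

    A start earlier than the previous end is moved up to that end, and an
    end earlier than its own start is moved up to the start.
    """
    fixed = []
    prev_end = 0.0
    for word in words:
        start = max(word["start"], prev_end)
        end = max(word["end"], start)
        fixed.append({**word, "start": start, "end": end})
        prev_end = end
    return fixed


# Hypothetical tokens: only the start times come from the thread.
tokens = [
    {"word": " a", "start": 52.0, "end": 55.74},
    {"word": " b", "start": 55.74, "end": 56.0},
    {"word": " c", "start": 54.0, "end": 55.0},  # jumps backward
]
fixed = force_ascending(tokens)
```

In this sketch the third token ends up with start and end both at 56.0, so it has zero duration. That is the cost of clamping: the order is repaired, but the timing information for the offending token is lost.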
I tried force_order=True, but the same problem appears:
stable_whisper.transcribe_any(inference, get_drive_path('BenAndHollySE01.wav'), vad=True, force_order=True, regroup=False)
/usr/local/lib/python3.10/dist-packages/stable_whisper/result.py in raise_for_unsorted(self, check_sorted)
695 data = self.to_dict()
696 if check_sorted is True:
--> 697 raise UnsortedException(data=data)
698 warnings.warn('Timestamps are not in ascending order. '
And regroup=False works!
The data has multiple consecutive token timestamps that are out of order. force_order=True should be able to handle this after 738fd98490584c492cf2f7873bdddaf7a0ec9d40. Keep in mind that force_order=True is a band-aid fix: some of the tokens will be set to zero duration. I'd suggest submitting an issue to the whisper.cpp repo so the bug is addressed at its source.
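For a bug report, it may help to list exactly where the timestamps go backward. The helper below is a hypothetical sketch (not a stable-ts function) that scans a word-level mapping, a list of segments each holding word dicts with `start` and `end` in seconds, and records every start that precedes the latest end seen so far. The sample values are illustrative.

```python
def find_unsorted(mapping):
    """Return (segment_idx, word_idx, latest_end, start) for backward jumps."""
    issues = []
    latest_end = 0.0
    for seg_idx, segment in enumerate(mapping):
        for word_idx, word in enumerate(segment):
            if word["start"] < latest_end:
                issues.append((seg_idx, word_idx, latest_end, word["start"]))
            # Track the furthest point reached so far.
            latest_end = max(latest_end, word["end"])
    return issues


# Illustrative data: the second segment starts before the first one ends.
sample = [
    [
        {"word": " a", "start": 52.0, "end": 55.74},
        {"word": " b", "start": 55.74, "end": 56.0},
    ],
    [
        {"word": " c", "start": 54.0, "end": 55.0},
    ],
]
issues = find_unsorted(sample)
```

Each tuple gives the segment index, the word index within that segment, the latest end time seen before it, and the offending start time, which is usually enough to point at the relevant lines of the source JSON.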
The whisper.app timestamps are really bad.
I tried using whisper.cpp with transcribe_any() and vad=True.
The parameter format I passed in:
But I found that the timestamps are still not good enough. Is there any way to further improve the whisper.app timestamps?