Is it possible to change a parameter to "relax" how sentences are broken, such that it will err on the side of longer times from one timestamp to the next?
combine_compound=True should keep "I'm" as one word if the output has word-level timestamps. Since this seems to be only segment/phrase level, it might be a bug. What were the arguments and functions you used to get these results?
I use the "medium" model, otherwise just defaults
from stable_whisper import load_model, results_to_sentence_srt

model = load_model("medium")
result = model.transcribe(self.__audioFile)
results_to_sentence_srt(result, tsfile)
You can try:
from stable_whisper.text_output import finalize_segment_word_ts, to_srt

# regroup the word-level timestamps so compound words (e.g. "I'm") stay together
segs = finalize_segment_word_ts(result, combine_compound=True)
# each item is (words, word timestamps); rebuild each segment's text and time span
segs = [dict(text=''.join(i), start=j[0]['start'], end=j[-1]['end']) for i, j in segs]
to_srt(segs, tsfile)
There are a few subtle differences when looking at the whole SRT file, some positive changes, but it seems there are still some split words.
7
00:00:24,680 --> 00:00:24,780
education finance and we'

8
00:00:24,880 --> 00:00:26,530
ll be doing

9
00:00:26,780 --> 00:00:30,240
that. I'

10
00:00:30,880 --> 00:00:33,650
m going to give you just a few

11
00:00:34,680 --> 00:00:36,200
housekeeping pieces of

12
00:00:37,720 --> 00:00:39,390
information. Your microphone is
It looks like it might be a bug. Can you save the results as JSON and share it?
stable_whisper.save_as_json(result, 'audio.json')
If you can't share it, check to see if there is a space before the text of the segment for "m going to give you just a few". If there is a space, share the tokens for that segment.
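If it helps, a quick way to check that (assuming result is the usual Whisper result dict whose 'segments' each carry 'text' and 'tokens'):

# repr() makes a leading space visible; also print the tokens of the suspect segment
for seg in result['segments']:
    if 'm going to give you just a few' in seg['text']:
        print(repr(seg['text']))
        print(seg['tokens'])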
What am I missing here:
result = model.transcribe(self.__audioFile)
save_as_json(result, 'audio.json')
Output:
results = json.dumps(results, allow_nan=True)
File "/usr/lib/python3.10/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/usr/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type ndarray is not JSON serializable
Which version are you using?
from stable_whisper._version import __version__ as ver
print(ver)
You can try:
import numpy as np
from stable_whisper import save_as_json

# recursively convert ndarrays (and tuples) to lists so the result is JSON-serializable
def list_all(x):
    if isinstance(x, dict):
        for k in x:
            x[k] = list_all(x[k])
    elif isinstance(x, (list, tuple)):
        if isinstance(x, tuple):
            x = list(x)
        for i, j in enumerate(x):
            x[i] = list_all(j)
    elif isinstance(x, np.ndarray):
        x = x.tolist()
    return x

save_as_json(list_all(result), 'audio.json')
Using version 1.3.0.
The result is attached (zipped, since GitHub doesn't let me attach a JSON file).
It does not appear to be a bug. This is the original output from whisper:
I' m going to talk ... you' re not speaking please turn it off they' re very sensitive
The model seems to be predicting a space token after words ending with an apostrophe. You can try using a prompt to nudge it to stop adding the space, with examples that don't have the spacing:
result = model.transcribe(self.__audioFile, prompt="I'm going to talk and they're very sensitive")
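If prompting doesn't help, a plain-Python workaround (just a sketch, assuming the usual result dict with 'segments' carrying 'text', 'start', and 'end') is to merge a segment into the previous one whenever the previous text ends with an apostrophe:

def merge_apostrophe_splits(segments):
    merged = []
    for seg in segments:
        # a segment ending in an apostrophe means the next segment starts mid-word
        if merged and merged[-1]['text'].rstrip().endswith("'"):
            merged[-1]['text'] = merged[-1]['text'].rstrip() + seg['text'].lstrip()
            merged[-1]['end'] = seg['end']
        else:
            merged.append(dict(seg))
    return merged

result['segments'] = merge_apostrophe_splits(result['segments'])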
Version 2.0.0 now allows you to merge segments based on the end and start of each word.
result = model.transcribe(self.__audioFile)
# merge segments that were split at an apostrophe back into a single segment
result.merge_by_punctuation("'")
result.to_srt_vtt('sub.srt')
I'm noticing not only that sentences are being broken apart (which is probably OK), but that words themselves are broken from one timestamp to the next:
00:00:22,280 --> 00:00:23,820
m your

8
00:00:23,820 --> 00:00:26,220
chair for education finance and

9
00:00:26,220 --> 00:00:28,660
will be doing some introductions

10
00:00:28,660 --> 00:00:29,660
and I'

11
00:00:29,660 --> 00:00:32,660
m going to talk about the

12
00:00:32,660 --> 00:00:34,660
microphone. When you are