jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Broken sentences and even words when transcribing #88

Closed · plasmaphase closed this issue 1 year ago

plasmaphase commented 1 year ago

I'm noticing not only sentences being broken apart (which is probably ok), but that words themselves are broken from one timestamp to the next:

7 00:00:22,280 --> 00:00:23,820 m your

8 00:00:23,820 --> 00:00:26,220 chair for education finance and

9 00:00:26,220 --> 00:00:28,660 will be doing some introductions

10 00:00:28,660 --> 00:00:29,660 and I'

11 00:00:29,660 --> 00:00:32,660 m going to talk about the

12 00:00:32,660 --> 00:00:34,660 microphone. When you are

plasmaphase commented 1 year ago

Is it possible to change a parameter to "relax" how sentences are broken, such that it will err on the side of longer times from one timestamp to the next?

jianfch commented 1 year ago

combine_compound=True should keep "I'm" as one word if the output has word-level timestamps. Since this seems to be only segment/phrase level, it might be a bug. What arguments and functions did you use to get these results?
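
To confirm whether the result carries word-level timestamps at all, something like this can show it (a sketch; whole_word_timestamps is an assumed 1.x field name here, so adjust to whatever keys your segments actually have):

# peek at the first few segments for word-level timing data
# 'whole_word_timestamps' is an assumed key; .get() keeps it safe if absent
for seg in result['segments'][:3]:
    print(repr(seg['text']), seg.get('whole_word_timestamps'))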

plasmaphase commented 1 year ago

I use the "medium" model, otherwise just defaults

model = load_model("medium")
result = model.transcribe(self.__audioFile)
results_to_sentence_srt(result, tsfile)
jianfch commented 1 year ago

You can try:

from stable_whisper.text_output import finalize_segment_word_ts, to_srt

# rebuild each segment from its word-level pieces, merging compounds like "I'm"
segs = finalize_segment_word_ts(result, combine_compound=True)
# each item is (text pieces, word timestamps); join the pieces and span their times
segs = [dict(text=''.join(i), start=j[0]['start'], end=j[-1]['end']) for i, j in segs]
to_srt(segs, tsfile)

plasmaphase commented 1 year ago

There are a few subtle differences when looking at the whole SRT file, some positive changes, but it seems there are still some split words.

7 00:00:24,680 --> 00:00:24,780 education finance and we'

8 00:00:24,880 --> 00:00:26,530 ll be doing

9 00:00:26,780 --> 00:00:30,240 that. I'

10 00:00:30,880 --> 00:00:33,650 m going to give you just a few

11 00:00:34,680 --> 00:00:36,200 housekeeping pieces of

12 00:00:37,720 --> 00:00:39,390 information. Your microphone is

jianfch commented 1 year ago

It looks like it might be a bug. Can you save the result as JSON and share it?

import stable_whisper

stable_whisper.save_as_json(result, 'audio.json')

If you can't share it, check to see if there is a space before the text of the segment for "m going to give you just a few". If there is a space, share the tokens for that segment.
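
A quick way to do that check from the saved JSON (a sketch; 'text' and 'tokens' are the standard Whisper segment keys):

import json

with open('audio.json') as f:
    result = json.load(f)

# find the segment in question; repr() makes a leading space visible
for seg in result['segments']:
    if 'm going to give you just a few' in seg['text']:
        print(repr(seg['text']))
        print(seg['tokens'])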

plasmaphase commented 1 year ago

What am I missing here:

from stable_whisper import save_as_json

result = model.transcribe(self.__audioFile)
save_as_json(result, 'audio.json')

Output:

   results = json.dumps(results, allow_nan=True)
  File "/usr/lib/python3.10/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type ndarray is not JSON serializable
jianfch commented 1 year ago

Which version are you using?

from stable_whisper._version import __version__ as ver
print(ver)

You can try:

import numpy as np

def list_all(x):
    # recursively convert ndarrays (and tuples) to plain lists so json can serialize them
    if isinstance(x, dict):
        for k in x:
            x[k] = list_all(x[k])
    elif isinstance(x, (list, tuple)):
        if isinstance(x, tuple):
            x = list(x)
        for i, j in enumerate(x):
            x[i] = list_all(j)
    elif isinstance(x, np.ndarray):
        x = x.tolist()
    return x

save_as_json(list_all(result), 'audio.json')
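
As an aside, the same conversion can be done with json's own default hook instead of walking the whole dict (a generic Python sketch, not a stable-ts API):

import json
import numpy as np

def np_safe(obj):
    # json calls this for anything it can't serialize natively
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')

with open('audio.json', 'w') as f:
    json.dump(result, f, default=np_safe)
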
plasmaphase commented 1 year ago

using version 1.3.0

plasmaphase commented 1 year ago

The result is attached (zipped, since GitHub doesn't let me attach a JSON file).

audiojson.zip

jianfch commented 1 year ago

It does not appear to be a bug. This is the original output from Whisper:

I' m going to talk ... you' re not speaking please turn it off they' re very sensitive

The model seems to be predicting a space token after words ending with an apostrophe. You can try using the prompt to nudge it into not adding the space, with examples that omit the spacing:

result = model.transcribe(self.__audioFile, prompt="I'm going to talk and they're very sensitive")
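
If prompting doesn't fully stop it and you only need the plain transcript text (not the SRT timing), the stray space can also be patched after the fact; a plain-Python sketch, not part of stable-ts:

import re

# rejoin contractions the model split, e.g. "I' m" -> "I'm", "they' re" -> "they're"
def fix_contractions(text):
    return re.sub(r"(\w)' (m|s|d|t|re|ve|ll)\b", r"\1'\2", text)

print(fix_contractions("I' m going to talk and they' re very sensitive"))
# I'm going to talk and they're very sensitive
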
jianfch commented 1 year ago

Ver 2.0.0 now allows you to merge segments based on the end and start of each word:

result = model.transcribe(self.__audioFile)
# merge segments that were split around an apostrophe (e.g. "I'" + "m")
result.merge_by_punctuation("'")
result.to_srt_vtt('sub.srt')