linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Punctuation and capitalisation #106

Closed mirix closed 1 year ago

mirix commented 1 year ago

Hi,

I am testing whisper-timestamped but the output is neither punctuated nor capitalised.

Here is my code:

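import os
from functools import reduce

import whisper_timestamped as whisper

# `model`, `audio` and `audio_track` are assumed to be defined earlier (not shown in the post).
result = whisper.transcribe(model, audio, beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), vad=True, detect_disfluencies=True)

def secondsToStr(t):
    # Convert seconds to an SRT timestamp (HH:MM:SS,mmm) via successive divmod calls.
    return "%02d:%02d:%02d,%03d" % \
        reduce(lambda ll,b : divmod(ll[0],b) + ll[1:],
            [(round(t*1000),),1000,60,60])

# Write one SRT cue per segment; splitext keeps dots in directory names intact.
with open(os.path.splitext(audio_track)[0] + '.srt', 'w', encoding = 'utf-8') as f:
    for segment in result['segments']:
        f.write(str(segment['id'] + 1) + '\n')
        f.write(secondsToStr(segment['start']) + ' --> ' + secondsToStr(segment['end']) + '\n')
        f.write(segment['text'] + '\n\n')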

Is there a specific option or do I need to use json.dump or something?

Best,

Ed
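For readers who find the reduce-based timestamp helper above hard to follow, here is a minimal sketch of the same conversion written with explicit divmod steps, plus a small function that builds the SRT text in memory. The names seconds_to_srt and segments_to_srt are hypothetical and not part of whisper-timestamped; the segment keys ('id', 'start', 'end', 'text') are the ones used in the code above. Stripping whitespace around the text is an assumption about what looks best in a subtitle file.

```python
def seconds_to_srt(t: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(t * 1000)
    seconds, ms = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from a list of segment dicts (hypothetical helper)."""
    cues = []
    for segment in segments:
        cues.append(
            f"{segment['id'] + 1}\n"
            f"{seconds_to_srt(segment['start'])} --> {seconds_to_srt(segment['end'])}\n"
            f"{segment['text'].strip()}\n"  # assumption: drop surrounding whitespace
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

The resulting string can be written with a single f.write call. Note that this only changes formatting: punctuation and capitalisation come from the model output in segment['text'], not from how the file is written.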

mirix commented 1 year ago

Also, in some cases, utterances from different speakers are glued into the same segment. But I need to test several files to see if the prevalence of this issue is higher or lower than it is with other approaches.

Jeronymous commented 1 year ago

> I am testing whisper-timestamped but the output is neither punctuated nor capitalised.

There is no particular reason why this should happen. Whisper is a statistical model. Can you try other model sizes, and also vad=False?

If it persists, can you share the audio and tell me which model size you are using?

> utterances from different speakers are glued into the same segment.

Yes, this is also because it's a statistical model, trained mostly on YouTube subtitles, where the segmentation into subtitles depends on sentence lengths and not necessarily on speaker turns. Whisper was not trained to do speaker diarization. I guess the speaker turns in your audio are quite short, for you to observe that?

Here also, using vad=False might help.
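The suggestion to try other model sizes and vad=False can be checked systematically. Below is a rough sketch, assuming the whisper_timestamped package is installed and imported as in its README. The helper looks_unformatted and the function compare_settings are hypothetical names, and the list of model sizes is an arbitrary choice for illustration.

```python
def looks_unformatted(segments) -> bool:
    """Return True if no segment text has punctuation or an uppercase letter."""
    text = " ".join(segment["text"] for segment in segments)
    has_punctuation = any(ch in ".,;:!?" for ch in text)
    has_uppercase = any(ch.isupper() for ch in text)
    return not (has_punctuation or has_uppercase)


def compare_settings(audio_path, model_sizes=("tiny", "small", "medium")):
    """Transcribe with each model size, with and without VAD, and report the result."""
    # Imported lazily so the helper above can be used without the library.
    import whisper_timestamped as whisper

    audio = whisper.load_audio(audio_path)
    for size in model_sizes:
        model = whisper.load_model(size)
        for vad in (True, False):
            result = whisper.transcribe(model, audio, vad=vad)
            status = "unformatted" if looks_unformatted(result["segments"]) else "ok"
            print(f"model={size} vad={vad}: {status}")
```

Calling compare_settings with the path to the problematic file prints one line per combination, which makes it easier to report which model size and VAD setting reproduce the problem.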

mirix commented 1 year ago

The issue has vanished.