jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Need the same outputs for faster whisper and Stable whisper #237

Closed sthita-pujari closed 10 months ago

sthita-pujari commented 1 year ago
import stable_whisper

filepath = 
# Method 1: faster-whisper backend
model1 = stable_whisper.load_faster_whisper('large-v2', download_root='./model/')
options = dict(language='ko', beam_size=5, word_timestamps=True)
result = model1.transcribe_stable(filepath, vad=True, **options)
print(result)
# Method 2: default stable-ts model loaded from a local checkpoint
model = stable_whisper.load_model(f"./checkpoints/{MODEL}.pt")
LANGUAGE = 'ko'
options = dict(language=LANGUAGE, beam_size=5, word_timestamps=True)
result = model.transcribe(filepath, vad=True, **options)
results1 = result.to_dict()

The output differs between the two setups. I also noticed that the output with stable-ts==2.5.0 was much better. Now a few parts of the transcription are missing with both of the new methods.

A few errors I see with faster_whisper: the model hallucinates "You You You" at the start and end. Is vad=True different in faster-whisper and stable_whisper?

jianfch commented 1 year ago

Is vad=True different in faster-whisper and stable_whisper?

vad=True does the same thing across all the transcription methods: it adjusts the timestamps after transcription is completed, so it is unlikely to affect the transcript. The transcription is performed in 30-second chunks. The default transcribe method of stable-ts will skip any chunk in which the VAD fails to detect speech. And if suppress_ts_tokens=True, it will only allow the decoder to return segment timestamps within the time ranges where the VAD detects speech. These are the differences.
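To make the chunk-skipping behavior concrete, here is a minimal sketch in plain Python. It is not the stable-ts implementation: the function name chunks_to_transcribe, the fixed non-overlapping 30-second windows, and the example speech ranges are all assumptions made for illustration (the real decoder advances through the audio based on predicted timestamps).

```python
CHUNK_SECONDS = 30.0  # Whisper decodes audio in 30-second windows


def chunks_to_transcribe(duration, speech_ranges, chunk_seconds=CHUNK_SECONDS):
    """Return the (start, end) windows that overlap at least one speech range.

    `speech_ranges` is a list of (start, end) times in seconds, standing in
    for whatever a VAD would report. Windows without any speech are skipped.
    """
    kept = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        # Two intervals overlap when each one starts before the other ends.
        if any(s < end and e > start for s, e in speech_ranges):
            kept.append((start, end))
        start = end
    return kept


# Hypothetical VAD output for a 95-second file.
speech = [(2.0, 12.5), (65.0, 80.0)]
windows = chunks_to_transcribe(95.0, speech)
print(windows)
```

In this made-up example the windows 30-60 s and 90-95 s contain no detected speech, so they would never reach the decoder. That is one way a method that skips silent chunks can produce a different transcript from one that decodes everything.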

A few errors I see with faster_whisper: the model hallucinates "You You You" at the start and end.

Faster-whisper uses a different implementation of the model, so there are bound to be differences in the transcription result.

I also noticed that the output with stable-ts==2.5.0 was much better. Now a few parts of the transcription are missing with both of the new methods.

Many additional postprocessing steps have been added since 2.5 to reduce or remove hallucinations. One of them can be adjusted with max_instant_words. https://github.com/jianfch/stable-ts/blob/eb00d291e54d82d381a967c30385002db0c8b1ae/stable_whisper/whisper_word_level.py#L178-L179
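As a rough illustration of why this kind of postprocessing can remove parts of a transcript, the sketch below filters out segments made up mostly of zero-duration ("instant") words. It rests on assumptions: that max_instant_words is a fraction threshold applied per segment, that 0.5 is a sensible default, and that words are plain dicts with start and end times. The helper drop_instant_segments is hypothetical and is not the stable-ts code linked above.

```python
def drop_instant_segments(segments, max_instant_words=0.5):
    """Keep segments whose share of zero-duration words is within the limit.

    `segments` is a list of segments, each a list of word dicts with
    "word", "start", and "end" keys (a simplified, assumed structure).
    Empty segments are dropped.
    """
    kept = []
    for words in segments:
        if not words:
            continue
        # A word is "instant" when it has no measurable duration.
        instant = sum(1 for w in words if w["end"] - w["start"] <= 0)
        if instant / len(words) <= max_instant_words:
            kept.append(words)
    return kept


# Made-up data: a hallucinated run of instant words and a normal segment.
segments = [
    [{"word": "You", "start": 0.0, "end": 0.0} for _ in range(3)],
    [
        {"word": "안녕하세요", "start": 1.0, "end": 1.6},
        {"word": "여러분", "start": 1.6, "end": 2.1},
    ],
]
kept = drop_instant_segments(segments)
print(len(kept), kept[0][0]["word"])
```

Under these assumptions, raising the threshold toward 1.0 keeps more segments, hallucinated or not, which may bring the output closer to what 2.5.0 produced.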