Need the same outputs for faster whisper and Stable whisper

jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper

MIT License


import stable_whisper

filepath = ...  # audio path was left blank in the original post
model1 = stable_whisper.load_faster_whisper('large-v2', download_root='./model/')
options = dict(language='ko', beam_size=5, word_timestamps=True)
result = model1.transcribe_stable(filepath, vad=True, **options)
print(result)

model = stable_whisper.load_model(f"./checkpoints/{MODEL}.pt")
LANGUAGE = 'ko'
options = dict(language=LANGUAGE, beam_size=5, word_timestamps=True)
result = model.transcribe(filepath, vad=True, **options)
results1 = result.to_dict()

Does vad=True behave differently in faster-whisper and stable_whisper?

vad=True does the same thing in every transcription method: it adjusts the timestamps after transcription is completed, so it is unlikely to affect the transcript itself. The transcription is performed in 30-second chunks. The differences are in the default transcribe method of stable-ts: it skips any chunk in which the VAD fails to detect speech, and, if suppress_ts_tokens=True, it only allows the decoder to return segment timestamps that fall within the time ranges where the VAD detects speech.
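To make the chunk skipping and the timestamp restriction concrete, here is a rough pure-Python sketch. It is not the stable-ts implementation: the function names, the fixed 30-second stepping, and the (start, end) speech ranges are assumptions made for illustration only.

```python
CHUNK_SECONDS = 30.0  # chunk length mentioned above


def chunks_with_speech(duration, speech_ranges, chunk_seconds=CHUNK_SECONDS):
    """Return the (start, end) chunks that overlap at least one speech range.

    speech_ranges is a hypothetical list of (start, end) tuples in seconds,
    standing in for whatever the VAD reports.
    """
    kept = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        # Keep the chunk only if some speech range overlaps it.
        if any(s < end and e > start for s, e in speech_ranges):
            kept.append((start, end))
        start = end
    return kept


def timestamp_allowed(ts, speech_ranges):
    """True if a timestamp falls inside a detected speech range.

    Loosely mirrors the idea behind suppress_ts_tokens=True described above.
    """
    return any(s <= ts <= e for s, e in speech_ranges)


speech = [(2.0, 12.5), (70.0, 80.0)]
print(chunks_with_speech(95.0, speech))
print(timestamp_allowed(5.0, speech), timestamp_allowed(40.0, speech))
```

With these made-up ranges, the chunks starting at 30 s and 90 s contain no speech and would be skipped, which is how a method that skips silent chunks can end up with a different transcript from one that decodes every chunk.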

A few errors I see with faster_whisper: the model hallucinates at the start and end ("You You You").

Faster-whisper uses a different implementation of the model, so there are bound to be differences in the transcription result.

I also noticed that the output of stable-ts==2.5.0 was much better. Now a few parts of the transcription are missing with both of the new methods.

A lot of additional postprocessing has been added since 2.5 to reduce or remove hallucinations. One of those steps can be adjusted with max_instant_words. https://github.com/jianfch/stable-ts/blob/eb00d291e54d82d381a967c30385002db0c8b1ae/stable_whisper/whisper_word_level.py#L178-L179
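As a rough illustration of why such a filter can also remove real speech, here is a small sketch. It assumes that max_instant_words is the maximum fraction of zero-duration ("instant") words a segment may contain before it is dropped; that reading, the helper names, the dict layout, and the 0.5 threshold are all assumptions, so check the linked source for the actual behavior.

```python
def instant_word_fraction(words):
    """Fraction of words whose end time is not after their start time."""
    if not words:
        return 0.0
    instant = sum(1 for w in words if w["end"] - w["start"] <= 0.0)
    return instant / len(words)


def drop_instant_segments(segments, max_instant_words=0.5):
    """Keep segments whose instant-word fraction does not exceed the threshold.

    Hypothetical helper: segments are plain dicts with a "words" list,
    not stable-ts result objects.
    """
    return [
        seg for seg in segments
        if instant_word_fraction(seg["words"]) <= max_instant_words
    ]


segments = [
    {"text": "You You You", "words": [
        {"word": "You", "start": 0.0, "end": 0.0},
        {"word": "You", "start": 0.0, "end": 0.0},
        {"word": "You", "start": 0.0, "end": 0.0},
    ]},
    {"text": "hello there", "words": [
        {"word": "hello", "start": 1.0, "end": 1.4},
        {"word": "there", "start": 1.4, "end": 1.9},
    ]},
]
print([seg["text"] for seg in drop_instant_segments(segments)])
```

Under this assumption, raising the threshold keeps more segments (fewer missed parts, more hallucinations), and lowering it does the opposite.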


Need the same outputs for faster whisper and Stable whisper #237