linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Import silero instead of downloading it #214

Open villesau opened 1 month ago

villesau commented 1 month ago

Silero can be imported, which means no torch etc needed: https://github.com/snakers4/silero-vad?tab=readme-ov-file#fast-start

This would make it easier to package whisper-timestamped into a Docker image, as it would avoid the hassle of pre-downloading Silero to a specific folder. I see that auditok is already imported this way.

I believe a big chunk of this could be gotten rid of along the way: https://github.com/linto-ai/whisper-timestamped/blob/master/whisper_timestamped/transcribe.py#L1952-L1981 and done this way instead: https://github.com/linto-ai/whisper-timestamped/blob/master/whisper_timestamped/transcribe.py#L2007

The user of this library could then pin the silero version in their requirements.txt

As a sidenote, I also think Silero could perhaps be used to further enhance the timestamp accuracy 🤔 Based on my quick testing, WhisperX still has a slight edge over whisper-timestamped. It gives both more precise (three digits vs two digits) and more accurate results.

Jeronymous commented 1 month ago

Silero can be imported, which means no torch etc needed

torch is needed by openai-whisper (the backend of whisper-timestamped): https://github.com/openai/whisper/blob/main/requirements.txt#L3 So in this regard, how silero is imported changes nothing (and silero itself probably uses torch...).

Also note that silero is not in the requirements of whisper-timestamped: https://github.com/linto-ai/whisper-timestamped/blob/master/requirements.txt It is only needed if silero is used.

I believe a big chunk of this could be gotten rid of along the way

This ugly piece of code is a workaround to reach old versions of silero (there are some issues in the early packagings of silero). We saw a performance degradation (for our use cases) with the latest versions of silero, so we decided to keep the old versions accessible, despite the ugly code needed to make that work. This also helps reproducibility for people experimenting with silero + whisper.

The current packaging of silero also allows several silero models (versions) to coexist on the same system, which makes it possible to call whisper-timestamped with different silero settings through a single integration.
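To illustrate the idea of several versions living side by side, here is a minimal sketch, assuming the models are fetched with `torch.hub.load` and a `repo:tag` reference. The helper names `hub_repo_for` and `load_silero` are hypothetical and are not taken from transcribe.py; the actual workaround in the linked code is more involved.

```python
from typing import Optional


def hub_repo_for(version: Optional[str] = None) -> str:
    """Build a torch.hub reference, optionally pinned to a tag (hypothetical helper)."""
    repo = "snakers4/silero-vad"
    # torch.hub accepts "owner/repo:tag"; each tag gets its own cache directory,
    # so different versions can sit next to each other on disk.
    return f"{repo}:{version}" if version else repo


def load_silero(version: Optional[str] = None):
    """Load the silero VAD model for a given tag (illustrative only)."""
    # torch is imported lazily so the helper above works without it.
    import torch

    model, utils = torch.hub.load(repo_or_dir=hub_repo_for(version), model="silero_vad")
    return model, utils
```

Calling `load_silero("v3.1")` and `load_silero()` would then resolve to two separate cached checkouts, provided the tag exists upstream.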

I understand that for some use cases it might be useful to use:

from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

An option "vad=silero_from_pip" (or similar name) could be implemented to switch to the silero that is installed with pip/python. Maybe you can open a fork and a PR with that suggestion?
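A rough sketch of what such an option could look like, assuming the `silero-vad` pip package is installed. Only the option name `silero_from_pip` comes from the suggestion above; `VAD_BACKENDS` and `select_vad` are hypothetical names and do not exist in whisper-timestamped.

```python
from typing import Callable, Dict, List

Segments = List[Dict[str, float]]


def silero_from_pip(audio_path: str) -> Segments:
    """Run the pip-installed silero-vad and return speech segments in seconds."""
    # Imported lazily so the package is only required when this backend is chosen.
    from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

    model = load_silero_vad()
    wav = read_audio(audio_path)  # loaded at 16 kHz by default
    return get_speech_timestamps(wav, model, return_seconds=True)


# Hypothetical registry mapping the value of a "vad" option to a backend.
VAD_BACKENDS: Dict[str, Callable[[str], Segments]] = {
    "silero_from_pip": silero_from_pip,
}


def select_vad(name: str) -> Callable[[str], Segments]:
    """Return the backend registered under `name`, or fail with a clear message."""
    try:
        return VAD_BACKENDS[name]
    except KeyError:
        raise ValueError(f"Unknown VAD backend: {name!r}") from None
```

With this layout, `select_vad("silero_from_pip")("audio.wav")` would return a list of `{"start": ..., "end": ...}` dictionaries, and the silero version could be pinned by the user in their own requirements.txt.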

villesau commented 1 month ago

silero_from_pip could make sense! This is what I ended up doing when deploying to Replicate: https://github.com/villesau/whisper-timestamped-replicate/commit/367dd5704a8c96bd687cdd20bb964e3eb1ba7d45 It needs to be run before building.

villesau commented 1 month ago

As a sidenote, I also think Silero could perhaps be used to further enhance the timestamp accuracy 🤔 Based on my quick testing, WhisperX still has a slight edge over whisper-timestamped. It gives both more precise (three digits vs two digits) and more accurate results.

@Jeronymous do you think this makes any sense? I'm not too familiar with either, but from what I understood about silero-vad, there could be a chance for improved accuracy.