linto-ai / linto-stt

An automatic speech recognition API
GNU Affero General Public License v3.0

LinTO STT with Whisper #19

Closed Jeronymous closed 7 months ago

Jeronymous commented 1 year ago

Integration of Whisper in LinTO STT (openai whisper and faster_whisper).

The Whisper model to use is specified with the MODEL environment variable, which can be either:

Note that:

1. There are currently no streaming capabilities (I removed everything related to this).
2. Models are loaded lazily, to avoid deadlock problems (see the sketch after this list).
3. With GPU, to avoid CUDA initialization errors, celery is run with `--pool=solo` and the HTTP server uses gevent instead of gunicorn. Both imply that there won't be concurrent jobs...
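As a rough illustration of points 2 and 3, here is a minimal sketch of lazy loading driven by the MODEL environment variable. The helper name `get_model` and the DEVICE variable are assumptions for the example, not the actual LinTO STT code; faster_whisper would expose an equivalent `WhisperModel` class.

```python
import os

import whisper  # openai-whisper; faster_whisper offers a similar WhisperModel class

DEVICE = os.environ.get("DEVICE", "cpu")
_model = None  # nothing loaded at import time, so no CUDA context is created in the parent process


def get_model():
    """Load the Whisper model on first use only (lazy loading)."""
    global _model
    if _model is None:
        # MODEL can name a Whisper model size (e.g. "medium") or a local path
        _model = whisper.load_model(os.environ["MODEL"], device=DEVICE)
    return _model
```

With a GPU, the worker is then started so that a single process owns the CUDA context, e.g. something like `celery -A <app> worker --pool=solo` (command shown only as an illustration).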


DEPRECATED notes from the first implementation, which used wav2vec models to align the output and get word timestamps:

The tricky part is producing word alignments for what Whisper recognizes (it returns text for the discursive segments it detects, without any means of recovering word positions). This is currently done with a SpeechBrain wav2vec model (which returns character probabilities), hosted on the LinTO website.
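For context, here is a minimal sketch of the kind of forced alignment involved, assuming a matrix of per-frame character log-probabilities such as a wav2vec/CTC model produces. The function and the dynamic programming below are an illustration that ignores the CTC blank token; they are not the code actually used.

```python
import numpy as np


def forced_align(log_probs: np.ndarray, targets: list[int]) -> list[int]:
    """Return, for each target character, the frame index where it starts.

    log_probs: (n_frames, n_chars) per-frame character log-probabilities.
    targets:   character indices of the known transcript, in order.
    """
    n_frames, _ = log_probs.shape
    n_tokens = len(targets)
    assert n_frames >= n_tokens, "need at least one frame per character"

    # trellis[t, j] = best score of having emitted the first j+1 targets within frames 0..t
    trellis = np.full((n_frames, n_tokens), -np.inf)
    trellis[0, 0] = log_probs[0, targets[0]]
    backpointer = np.zeros((n_frames, n_tokens), dtype=int)  # 0 = stay, 1 = advance

    for t in range(1, n_frames):
        for j in range(n_tokens):
            stay = trellis[t - 1, j]
            advance = trellis[t - 1, j - 1] if j > 0 else -np.inf
            if advance > stay:
                trellis[t, j], backpointer[t, j] = advance, 1
            else:
                trellis[t, j], backpointer[t, j] = stay, 0
            trellis[t, j] += log_probs[t, targets[j]]

    # Backtrack from the last frame / last character to find each character's first frame.
    starts = [0] * n_tokens
    j = n_tokens - 1
    for t in range(n_frames - 1, 0, -1):
        if backpointer[t, j] == 1:
            starts[j] = t
            j -= 1
    return starts
```

Word start/end times then follow by grouping character frames at the spaces of the transcript and converting frame indices to seconds with the acoustic model's frame rate.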

Other tricky things to double-check in the future:

1. Whisper returns punctuation marks, digits (e.g. "2", "11ème", "27/02/2003"), symbols ("€", "$"), emojis... For now they appear in the "text" field as Whisper gives them, but the "words" can currently differ (see the normalization sketch below).
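To make the issue concrete, here is a toy sketch of the kind of normalization needed before alignment. The symbol table and digit handling are illustrative only; a real pipeline needs a full number-to-words conversion and handling of dates, ordinals like "11ème", etc.

```python
import re

# Toy mappings for the example, not the normalization used in LinTO STT.
SYMBOLS = {"€": "euros", "$": "dollars", "%": "pour cent"}
UNITS = ["zéro", "un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf"]


def normalize_for_alignment(text: str) -> str:
    # Expand currency/percent symbols into words.
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, f" {word} ")
    # Spell out isolated single digits (a full implementation would cover all numbers).
    text = re.sub(r"\b(\d)\b", lambda m: UNITS[int(m.group(1))], text)
    # Drop remaining punctuation and collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_for_alignment("Ça coûte 2 € !"))  # -> "Ça coûte deux euros"
```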

Note: a repo that offers something very close (alignment with Whisper): https://github.com/m-bain/whisperX. The main thing missing there is text normalization (it will give bad results, or can simply fail, when Whisper recognizes things like "1.20 €" for "one euro twenty").