linto-ai / linto-stt

An automatic speech recognition API
GNU Affero General Public License v3.0

LinTO STT with Whisper #19

Closed Jeronymous closed 7 months ago

Jeronymous commented 1 year ago

Integration of Whisper in LinTO STT (openai whisper and faster_whisper).

The Whisper model to use is specified with the MODEL environment variable, which can be either:

Note that:

1. There are currently no streaming capabilities (I removed everything related to this).
2. Models are loaded lazily, to avoid deadlock problems (see the sketch after this list).
3. With GPU, to avoid CUDA initialization errors, celery is run with `--pool=solo` and the HTTP server uses gevent instead of gunicorn. Both imply that there won't be concurrent jobs...
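As a rough illustration of points 2 and 3, here is a minimal sketch of lazy loading driven by the MODEL environment variable. The helper name `get_model` and the DEVICE variable are assumptions for the example, not the actual LinTO STT code; faster_whisper would expose an equivalent `WhisperModel` class.

```python
import os

import whisper  # openai-whisper; faster_whisper offers a similar WhisperModel class

DEVICE = os.environ.get("DEVICE", "cpu")
_model = None  # nothing loaded at import time, so no CUDA context is created in the parent process


def get_model():
    """Load the Whisper model on first use only (lazy loading)."""
    global _model
    if _model is None:
        # MODEL can name a Whisper model size (e.g. "medium") or a local path
        _model = whisper.load_model(os.environ["MODEL"], device=DEVICE)
    return _model
```

With a GPU, the worker is then started so that a single process owns the CUDA context, e.g. something like `celery -A <app> worker --pool=solo` (command shown only as an illustration).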


DEPRECATED notes from the first implementation, which used wav2vec models to align the output and get word timestamps:

The tricky part is producing word alignments for what Whisper recognizes (it returns text for the discursive segments it detects, without any means of recovering word positions). This is currently done with a SpeechBrain wav2vec model (which returns character probabilities), hosted on the LinTO website.
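For context, here is a minimal sketch of the kind of forced alignment involved, assuming a matrix of per-frame character log-probabilities such as a wav2vec/CTC model produces. The function and the dynamic programming below are an illustration that ignores the CTC blank token; they are not the code actually used.

```python
import numpy as np


def forced_align(log_probs: np.ndarray, targets: list[int]) -> list[int]:
    """Return, for each target character, the frame index where it starts.

    log_probs: (n_frames, n_chars) per-frame character log-probabilities.
    targets:   character indices of the known transcript, in order.
    """
    n_frames, _ = log_probs.shape
    n_tokens = len(targets)
    assert n_frames >= n_tokens, "need at least one frame per character"

    # trellis[t, j] = best score of having emitted the first j+1 targets within frames 0..t
    trellis = np.full((n_frames, n_tokens), -np.inf)
    trellis[0, 0] = log_probs[0, targets[0]]
    backpointer = np.zeros((n_frames, n_tokens), dtype=int)  # 0 = stay, 1 = advance

    for t in range(1, n_frames):
        for j in range(n_tokens):
            stay = trellis[t - 1, j]
            advance = trellis[t - 1, j - 1] if j > 0 else -np.inf
            if advance > stay:
                trellis[t, j], backpointer[t, j] = advance, 1
            else:
                trellis[t, j], backpointer[t, j] = stay, 0
            trellis[t, j] += log_probs[t, targets[j]]

    # Backtrack from the last frame / last character to find each character's first frame.
    starts = [0] * n_tokens
    j = n_tokens - 1
    for t in range(n_frames - 1, 0, -1):
        if backpointer[t, j] == 1:
            starts[j] = t
            j -= 1
    return starts
```

Word start/end times then follow by grouping character frames at the spaces of the transcript and converting frame indices to seconds with the acoustic model's frame rate.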

Other tricky things to double-check in the future:

1. Whisper returns punctuation marks, digits (e.g. "2", "11ème", "27/02/2003"), symbols ("€", "$"), emojis... For now they appear in the "text" field as Whisper gives them, but the "words" can currently differ (see the normalization sketch below).
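To make the issue concrete, here is a toy sketch of the kind of normalization needed before alignment. The symbol table and digit handling are illustrative only; a real pipeline needs a full number-to-words conversion and handling of dates, ordinals like "11ème", etc.

```python
import re

# Toy mappings for the example, not the normalization used in LinTO STT.
SYMBOLS = {"€": "euros", "$": "dollars", "%": "pour cent"}
UNITS = ["zéro", "un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf"]


def normalize_for_alignment(text: str) -> str:
    # Expand currency/percent symbols into words.
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, f" {word} ")
    # Spell out isolated single digits (a full implementation would cover all numbers).
    text = re.sub(r"\b(\d)\b", lambda m: UNITS[int(m.group(1))], text)
    # Drop remaining punctuation and collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_for_alignment("Ça coûte 2 € !"))  # -> "Ça coûte deux euros"
```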

Note: a repo that offers something very close (alignment with Whisper): https://github.com/m-bain/whisperX. The main thing missing there is text normalization (it will give bad results, or can simply fail, when Whisper recognizes things like "1.20 €" for "one euro twenty").