insanely-fast-whisper backend

ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation

MIT License


insanely-fast-whisper backend #122

Closed marziye-A closed 1 month ago

marziye-A commented 1 month ago

Hi, thanks for your great work! I want to use the streaming mode with the insanely-fast-whisper backend. I am adding this backend, but I don't know what the "ts_words" function is. What is it for, and what does it take as input? Does the output of the Whisper backend need to have timestamps?

Can you please help me understand this function? Any help is really appreciated.

Gldkslfmsd commented 1 month ago

Hi, thanks. Why do you need insanely-fast-whisper? As far as I know, it uses faster-whisper, same as ours.

Which ts_words function do you mean? Can you give a link to the line where it is defined?

And yes, whisper-streaming needs word-level timestamps.

marziye-A commented 1 month ago

Thank you for your answer. I think it doesn't use the faster-whisper backend. It's based on Hugging Face Transformers and Flash Attention.

For the faster-whisper backend, it is at this line: https://github.com/ufal/whisper_streaming/blob/225f0383553a23f4ecc4c4751b73bc406a120c6c/whisper_online.py#L138

For the OpenAI Whisper backend, it is at this line: https://github.com/ufal/whisper_streaming/blob/225f0383553a23f4ecc4c4751b73bc406a120c6c/whisper_online.py#L185

I want to implement this function for the insanely-fast-whisper backend.

Gldkslfmsd commented 1 month ago

Alright. ts_words is quite poorly documented here: https://github.com/ufal/whisper_streaming/blob/225f0383553a23f4ecc4c4751b73bc406a120c6c/whisper_online.py#L80 . It converts the object returned by the transcribe function into a format that is the same for all backends: a list of tuples (beg, end, word). beg and end are floats giving the start and end of the word in seconds from the beginning of the recording, and word is a string. In faster-whisper, word may be a subword. For example, "space-delimited" can arrive in two parts, " space" and "-delimited", and these should not be joined with a space: https://github.com/ufal/whisper_streaming/blob/225f0383553a23f4ecc4c4751b73bc406a120c6c/whisper_online.py#L31
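To make the expected format concrete, here is a minimal sketch of what a ts_words for a new backend could look like. It is not code from the repository. It assumes the backend returns a dict shaped like the Hugging Face Transformers speech recognition pipeline output with word-level timestamps, that is, a "chunks" list whose items have "text" and "timestamp" keys. The fallback for a missing end time and the join_words helper are hypothetical and only there for illustration.

```python
from typing import Any, Dict, List, Tuple

# Backend-independent word format: (beg, end, word), times in seconds.
Word = Tuple[float, float, str]


def ts_words(result: Dict[str, Any]) -> List[Word]:
    """Convert an assumed pipeline-style result into (beg, end, word) tuples.

    Assumption: result["chunks"] is a list of dicts such as
    {"text": " space", "timestamp": (0.0, 0.4)}.
    """
    words: List[Word] = []
    for chunk in result.get("chunks", []):
        beg, end = chunk["timestamp"]
        if beg is None:
            continue  # no usable start time, skip this item
        if end is None:
            end = beg  # hypothetical fallback when the end time is missing
        words.append((float(beg), float(end), chunk["text"]))
    return words


def join_words(words: List[Word], sep: str = "") -> str:
    """Join word strings; sep is empty when words carry their own spaces."""
    return sep.join(word for _, _, word in words)


# Hand-written example input, not real model output.
example = {
    "text": " space-delimited text",
    "chunks": [
        {"text": " space", "timestamp": (0.0, 0.4)},
        {"text": "-delimited", "timestamp": (0.4, 1.1)},
        {"text": " text", "timestamp": (1.1, None)},
    ],
}

words = ts_words(example)
```

With this input, join_words(words) gives " space-delimited text". Joining with a space instead would break the subword apart, which is why the separator has to match how the backend emits its words.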

Gldkslfmsd commented 1 month ago

> I think it doesn't use the faster-whisper backend. It's based on Hugging Face Transformers and Flash Attention.

OK. I think the speed of insanely-fast-whisper comes from using a lot of memory and batching. That is applicable only in offline mode, where you can chunk the whole long recording into small pieces and process them in parallel. In streaming mode, you can use batching as in #55 and #42. It should speed things up a little, but not by much.
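As a rough illustration of the offline chunking idea, the sketch below splits a recording of known length into fixed-size sample ranges and groups them into batches. It is not taken from insanely-fast-whisper or whisper_streaming. The function names, the chunk length, the overlap and the batch size are all hypothetical. It also shows why this only helps offline: every range must exist before the batches can be formed, while in streaming the future audio has not arrived yet.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chunk_ranges(n_samples: int, chunk_len: int, overlap: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges covering a recording of n_samples."""
    if chunk_len <= 0 or not 0 <= overlap < chunk_len:
        raise ValueError("need chunk_len > 0 and 0 <= overlap < chunk_len")
    step = chunk_len - overlap
    ranges: List[Tuple[int, int]] = []
    start = 0
    while start < n_samples:
        end = min(start + chunk_len, n_samples)
        ranges.append((start, end))
        if end == n_samples:
            break  # the last chunk reached the end of the recording
        start += step
    return ranges


def batches(items: Sequence[T], batch_size: int) -> List[List[T]]:
    """Group items into consecutive batches of at most batch_size."""
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


# Toy numbers: 10 samples, chunks of 4 samples, 1 sample of overlap.
ranges = chunk_ranges(10, 4, overlap=1)
grouped = batches(ranges, batch_size=2)
```

Each batch could then be passed to the model in one call. Merging the per-chunk results in the overlapping regions is a separate problem that this sketch does not cover.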

But anyway, feel free to try it and share your latency-quality test results compared to faster-whisper. Or make a PR and I may do the test.