Closed by marziye-A 1 month ago
Hi, thanks. Why do you need insanely-fast-whisper? As far as I know, it uses faster-whisper, the same as ours.
Which ts_words function do you mean? Can you give a link to the line where it is specified?
And yes, whisper-streaming needs word-level timestamps.
Thank you for your answer. I think it doesn't use the faster-whisper backend. It's based on Hugging Face transformers and flash attention.
It is on this line for the faster-whisper backend: https://github.com/ufal/whisper_streaming/blob/225f0383553a23f4ecc4c4751b73bc406a120c6c/whisper_online.py#L138
It is on this line for the OpenAI whisper backend: https://github.com/ufal/whisper_streaming/blob/225f0383553a23f4ecc4c4751b73bc406a120c6c/whisper_online.py#L185
And I want to implement this function for the insanely-fast-whisper backend.
Alright. ts_words is quite poorly documented here: https://github.com/ufal/whisper_streaming/blob/225f0383553a23f4ecc4c4751b73bc406a120c6c/whisper_online.py#L80 . It converts the object returned by the transcribe function into a format that is the same for all backends: a list of tuples (beg, end, word), where beg and end are floats giving the times, in seconds from the beginning of the recording, at which the word starts and ends, and word is a string. In faster-whisper, a word may be a subword. For example, "space-delimited" can come in two parts, " space" and "-delimited", and they should not be joined with a space: https://github.com/ufal/whisper_streaming/blob/225f0383553a23f4ecc4c4751b73bc406a120c6c/whisper_online.py#L31
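To make the (beg, end, word) format concrete, here is a minimal sketch of what a ts_words-style converter for a new backend could look like. It is not code from whisper_streaming. The function names `ts_words_from_chunks` and `join_words` are hypothetical, and the input shape (a dict with a "chunks" list of {"text", "timestamp"} entries) is an assumption modelled on what a Hugging Face speech-recognition pipeline is expected to return with word-level timestamps, so check it against the actual output of your backend.

```python
from typing import List, Tuple

# One word in the backend-independent format: (beg, end, word).
Word = Tuple[float, float, str]


def ts_words_from_chunks(result: dict) -> List[Word]:
    """Convert a word-timestamped result into [(beg, end, word), ...].

    Assumed (hypothetical) input shape:
        {"chunks": [{"text": " word", "timestamp": (beg, end)}, ...]}
    """
    words: List[Word] = []
    for chunk in result.get("chunks", []):
        beg, end = chunk["timestamp"]
        if beg is None:
            continue  # no usable start time, skip this entry
        if end is None:
            end = beg  # assumption: treat an open-ended word as zero-length
        words.append((float(beg), float(end), chunk["text"]))
    return words


def join_words(words: List[Word], sep: str = "") -> str:
    """Join word strings; sep="" keeps subwords like "-delimited" attached."""
    return sep.join(word for _, _, word in words)


# Toy input with made-up timestamps, only to show the shapes involved.
example = {
    "chunks": [
        {"text": " space", "timestamp": (0.0, 0.4)},
        {"text": "-delimited", "timestamp": (0.4, 1.1)},
        {"text": " text", "timestamp": (1.1, 1.5)},
    ]
}
words = ts_words_from_chunks(example)
print(words)
print(join_words(words))
```

Joining with an empty separator works here because each piece carries its own leading space. A backend that returns words without leading spaces would need `sep=" "` instead.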
> I think it doesn't use the faster-whisper backend. It's based on Hugging Face transformers and flash attention.
OK. I think the speed of insanely-fast-whisper comes from using a lot of memory and batching. That is applicable only to offline mode, where you can chunk the whole long recording into small pieces and process them in parallel. In streaming mode, you can use batching as in #55 and #42. It should speed things up a little, but not too much.
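As a rough illustration of the offline chunking idea, the sketch below cuts a recording into fixed-length pieces and shifts per-chunk word timestamps back to recording time. It is not how insanely-fast-whisper works internally. The helper names `chunk_audio` and `shift_words` are hypothetical, and cutting at fixed positions is a simplifying assumption: it can split a word in half, which is why real pipelines typically overlap chunks.

```python
from typing import List, Sequence, Tuple

Word = Tuple[float, float, str]


def chunk_audio(
    samples: Sequence[float], sample_rate: int, chunk_seconds: float
) -> List[Tuple[float, Sequence[float]]]:
    """Split samples into consecutive chunks, each paired with its offset in seconds."""
    step = int(sample_rate * chunk_seconds)
    if step <= 0:
        raise ValueError("chunk_seconds is too small for this sample rate")
    return [
        (start / sample_rate, samples[start:start + step])
        for start in range(0, len(samples), step)
    ]


def shift_words(words: List[Word], offset: float) -> List[Word]:
    """Move chunk-relative (beg, end, word) tuples to recording-relative time."""
    return [(beg + offset, end + offset, word) for beg, end, word in words]


# Toy data: 10 samples at 4 Hz, cut into 1-second chunks.
chunks = chunk_audio(list(range(10)), sample_rate=4, chunk_seconds=1.0)
for offset, piece in chunks:
    print(offset, piece)

# A word found 0.0-0.5 s into the chunk that starts at 2.0 s.
print(shift_words([(0.0, 0.5, " hi")], offset=2.0))
```

All chunks are available at once, so they can be sent to the model as one batch. In streaming mode the later chunks do not exist yet, which limits how much batching can help.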
But anyway, feel free to try it and share your latency-quality test results compared to faster-whisper. Or make a PR and I may do the test.
Hi, thanks for your great work! I want to use streaming mode with an insanely-fast-whisper backend. I am adding this backend, but I don't understand the "ts_words" function. What is its purpose, and what does it take as input? Does the output of the whisper backend need to have timestamps?
Can you please help me understand this function? Any help is really appreciated.