ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation
MIT License

Unable to get sentences, only segments #109

Closed. Denis-Kazakov closed this issue 1 month ago.

Denis-Kazakov commented 1 month ago

I use the whisper-timestamped backend because I could not get faster-whisper working. I also use whisper_streaming from Python code and simulate the online mode by processing 1-second pieces of audio in a loop. Despite setting buffer_trimming=("sentence", 15), I still get only segments.
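
For context, this is roughly my simulation loop (a minimal sketch; the class and function names are from whisper_online.py, while the file name and chunk size are just my setup):

```python
from whisper_online import WhisperTimestampedASR, OnlineASRProcessor, create_tokenizer, load_audio

SAMPLING_RATE = 16000  # whisper_streaming works with 16 kHz float32 audio

asr = WhisperTimestampedASR("en", "small")
online = OnlineASRProcessor(asr, tokenizer=create_tokenizer("en"),
                            buffer_trimming=("sentence", 15))

audio = load_audio("speech.wav")  # whole file, fed to the processor in pieces
for i in range(0, len(audio), SAMPLING_RATE):  # 1-second pieces
    online.insert_audio_chunk(audio[i:i + SAMPLING_RATE])
    beg, end, text = online.process_iter()  # (None, None, "") when nothing is committed
    if text:
        print(beg, end, text)
print(online.finish())  # flush whatever remains in the buffer
```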

I give my explanation of the problem and a workaround below. I am a language researcher, not a coder, so I may be missing something.

Sentence tokenization is done in the chunk_completed_sentence function. The variable holding the list of sentences, sents, is not returned by the function and is not exposed anywhere; it is overwritten at each iteration, and that's it.

My workaround was to add a new attribute to the OnlineASRProcessor class: self.sentences = []. If sents is longer than self.sentences, the new items (except the last two, which are still part of the buffer) are appended to self.sentences and returned to the user, as sketched below.
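
Not a polished solution, just a sketch of the idea written as a subclass instead of an in-place edit (chunk_completed_sentence, words_to_sentences and commited are names from whisper_online.py; the rest is mine, and the details may need adjusting):

```python
from whisper_online import OnlineASRProcessor

class SentenceCollectingProcessor(OnlineASRProcessor):
    """Collects finished sentences as a side effect of sentence-based buffer trimming."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.sentences = []  # (beg, end, text) triples of completed sentences

    def chunk_completed_sentence(self):
        # the same tokenization call the base method makes on the committed words
        sents = self.words_to_sentences(self.commited)  # "commited": spelling as in the source
        if len(sents) > 2:
            # everything except the last two sentences is final;
            # the tail may still be revised, so it stays in the buffer
            self.sentences.extend(sents[len(self.sentences):-2])
        super().chunk_completed_sentence()
```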

There may be better solutions, but it works, at least with the small Whisper model, though I still have stability problems with the medium model.

P.S. I also think lines 422-423, `while len(sents) > 2: sents.pop(0)`, could be slightly optimized to avoid iteration: `if len(sents) > 2: sents = sents[-2:]`. But in fact these two lines are not needed at all, because chunk_at = sents[-2][1] will have the same value without them.
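
A standalone toy example of that equivalence (dummy (beg, end, text) triples, not real transcript data):

```python
sents = [(0.0, 1.0, "A."), (1.0, 2.5, "B."), (2.5, 4.0, "C."), (4.0, 5.0, "D.")]

# current code: drop leading sentences until two remain
trimmed = list(sents)
while len(trimmed) > 2:
    trimmed.pop(0)

# sents[-2] is the same element whether or not the list was trimmed first
assert trimmed[-2][1] == sents[-2][1] == 4.0
```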

Denis-Kazakov commented 1 month ago

The offline mode does not produce sentences either.

Denis-Kazakov commented 1 month ago

A workable workaround: output the segments into an additional, external buffer and split complete sentences off from that buffer.
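
For anyone interested, a minimal sketch of that (the regex is a naive stand-in for a real sentence splitter):

```python
import re

class SentenceBuffer:
    """Accumulates committed text and yields only complete sentences."""

    def __init__(self):
        self.buffer = ""

    def push(self, text):
        """Add newly committed text; return the sentences completed so far."""
        self.buffer += text
        # naive split: a sentence ends with . ! or ? followed by whitespace
        parts = re.split(r"(?<=[.!?])\s+", self.buffer)
        self.buffer = parts[-1]  # the last piece may be an unfinished sentence
        return parts[:-1]

buf = SentenceBuffer()
print(buf.push("Hello world. This is a test. And this one is"))
# ['Hello world.', 'This is a test.']
print(buf.push(" still going. Done."))
# ['And this one is still going.']  ("Done." stays buffered until more text arrives)
```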

Gldkslfmsd commented 1 month ago

Hi, getting sentences is not a core feature of real-time speech translation; it is not implemented on purpose, because not all applications require it. I recommend postprocessing whisper-streaming's output with a sentence splitter tool. E.g. this could work well: https://github.com/Helsinki-NLP/opus-fast-mosestokenizer
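
Something along these lines, for example (shown with the classic mosestokenizer package, whose API I know offhand; the opus-fast-mosestokenizer fork linked above may differ in the exact import and calls):

```python
from mosestokenizer import MosesSentenceSplitter

# accumulate whisper-streaming's committed text, then split it into sentences
with MosesSentenceSplitter("en") as split_sentences:
    transcript = "Hello world. How are you? I am fine."
    for sentence in split_sentences([transcript]):
        print(sentence)
# Hello world.
# How are you?
# I am fine.
```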

Gldkslfmsd commented 1 month ago

Moreover, the buffer_trimming=("sentence", 15) option is kept only for backward compatibility. I implemented it first, but then realized that trimming the buffer on segments gives better quality and latency (on English, German and Czech). So sentences are not available with segment trimming; you would need to postprocess the output.
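
In other words, the two modes are selected like this (a sketch; the default is as in whisper_online.py, and sentence trimming additionally needs a tokenizer):

```python
from whisper_online import FasterWhisperASR, OnlineASRProcessor, create_tokenizer

asr = FasterWhisperASR("en", "small")

# default and recommended: trim the audio buffer at completed segments
online = OnlineASRProcessor(asr, buffer_trimming=("segment", 15))

# legacy: trim at sentence ends; kept for backward compatibility
online = OnlineASRProcessor(asr, tokenizer=create_tokenizer("en"),
                            buffer_trimming=("sentence", 15))
```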

Otherwise I agree that the lines https://github.com/ufal/whisper_streaming/blob/38bab18afebd65d84eea401530e4f79d9ea77d36/whisper_online.py#L422C1-L423C25 may be ugly, but let's keep them for now for simplicity.

Denis-Kazakov commented 1 month ago

Thank you! I understand the logic now. Yes, postprocessing is a viable option, and Moses is a lighter-weight and better tokenizer for this task than NLTK, which I used before.