Closed: Denis-Kazakov closed this issue 1 month ago.
The offline mode does not produce sentences either.
A workable workaround: output segments into a separate, external buffer and split off full sentences from that buffer.
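That external-buffer approach can be sketched as follows. This is a minimal illustration with a crude regex splitter; `SentenceBuffer` and `feed` are hypothetical names, not part of whisper_streaming, and a real splitter (e.g. mosestokenizer or NLTK) handles abbreviations far better:

```python
import re

# End-of-sentence punctuation followed by whitespace: a crude heuristic,
# good enough to illustrate the buffering idea.
_EOS = re.compile(r'(?<=[.!?])\s+')

class SentenceBuffer:
    """Accumulates streamed segment texts and yields only full sentences."""

    def __init__(self):
        self._buf = ""

    def feed(self, segment_text):
        """Append a new segment and return any complete sentences."""
        self._buf += segment_text
        parts = _EOS.split(self._buf)
        # The last part may be an unfinished sentence; keep it buffered.
        self._buf = parts[-1]
        return parts[:-1]

buf = SentenceBuffer()
out = []
for seg in ["Hello there. How are", " you today? I am", " fine."]:
    out.extend(buf.feed(seg))
print(out)  # → ['Hello there.', 'How are you today?']
```

Note that the final sentence stays buffered until further text (or an explicit flush at end of stream) confirms it is complete.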
Hi, getting sentences is not a core feature of real-time speech translation; it is not implemented on purpose, because not all applications require it. I recommend postprocessing whisper-streaming's outputs with a sentence splitter tool. For example, this could work well: https://github.com/Helsinki-NLP/opus-fast-mosestokenizer
Moreover, the `buffer_trimming=("sentence", 15)` option is kept only for backward compatibility. I implemented it initially, but then I realized that buffer trimming on segments gives better quality and latency (on English, German, and Czech). So getting sentences from segment trimming is not available; you would need to postprocess the output.
Otherwise I agree that the lines https://github.com/ufal/whisper_streaming/blob/38bab18afebd65d84eea401530e4f79d9ea77d36/whisper_online.py#L422C1-L423C25 may be ugly, but let's keep them for now for simplicity.
Thank you! I understand the logic now. Yes, postprocessing is a viable option, and Moses is a more lightweight and better tokenizer for this task than NLTK, which I used before.
I use the whisper-timestamped backend because I could not get faster-whisper working. I also use whisper_streaming from Python code and simulate the online mode by processing 1-second pieces of audio in a loop. Despite setting `buffer_trimming=("sentence", 15)`, I still get segments only. I give my explanation of the problem and a workaround below. I am a language researcher, not a coder, so I may be missing something.
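That simulation loop looks roughly like the sketch below. The real `OnlineASRProcessor` from whisper_online.py is replaced with a dummy here so the snippet is self-contained; the method names follow the repository, but everything else is illustrative:

```python
SAMPLE_RATE = 16000

# Dummy stand-in for whisper_streaming's OnlineASRProcessor, only so this
# loop is runnable on its own; the real class lives in whisper_online.py.
class DummyOnlineASRProcessor:
    def __init__(self):
        self.seconds_received = 0.0

    def insert_audio_chunk(self, audio):
        self.seconds_received += len(audio) / SAMPLE_RATE

    def process_iter(self):
        # The real method returns newly committed (beg, end, text) output.
        return (None, None, "")

audio = [0.0] * (SAMPLE_RATE * 5)  # 5 seconds of placeholder silence

online = DummyOnlineASRProcessor()
# Simulate online mode: feed 1-second pieces of audio in a cycle.
for i in range(0, len(audio), SAMPLE_RATE):
    online.insert_audio_chunk(audio[i:i + SAMPLE_RATE])
    beg, end, text = online.process_iter()

print(online.seconds_received)  # → 5.0
```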
Sentence tokenization is done in the `chunk_completed_sentence` function. The variable holding the list of sentences, `sents`, is not returned by the function and is not used anywhere else: it is overwritten at each iteration, and that's it.

My workaround was to add a new attribute to the `OnlineASRProcessor` class: `self.sentences = []`. If `sents` is longer than `self.sentences`, the new items (except the last two, which are still part of the buffer) are added to `self.sentences` and returned to the user. There may be better solutions, but this one works, at least with the small Whisper model, though I still have stability problems with the medium model.
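The idea can be sketched independently of the class. This is an illustrative helper (the name `take_new_sentences` and the tracking by count are my own choices, not the issue author's exact code); `sents` items are assumed to be `(beg, end, text)` tuples, consistent with `chunk_at = sents[-2][1]` below:

```python
def take_new_sentences(sents, emitted_count):
    """Return sentences not yet emitted, holding back the last two,
    which may still change while they sit in the audio buffer."""
    completed = sents[:-2]           # the last two stay in the buffer
    new = completed[emitted_count:]  # skip what was already returned
    return new, emitted_count + len(new)

# Simulated growth of `sents` across iterations.
snapshots = [
    [(0.0, 1.2, "Hello.")],
    [(0.0, 1.2, "Hello."), (1.2, 2.5, "How are you?")],
    [(0.0, 1.2, "Hello."), (1.2, 2.5, "How are you?"), (2.5, 4.0, "I am fine.")],
]

emitted = 0
history = []
for sents in snapshots:
    new, emitted = take_new_sentences(sents, emitted)
    history.extend(new)

print([text for _, _, text in history])  # → ['Hello.']
```

Only the first sentence is released, because the last two are always considered provisional; they would be emitted on a final flush at end of stream.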
P.S. I also think lines 422-423,

`while len(sents) > 2: sents.pop(0)`

could be slightly optimized to avoid iteration: `if len(sents) > 2: sents = sents[-2:]`. But in fact these two lines are not needed at all, because `chunk_at = sents[-2][1]` will have the same value without them.