k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker diarization, and VAD using next-gen Kaldi with onnxruntime, without an Internet connection. Supports embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

Batch whisper inference #1525

Open thewh1teagle opened 4 days ago

thewh1teagle commented 4 days ago

The Whisper model has a 30-second input limitation. Can you integrate batch inference into sherpa? I would like to use it together with diarization.

I'm still not sure exactly how to batch it, but I have an idea: use silero-vad, aggregate shorter segments into 30-second windows, and add silence between them. Then, using word timestamps, estimate where the silence was added and reconstruct each segment's text (see the sketch below).

https://github.com/thewh1teagle/loud.cpp/issues/11
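
A minimal sketch of that pre/post-processing idea, assuming `segments` is a list of float32 waveforms produced by silero-vad and that the transcription step (left abstract here, since the proposal is about packing and unpacking rather than a specific ASR call) returns word-level `(text, start_time)` pairs:

```python
import numpy as np

SAMPLE_RATE = 16000
WINDOW_SECONDS = 30.0
GAP_SECONDS = 0.5  # silence inserted between packed segments


def pack_segments(segments):
    """Concatenate short VAD segments into one <=30 s buffer with gaps.

    Returns the packed waveform and the offset (in seconds) at which each
    segment starts inside it. A full implementation would open a new
    window whenever the 30 s budget is exceeded.
    """
    gap = np.zeros(int(GAP_SECONDS * SAMPLE_RATE), dtype=np.float32)
    chunks, offsets, cursor = [], [], 0.0
    for samples in segments:
        dur = len(samples) / SAMPLE_RATE
        assert cursor + dur <= WINDOW_SECONDS, "start a new window here"
        offsets.append(cursor)
        chunks.extend([samples, gap])
        cursor += dur + GAP_SECONDS
    return np.concatenate(chunks), offsets


def unpack_words(words, segments, offsets):
    """Assign transcribed words back to their source segments.

    A word belongs to the segment whose packed span contains its start
    time; words that fall inside an inserted gap are attached to the
    preceding segment.
    """
    texts = [[] for _ in segments]
    for text, start in words:
        for i, offset in enumerate(offsets):
            dur = len(segments[i]) / SAMPLE_RATE
            if offset <= start < offset + dur + GAP_SECONDS:
                texts[i].append(text)
                break
    return [" ".join(t) for t in texts]
```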

thewh1teagle commented 3 days ago


https://github.com/m-bain/whisperX?tab=readme-ov-file#whisperx

csukuangfj commented 1 day ago

If you are using CPU, it won't make much difference in speed.

thewh1teagle commented 1 day ago

> If you are using CPU, it won't make much difference in speed.

If we process speaker segments of 5 seconds one at a time, Whisper will still pad and process each as 30 seconds, no? Also, GPU support is very important with Whisper because it's a heavy model, and there batching makes a big difference.
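
As a rough, idealized illustration of that overhead (assumed numbers, ignoring the inserted silence): packing 5-second segments into 30-second windows cuts the number of padded encoder passes by about a factor of six.

```python
SEGMENT_SECONDS = 5
WINDOW_SECONDS = 30
num_segments = 720  # e.g. one hour of speech cut into 5 s pieces

naive_passes = num_segments                     # one padded 30 s pass each
per_window = WINDOW_SECONDS // SEGMENT_SECONDS  # ~6 segments per window
packed_passes = -(-num_segments // per_window)  # ceiling division

print(naive_passes, packed_passes)  # 720 vs. 120 encoder passes
```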

csukuangfj commented 1 day ago

> If we process speaker segments of 5 seconds one at a time, Whisper will still pad and process each as 30 seconds, no?

I suggest that you have a look at the Moonshine models. They do not require padding.
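
For reference, a sketch of decoding variable-length segments in one batch with Moonshine through the sherpa-onnx Python API. The model file names follow the released sherpa-onnx-moonshine-tiny-en-int8 package, and the exact loader parameters should be checked against the docs linked at the top of this page:

```python
import sherpa_onnx

recognizer = sherpa_onnx.OfflineRecognizer.from_moonshine(
    preprocessor="./sherpa-onnx-moonshine-tiny-en-int8/preprocess.onnx",
    encoder="./sherpa-onnx-moonshine-tiny-en-int8/encode.int8.onnx",
    uncached_decoder="./sherpa-onnx-moonshine-tiny-en-int8/uncached_decode.int8.onnx",
    cached_decoder="./sherpa-onnx-moonshine-tiny-en-int8/cached_decode.int8.onnx",
    tokens="./sherpa-onnx-moonshine-tiny-en-int8/tokens.txt",
    num_threads=2,
)


def transcribe_segments(segments, sample_rate=16000):
    """Decode a list of variable-length float32 waveforms in one batch."""
    streams = []
    for samples in segments:
        s = recognizer.create_stream()
        s.accept_waveform(sample_rate, samples)
        streams.append(s)
    recognizer.decode_streams(streams)  # batch decode, no 30 s padding
    return [s.result.text for s in streams]
```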

thewh1teagle commented 1 day ago

> I suggest that you have a look at the Moonshine models. They do not require padding.

Unfortunately, Moonshine supports only English.