niedev / RTranslator

Open source real-time translation app for Android that runs locally
Apache License 2.0

Where are the Whisper models defined? #25

Closed soupslurpr closed 3 months ago

soupslurpr commented 3 months ago

Hi, this app is pretty cool, nice job! I was amazed by its speed even though I read it's using the small Whisper model. For this reason I wanted to explore switching to onnxruntime for running Whisper in my app Transcribro, to see if I can move to a bigger model while keeping the same speed (I'm currently using tiny q8_0 with whisper.cpp). However, I couldn't find the code that uses the Whisper model, or figure out how to run a Whisper model in onnxruntime. Could you point me to an example or to where this app uses the Whisper model? Thanks!

JingziC commented 3 months ago

I suppose this app uses ONNX Runtime by importing ai.onnxruntime in Java. The code that runs Whisper through onnxruntime is probably Recognizer.java.
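
If it helps, the basic pattern with the ai.onnxruntime Java API looks something like this (a minimal sketch: the model path and the "mel" input name are placeholders, not taken from this repo):

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.nio.FloatBuffer;
import java.util.Collections;

public class WhisperEncoderExample {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // Placeholder path; on Android the model file would be unpacked to app storage first.
        try (OrtSession session = env.createSession("Whisper_encoder.onnx",
                new OrtSession.SessionOptions())) {
            // Whisper's standard input: 80 mel bins x 3000 frames (30 s of audio).
            long[] shape = {1, 80, 3000};
            FloatBuffer mel = FloatBuffer.allocate(80 * 3000); // all zeros = silence
            try (OnnxTensor input = OnnxTensor.createTensor(env, mel, shape);
                 OrtSession.Result result = session.run(
                         Collections.singletonMap("mel", input))) {
                // First output = encoder hidden states, e.g. [1, 1500, 768] for small.
                float[][][] hidden = (float[][][]) result.get(0).getValue();
                System.out.println("Encoded frames: " + hidden[0].length);
            }
        }
    }
}
```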

niedev commented 3 months ago

@soupslurpr Thanks for the kind words, I like your project too. As @JingziC said, Whisper's inference logic is all inside the Recognizer class.
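
At a high level, a Recognizer-style decoding loop looks something like the simplified sketch below (the input names "tokens" and "encoder_output" and the special token IDs are assumptions, and it ignores the KV-cache inputs that the real decoder uses):

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GreedyDecodeSketch {
    // Illustrative token IDs; real values depend on the Whisper tokenizer/config.
    static final int SOT = 50258; // start-of-transcript (assumed)
    static final int EOT = 50257; // end-of-transcript (assumed)

    static List<Integer> greedyDecode(OrtEnvironment env, OrtSession decoder,
                                      OnnxTensor encoderOutput) throws OrtException {
        List<Integer> tokens = new ArrayList<>();
        tokens.add(SOT);
        for (int step = 0; step < 224; step++) { // Whisper's usual cap per 30 s window
            long[][] ids = {tokens.stream().mapToLong(Integer::longValue).toArray()};
            Map<String, OnnxTensor> inputs = new HashMap<>();
            try (OnnxTensor tokenTensor = OnnxTensor.createTensor(env, ids)) {
                inputs.put("tokens", tokenTensor);           // assumed input name
                inputs.put("encoder_output", encoderOutput); // assumed input name
                try (OrtSession.Result out = decoder.run(inputs)) {
                    float[][][] logits = (float[][][]) out.get(0).getValue();
                    float[] last = logits[0][logits[0].length - 1];
                    int next = 0; // argmax over the vocabulary
                    for (int i = 1; i < last.length; i++) {
                        if (last[i] > last[next]) next = i;
                    }
                    if (next == EOT) break;
                    tokens.add(next);
                }
            }
        }
        return tokens;
    }
}
```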

soupslurpr commented 3 months ago

@niedev Okay, I see, but where did you get the Whisper models in onnx format, or how do you convert them?

niedev commented 3 months ago

To get the Whisper models you can just download them from the RTranslator 2.0 release (all the models whose names start with "Whisper_").

If you want to convert them yourself it is complicated, because I used Intel's quantized encoder and decoder (Whisper_encoder.onnx and Whisper_decoder.onnx). Then, from Whisper converted from PyTorch to onnx, I extracted the component that generates the encoder's KV cache (Whisper_cache_initializer.onnx). Finally, I converted Whisper to onnx with Microsoft Olive, and from that model I extracted the components that generate the log-mel spectrogram (Whisper_initializer.onnx) and the detokenizer (Whisper_detokenizer.onnx).
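
Concretely, the first three components chain together roughly like this (a sketch, not RTranslator's actual code: it assumes each of these graphs takes a single input, and it reads the input names from the session rather than hard-coding them):

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.util.Collections;

public class WhisperPipelineSketch {
    public static void main(String[] args) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
        try (OrtSession initializer = env.createSession("Whisper_initializer.onnx", opts);
             OrtSession encoder = env.createSession("Whisper_encoder.onnx", opts);
             OrtSession cacheInit = env.createSession("Whisper_cache_initializer.onnx", opts)) {

            float[][] audio = new float[1][16000 * 30]; // 30 s of 16 kHz PCM (silence here)
            try (OnnxTensor audioTensor = OnnxTensor.createTensor(env, audio);
                 // 1) raw audio -> log-mel spectrogram
                 OrtSession.Result mel = initializer.run(Collections.singletonMap(
                         initializer.getInputNames().iterator().next(), audioTensor));
                 // 2) log-mel -> encoder hidden states
                 OrtSession.Result hidden = encoder.run(Collections.singletonMap(
                         encoder.getInputNames().iterator().next(), mel.get(0)));
                 // 3) hidden states -> the decoder's cross-attention KV cache
                 OrtSession.Result kvCache = cacheInit.run(Collections.singletonMap(
                         cacheInit.getInputNames().iterator().next(), hidden.get(0)))) {
                // 4) Whisper_decoder.onnx is then run autoregressively with the cache,
                // 5) and Whisper_detokenizer.onnx turns the token IDs into text.
                System.out.println("Produced " + kvCache.size() + " cache tensors");
            }
        }
    }
}
```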

I could have just used the single .onnx model generated by Olive directly, but that model consumes 1.3GB of RAM, whereas using all these components separately consumes:

But if you just need Whisper small, you can simply use the models from the RTranslator release I linked above.

soupslurpr commented 3 months ago

For now I'll wait for Whisper to get an official example in onnxruntime, as I want to be able to easily use other sizes or finetunes if needed. Thanks for the help though!