vitoplantamura / OnnxStream

Lightweight inference library for ONNX files, written in C++. It can run Stable Diffusion XL 1.0 on a Raspberry Pi Zero 2 (or in 298MB of RAM), but also Mistral 7B on desktops and servers. ARM, x86, WASM and RISC-V supported. Accelerated by XNNPACK.
https://yolo.vitoplantamura.com/

[feature request] Whisper with openblas #52

Open bil-ash opened 9 months ago

bil-ash commented 9 months ago

First of all, thanks for this cool project. Since you have started adding support for models other than Stable Diffusion, please also add support for Whisper with W8A8 quantization. Also, it seems XNNPACK is aimed at speeding up float operations: does that mean XNNPACK is not required for W8A8 inference? Finally, consider adding OpenBLAS as a drop-in replacement for cuBLAS, so that GPU acceleration can also be used on Intel and AMD CPUs with integrated graphics.

vitoplantamura commented 8 months ago

hi,

sorry for the late reply.

XNNPACK provides a set of operators for quantized operations (including 8-bit operations) as well. It may seem counterintuitive, but writing "fast" 8-bit operators is more complex than writing fast float operators.
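To make that concrete, here is a minimal sketch (plain C++; illustrative only, not OnnxStream or XNNPACK code) of what a W8A8-style kernel has to do on top of a plain float loop: integer accumulation plus a careful requantization step.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Affine quantization: real_value = scale * (q - zero_point).
struct QParams { float scale; int32_t zero_point; };

// int8 x int8 dot product with int32 accumulation, followed by
// requantization of the result into the output's quantized domain.
int8_t qdot(const int8_t* a, const QParams& qa,
            const int8_t* w, const QParams& qw,
            size_t n, const QParams& qout)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t(a[i]) - qa.zero_point) * (int32_t(w[i]) - qw.zero_point);

    // Fast kernels turn this float multiplier into a fixed-point
    // multiply-and-shift; doing that exactly and quickly is a big part
    // of why 8-bit operators are harder to write than float ones.
    float multiplier = qa.scale * qw.scale / qout.scale;
    int32_t q = int32_t(std::lround(acc * multiplier)) + qout.zero_point;
    return int8_t(std::clamp(q, int32_t(-128), int32_t(127)));
}
```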

Regarding OpenBLAS, I don't know: I was actually thinking of hipBLAS (for AMD GPUs), since the cost in terms of code changes should be almost zero. However, I will take a look at OpenBLAS to understand how complex the integration would be.
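For context on why the cost should be almost zero: hipBLAS deliberately mirrors the cuBLAS API almost one-to-one, so a port is mostly renaming. A hypothetical shim (not OnnxStream's actual code; the hipBLAS header path varies between ROCm versions) could look like this:

```cpp
#ifdef USE_HIPBLAS
#include <hipblas.h>
typedef hipblasHandle_t blasHandle_t;
#define blasSgemm hipblasSgemm
#define BLAS_OP_N HIPBLAS_OP_N
#else
#include <cublas_v2.h>
typedef cublasHandle_t blasHandle_t;
#define blasSgemm cublasSgemm
#define BLAS_OP_N CUBLAS_OP_N
#endif

// C = A * B (column-major), identical call on either back end.
void sgemm(blasHandle_t h, int m, int n, int k,
           const float* A, const float* B, float* C)
{
    const float alpha = 1.0f, beta = 0.0f;
    blasSgemm(h, BLAS_OP_N, BLAS_OP_N, m, n, k,
              &alpha, A, m, B, k, &beta, C, m);
}
```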

I'm no Whisper expert: with projects like "insanely-fast-whisper" already out there, would it make sense? The idea of "Whisper on the Raspberry Pi Zero 2" could be more interesting, perhaps :-)

Thanks, Vito

bil-ash commented 8 months ago

Okay, got the point regarding XNNPACK.

I guess you should implement hipBLAS first, since it requires minimal changes. Just please allow overriding the GPU target via AMDGPU_TARGETS, like llama.cpp does. I have a machine with an AMD GPU that is not officially supported by hipBLAS: the override lets me run llama.cpp on it (with better performance than the CPU), but I can't run anything that doesn't support overriding the GPU, for example the OnnxStream LLM demo. That's why I was asking for OpenBLAS support.
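For reference, the override described above is, as far as I know, a combination of the AMDGPU_TARGETS CMake variable at build time and ROCm's HSA_OVERRIDE_GFX_VERSION environment variable at run time; a sketch based on llama.cpp's ROCm build notes (flag names may differ between llama.cpp versions):

```sh
# Build llama.cpp for a specific GPU ISA (here RDNA2 / gfx1030)...
cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030
cmake --build build
# ...then tell the ROCm runtime to treat an unsupported card as gfx1030.
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./build/bin/main -m model.gguf -p "hello"
```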

insanely-fast-whisper is aimed at servers with GPUs; you could aim for CPUs. Also, every CPU-inference Whisper quantization implementation I have seen so far does only weight quantization. You could do both weight and activation quantization, reducing both disk and memory usage.

vitoplantamura commented 8 months ago

ok, got it.

I will definitely look at how AMDGPU_TARGETS works in llama.cpp.

Regarding activation quantization, I suspect that it can't be done, but obviously I have to investigate. If the numerical ranges of the activations are too large, W8A8 quantization may produce results that are too imprecise or completely wrong. This might be why no one has done it yet :-)
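A tiny numeric illustration of the concern, with made-up values: a single outlier in an activation tensor stretches the per-tensor int8 scale so much that the ordinary values lose nearly all of their precision.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    // Typical activations near 1.0, plus one outlier at 100.0.
    const float acts[] = { 0.9f, 1.1f, -0.8f, 100.0f };
    const float scale = 100.0f / 127.0f; // per-tensor scale set by the outlier

    for (float a : acts) {
        int8_t q = (int8_t)std::lround(a / scale);
        printf("%8.3f -> q = %4d -> %8.3f\n", a, (int)q, q * scale);
    }
    // The three small values all land on q = +/-1, i.e. the quantization
    // step (~0.79) is as large as the values themselves.
    return 0;
}
```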

Vito