Any tips to speed up quantized Whisper inference on Android?

soupslurpr commented 10 months ago

Hello, running Q8_0 quantized Whisper on Android (Pixel 7) takes around 15 seconds for 5 seconds of audio. Is there any way to speed this up that I might not be aware of, or is it just because candle isn't as optimized as something like whisper.cpp yet? whisper.cpp took around 3 seconds or less if I remember correctly, although that was with a Q4_0 model. Thanks.

LaurentMazare commented 10 months ago

I'm not super familiar with how cross-compiling to Android works, or with which SIMD instructions and BLAS libraries are available on these platforms. Most likely the compiled binary is not benefiting from them, which would explain the slowness (on normal builds and wasm, we have some specific build setup and code to hopefully use SIMD instructions). I just added a CANDLE_DEQUANTIZE_ALL environment variable that forces the standard matmul to be used rather than the quantized one. Could you try running your tests with this set to 1, just in case?
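An Android app usually has no shell to export a variable from, so one way to try this is to set it from inside the process. The following is only a sketch: the helper names are hypothetical and not part of candle, and it assumes the variable is read when the quantized layers are created, so it has to be set before the model is loaded. The rule that an empty value or "0" means "off" is also an assumption.

```rust
use std::env;

/// Name of the environment variable mentioned in the comment above.
const DEQUANTIZE_ALL_VAR: &str = "CANDLE_DEQUANTIZE_ALL";

/// Hypothetical helper: turn the flag on for the current process.
/// Call it once at startup, before other threads exist and before the
/// model is loaded (assumption: the flag is read at model load time).
#[allow(unused_unsafe)]
pub fn enable_dequantize_all() {
    // `set_var` is an `unsafe fn` from the 2024 edition onwards; the block
    // is harmless (and merely redundant) on older editions.
    unsafe { env::set_var(DEQUANTIZE_ALL_VAR, "1") };
}

/// Hypothetical helper: report whether the flag looks enabled.
/// Assumption: an unset variable, an empty value, or "0" all mean "off".
pub fn dequantize_all_requested() -> bool {
    match env::var(DEQUANTIZE_ALL_VAR) {
        Ok(value) => !value.is_empty() && value != "0",
        Err(_) => false,
    }
}
```

Calling `enable_dequantize_all()` at the top of the native entry point and logging `dequantize_all_requested()` is a cheap way to confirm the setting actually reached the process.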

soupslurpr commented 10 months ago

Wow, enabling that resulted in it taking only around 4.5 seconds!

LaurentMazare commented 10 months ago

Interesting, thanks for reporting this back. There are multiple things at play here and I'll have to dig a bit deeper to understand what is going on.

It could be that the quantized matmul doesn't detect the SIMD instructions but the normal matmul does (unlikely).
It could be that Q8_0 is slower than Q4_0. It's supposed to be optimized, but maybe we've missed something.
The quantized matmul isn't as smart as the unquantized one when it comes to cache locality. In GPT-like architectures this is usually not an issue, but the Whisper encoding step might not have the same properties.
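For readers unfamiliar with what is being dequantized, here is a simplified sketch of a Q8_0-style block: 32 weights share one scale and are stored as signed 8-bit integers. It is illustrative only and is not candle's implementation. The type and function names are made up, and the scale is kept as `f32` for simplicity (ggml-style formats store it as a 16-bit float).

```rust
/// Number of weights that share one scale in a ggml-style block.
const BLOCK_SIZE: usize = 32;

/// Simplified Q8_0-style block (hypothetical type, f32 scale for clarity).
#[derive(Debug, Clone, PartialEq)]
pub struct BlockQ8 {
    pub scale: f32,
    pub quants: [i8; BLOCK_SIZE],
}

/// Quantize 32 floats: pick the scale so the largest magnitude maps to 127.
pub fn quantize_block(values: &[f32; BLOCK_SIZE]) -> BlockQ8 {
    let max_abs = values.iter().fold(0.0f32, |acc, v| acc.max(v.abs()));
    let scale = max_abs / 127.0;
    // Avoid dividing by zero for an all-zero block.
    let inverse = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut quants = [0i8; BLOCK_SIZE];
    for (q, v) in quants.iter_mut().zip(values.iter()) {
        *q = (v * inverse).round() as i8;
    }
    BlockQ8 { scale, quants }
}

/// Dequantize: every stored integer is multiplied by the block's scale.
pub fn dequantize_block(block: &BlockQ8) -> [f32; BLOCK_SIZE] {
    let mut out = [0.0f32; BLOCK_SIZE];
    for (o, q) in out.iter_mut().zip(block.quants.iter()) {
        *o = f32::from(*q) * block.scale;
    }
    out
}
```

A quantized matmul works on such blocks directly, while the dequantize-everything path expands them to floats once and then uses the ordinary matmul, which is the trade-off being compared in this thread.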

soupslurpr commented 10 months ago

Btw I was a bit off on the amount of time. I wasn't actually testing with 5 seconds of audio, but more like 2 seconds of speech followed by 3 seconds of silence. Also, the way I was recording it was broken. I just tried again with that fixed and it's still about the same difference. Just mentioning it in case you try it and get something slower than 4.5 seconds.

soupslurpr commented 10 months ago

Q4_0 is even slower (without CANDLE_DEQUANTIZE_ALL set to 1) so it can't be that. Is there anything more I can do to try and figure out the cause of the problem? I could provide the source code of my app if needed because I'm going to open source it anyways.

rbrus commented 6 months ago

@soupslurpr can you share how you got it running?

When I build with cargo, it always fails with errors such as:
error: instruction requires: fullfp16
error: could not compile gemm-f16 (lib) due to 11 previous errors

soupslurpr commented 6 months ago

@rbrus are you sure you are using the Android NDK to compile?

For example, in .cargo/config.toml I specified:

[target.aarch64-linux-android]
ar = "C:/Users/user/AppData/Local/Android/Sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/windows-x86_64/bin/llvm-ar.exe"
linker = "C:/Users/user/AppData/Local/Android/Sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/windows-x86_64/bin/aarch64-linux-android21-clang.cmd"

rbrus commented 6 months ago

Thanks @soupslurpr. The configuration seems to be working, but the build itself still fails. I am wondering if there is some issue with the fp16 build for Android?

soupslurpr commented 6 months ago

@rbrus what version of candle?

soupslurpr commented 6 months ago

Also, are you building for the aarch64-linux-android target?

soupslurpr commented 6 months ago

And note that the config I provided probably needs to be changed for your Windows username

rbrus commented 6 months ago

Yes, and it always fails with the same error:

`error: instruction requires: fullfp16
    --> /home/sus/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gemm-common-0.17.0/src/simd.rs:1940:18
     |
1940 |     "fmul {0:v}.8h, {1:v}.8h, {2:v}.8h",
     |      ^

note: instantiated into assembly here
    --> <inline asm>:1:2
  |
1 |     fmul v0.8h, v1.8h, v2.8h
  |     ^

error: could not compile gemm-f16 (lib) due to 11 previous errors
warning: build failed, waiting for other jobs to finish...`

I am running it on Ubuntu 22.04, recently upgraded to 23.10. The versions of Rust, cargo and candle are the most recent ones. I set it all up today.

soupslurpr commented 6 months ago

So you changed the config.toml to go to the NDK you have downloaded?

rbrus commented 6 months ago

Yes, exactly, to NDK 25.2.9519653.

You haven't had such an issue?

soupslurpr commented 6 months ago

No, maybe try NDK 26.1.10909125?

soupslurpr commented 6 months ago

Also perhaps try adding this to the [target.aarch64-linux-android]

rustflags = [ "-C", "target-feature=+fp16,+neon", ]

I think I needed this before and had the same error as you, but it isn't needed for me anymore.
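To check whether flags like these actually took effect, a build can report which target features it was compiled with and, on aarch64, which ones the CPU reports at run time. This is a minimal sketch with hypothetical function names. It only uses the standard library's `cfg!` and `is_aarch64_feature_detected!` macros and says nothing about what candle or gemm detect internally.

```rust
/// Features this binary was compiled with (fixed at build time, for
/// example through `-C target-feature=...` in rustflags).
pub fn compiled_simd_features() -> Vec<(&'static str, bool)> {
    vec![
        ("neon", cfg!(target_feature = "neon")),
        ("fp16", cfg!(target_feature = "fp16")),
    ]
}

/// Features the CPU reports at run time. Only meaningful on aarch64.
#[cfg(target_arch = "aarch64")]
pub fn runtime_simd_features() -> Vec<(&'static str, bool)> {
    vec![
        ("neon", std::arch::is_aarch64_feature_detected!("neon")),
        ("fp16", std::arch::is_aarch64_feature_detected!("fp16")),
    ]
}

/// On other architectures there is nothing to report.
#[cfg(not(target_arch = "aarch64"))]
pub fn runtime_simd_features() -> Vec<(&'static str, bool)> {
    Vec::new()
}
```

Logging both lists from the Android build shows whether fp16 and NEON were enabled at compile time and whether the device supports them.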

rbrus commented 6 months ago

@soupslurpr after changing the NDK it built! Huh, thanks for the help!

By any chance, do you have a project that runs inference with the rlib?

soupslurpr commented 6 months ago

@rbrus great!

I do have a project I'm working on for running whisper speech to text on Android using Candle, but I'm not working on it currently as the speed is still too slow.

soupslurpr commented 6 months ago

@LaurentMazare idk if this helps but have you seen https://developer.android.com/ndk/guides/cpu-arm-neon

Does candle use neon?

Edit: looks like it does. Maybe https://developer.android.com/ndk/guides/neuralnetworks can be implemented as it seems to be for machine learning libraries and accelerates them?

akashicMarga commented 4 months ago

@soupslurpr how did you build candle with Android as the target? I am getting an OpenSSL error. I am on a Mac and have set up the environment variables like below:

export AR="/Users/akashsingh/.NDK/arm64/bin/llvm-ar"
export CC="/Users/akashsingh/.NDK/arm64/bin/aarch64-linux-android-clang"

soupslurpr commented 4 months ago

@singhaki I don't compile OpenSSL because it's a pain. I think there is a feature to disable that, or it might actually be a feature in the hf_hub crate that disables networking.

akashicMarga commented 4 months ago

Can you provide the steps to compile for Android? I am trying to run phi-2 on an old Redmi Note 7 Pro Android device. We could put it in this discussion in case anyone else is interested: https://github.com/huggingface/candle/discussions/2081

soupslurpr commented 3 months ago

Just tested and this issue actually happens on x86_64 too. On Windows, the quantized Whisper is way slower (5x when measured using hyperfine, 10 seconds vs 2 seconds) than the unquantized one, i.e. the run with CANDLE_DEQUANTIZE_ALL set to 1, so it isn't Android specific.
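On a device where hyperfine is not available, the same comparison can be approximated in-process. This is a rough sketch with hypothetical helper names. It only averages wall-clock time over a few runs and does no warm-up or outlier handling.

```rust
use std::time::{Duration, Instant};

/// Run `work` `runs` times and return the mean wall-clock time per run.
pub fn mean_runtime<F: FnMut()>(runs: u32, mut work: F) -> Duration {
    assert!(runs > 0, "need at least one run");
    let start = Instant::now();
    for _ in 0..runs {
        work();
    }
    start.elapsed() / runs
}

/// How many times slower `slow` is compared to `fast`.
pub fn slowdown(slow: Duration, fast: Duration) -> f64 {
    slow.as_secs_f64() / fast.as_secs_f64()
}
```

With the figures reported above, `slowdown(Duration::from_secs(10), Duration::from_secs(2))` gives the same 5x ratio.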

bgergely0 commented 3 months ago

@singhaki you need to download openssl source, and set OPENSSL_DIR and OPENSSL_LIB_DIR to it to compile. Or at least, that's how I've done it.

huggingface / candle

Any tips to speed up quantized Whisper inference on Android? #1048