Closed — @cfasana closed this 4 weeks ago
Hi @cfasana, you need to put all the files from hexagon-v73 inside arm64-v8a, not in a separate folder. Hope that solves your issue.
Thanks for the reply. Unfortunately I already tried pasting the files from hexagon-v73 inside arm64-v8a, however, the issue remains the same. Indeed, the fp32 DLC model provided the correct transcriptions independently of whether I had the files in a separate folder or not. This is why I was thinking that there may be a problem with the quantized DLC model I generated.
Hi @cfasana, thanks for your update. Can you please check these points if possible?
Please let me know how it goes.
Thanks
Hi @SahinMjks, thanks for the support.
First of all, I am running the experiments in WSL2 Ubuntu 22.04, if this can be of any help. Note, however, that I also tried WSL2 Ubuntu 20.04 and experienced the same issue.
Concerning your questions:
I am using the latest release of SNPE: v2.20.0.240223 and Android NDK r21e when running the Python notebook
I ran the notebook whisper_notebook.ipynb, and these are the results on the Linux machine:
FP32 Model
W8A8 Model
W8A16 Model
It seems that on the Linux machine all the models are working correctly.
However, when placed in the Android app, only the FP32 model works: the W8A8 and W8A16 models do not.
I can share these two models with you if necessary.
I confirm that these 3 libraries are indeed present in the jniLibs/arm64-v8a folder
Finally, in the last cell of whisper_notebook.ipynb, I also noticed a reference to the function decoder_block_onnx, which cannot be found. This is not related to the issue, but I wanted to know whether it is just a typo and decoder_block_tflite should be used, or whether I am missing this function.
Thanks
Hi @cfasana, thanks a lot for the details.
Can you please use the commands below to create the W8A16 and W8A8 models? This ensures you are not reusing any cached model; at model-load time on your device, a cached version of the model will be created automatically.
snpe-dlc-quantize --input_dlc whisper_encoder_fp32.dlc --input_list list.txt --output_dlc whisper_tiny_encoder_w8a16.dlc --weights_bitwidth 8 --act_bitwidth 16
snpe-dlc-quantize --input_dlc whisper_encoder_fp32.dlc --input_list list.txt --output_dlc whisper_tiny_encoder_w8a8.dlc --weights_bitwidth 8 --act_bitwidth 8
Hope this will solve your issue, please let me know.
Thanks
Waiting for the user to confirm whether the issue is solved.
Hello @SahinMjks,
it seems like the issue was indeed a cached version of the model.
Using the commands above, the logits are no longer zeroed for both the W8A8
and W8A16
models, and the transcription works fine. Thanks for the support on this!
However, the inference time is around 1000 ms. Is that expected? According to the latest release of the AI Hub, Whisper Tiny should run in less than 100 ms. Is the DSP actually being used, or is there a part of the picture I'm not understanding?
To measure the inference time, I added a couple of lines to the code of MainActivity.java, before the comment line Running the Encoder Model on DSP and after the comment Inferencing the TFLite Model:
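The added lines were essentially a System.nanoTime() pair around the encoder call. A minimal, self-contained sketch of this kind of measurement (runEncoder() is a placeholder for the actual SNPE inference call; the sleep just simulates the workload):

```java
// Sketch of the timing measurement: a System.nanoTime() pair around the
// encoder call. runEncoder() stands in for the actual SNPE call in
// MainActivity.java; the sleep simulates ~50 ms of inference work.
public class InferenceTimer {
    static void runEncoder() throws InterruptedException {
        Thread.sleep(50); // placeholder for "Running the Encoder Model on DSP"
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime(); // taken just before the encoder call
        runEncoder();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000; // after "Inferencing the TFLite Model"
        System.out.println("Encoder inference took " + elapsedMs + " ms");
    }
}
```

Note that a wall-clock measurement like this also includes any JNI and buffer-copy overhead around the actual DSP execution, not just the kernel time.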
Is this way of measuring the inference time accurate enough?
To my understanding, 53 ms is only the whisper-tiny-encoder time. The total inference time should be encoder_time + decoder_time * token_count.
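As a rough illustration of that formula (only the 53 ms encoder figure comes from this thread; the per-token decoder time and token count below are hypothetical values chosen for the example):

```java
// Back-of-the-envelope estimate: total = encoder_time + decoder_time * token_count.
public class InferenceTimeEstimate {
    public static void main(String[] args) {
        double encoderMs = 53.0;         // measured encoder time from the thread
        double decoderMsPerToken = 20.0; // hypothetical per-token decoder time
        int tokenCount = 45;             // hypothetical number of generated tokens
        double totalMs = encoderMs + decoderMsPerToken * tokenCount;
        System.out.printf("Estimated total inference: %.0f ms%n", totalMs);
    }
}
```

With these illustrative numbers the estimate lands at 953 ms, in the same ballpark as the ~1000 ms observed, which is why the decoder loop, not the encoder, usually dominates.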
Please post any AI Hub questions in the AI Hub forums. AI Hub just runs the model on the HTP; here in the Android application, a few optimizations are still needed. We have a separate repository, ai-hub-apps, in progress for an optimal demonstration of AI Hub models.
I was trying to run the example provided for ASR using OpenAI Whisper. Following the steps reported here, I managed to generate the quantised Whisper Encoder in DLC format (whisper_tiny_encoder_w8a16.dlc). I downloaded the decoder from the provided link, and I pasted both models inside the folder qidk\Solutions\NLPSolution3-AutomaticSpeechRecognition-Whisper\Android_App_Whisper\app\src\main\ml. I placed the required libraries under jniLibs\arm64-v8a and copy-pasted the hexagon-v73 folder inside jniLibs. Below is my resulting folder structure:

Finally, inside the file native-lib_pre_post.cpp I modified the model name from whisper_tiny_encoder_w8a8.dlc to whisper_tiny_encoder_w8a16.dlc to match the one obtained from the assets generation process.

The problem I encountered is that when I run the Android application, the output is always the character "|", and the logits are an array of zeros:

If instead I use the fp32 DLC model, I can get the transcriptions.
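One simple way to detect the all-zeros symptom programmatically is a quick scan of the output buffer. A sketch (the buffer below is a hypothetical stand-in for the float array read back from the model output):

```java
// Detect the all-zero-logits symptom described above. outputLogits is a
// placeholder for the float[] read back from the encoder/decoder output
// buffer; here it is left zero-filled to reproduce the failing case.
public class LogitsCheck {
    static boolean allZero(float[] logits) {
        for (float v : logits) {
            if (v != 0f) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        float[] outputLogits = new float[51865]; // vocab-sized buffer, all zeros here
        System.out.println(allZero(outputLogits)
                ? "logits are all zeros"
                : "logits look valid");
    }
}
```

Logging this check right after inference makes it easy to tell a stale cached model (all zeros) apart from a merely inaccurate quantised one.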