quic / qidk

Zeroed logits with Whisper quantized model #18

Closed cfasana closed 4 weeks ago

cfasana commented 3 months ago

I was trying to run the example provided for ASR using OpenAI Whisper. Following the steps reported here, I managed to generate the quantized Whisper encoder in DLC format (whisper_tiny_encoder_w8a16.dlc).

I downloaded the decoder from the provided link and placed both models inside the folder qidk\Solutions\NLPSolution3-AutomaticSpeechRecognition-Whisper\Android_App_Whisper\app\src\main\ml. I placed the required libraries under jniLibs\arm64-v8a and copied the hexagon-v73 folder into jniLibs. My resulting folder structure is sketched below.
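Roughly, the layout described above, reconstructed from the text since the screenshot is not preserved (the decoder file name depends on what the provided link supplies):

```
app/src/main/
├── ml/
│   ├── whisper_tiny_encoder_w8a16.dlc
│   └── <decoder model from the provided link>
└── jniLibs/
    ├── arm64-v8a/      (required .so libraries)
    └── hexagon-v73/    (copied here as a separate folder)
```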

Finally, inside the file native-lib_pre_post.cpp I changed the model name from whisper_tiny_encoder_w8a8.dlc to whisper_tiny_encoder_w8a16.dlc to match the one produced by the asset-generation process.

The problem I encountered is that when I run the Android application, the output is always the character "|", and the logits are an array of zeros.

If instead I use the fp32 DLC model, the transcriptions are produced correctly.

SahinMjks commented 3 months ago

Hi @cfasana, you need to put all the files from hexagon-v73 inside arm64-v8a rather than in a separate folder. Hope that solves your issue.

cfasana commented 3 months ago

Thanks for the reply. Unfortunately, I had already tried copying the files from hexagon-v73 into arm64-v8a, but the issue remains the same. Indeed, the fp32 DLC model produced correct transcriptions regardless of whether the files were in a separate folder or not. This is why I suspect there may be a problem with the quantized DLC model I generated.

SahinMjks commented 3 months ago

Hi @cfasana, thanks for your update. Can you please check the following points?

  1. Please check that you're using the latest SNPE version (Qualcomm SNPE 2.20).
  2. Check the output of your whisper_encoder_w8a16.dlc and whisper_encoder_w8a8.dlc models on your Linux machine, so you can tell whether it is a model issue or an Android app issue.
  3. Please double-check that all the required files from hexagon-v73 and SNPE-ROOT/lib/aarch64-android are inside app/src/main/jniLibs/arm64-v8a (see the note below).
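(The screenshot listing the exact files is not preserved. For SNPE 2.x these are typically libSNPE.so and libSnpeHtpV73Stub.so from SNPE-ROOT/lib/aarch64-android, plus libSnpeHtpV73Skel.so from hexagon-v73; treat these names as an assumption rather than the original list.)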

Please let me know the update.

Thanks

cfasana commented 3 months ago

Hi @SahinMjks, thanks for the support.

First of all, I am running the experiments in WSL2 Ubuntu 22.04, in case that helps. Note, however, that I also tried WSL2 Ubuntu 20.04 and experienced the same issue.

Concerning your questions:

  1. I am using the latest release of SNPE (v2.20.0.240223) and Android NDK r21e when running the Python notebook.

  2. I ran the notebook whisper_notebook.ipynb; on the Linux machine, the FP32, W8A8, and W8A16 models all produce correct results (output screenshots for each model omitted). However, when placed in the Android app, only the FP32 model works: the W8A8 and W8A16 models do not. I can share these two models with you if necessary.

  3. I confirm that these three libraries are indeed present in the jniLibs/arm64-v8a folder.

Finally, in the last cell of whisper_notebook.ipynb, I also noticed a reference to the function decoder_block_onnx, which however cannot be found. This is not related to the issue, but I wanted to know whether it is just a typo and decoder_block_tflite should be used, or whether I am missing this function.

Thanks

SahinMjks commented 3 months ago

Hi @cfasana, thanks a lot for the details. Can you please use the commands below to create the w8a16 and w8a8 models? This will ensure you are not running a stale cached model: at model-load time, your device automatically creates a cached version of the model.

snpe-dlc-quantize --input_dlc whisper_encoder_fp32.dlc --input_list list.txt --output_dlc whisper_tiny_encoder_w8a16.dlc --weights_bitwidth 8 --act_bitwidth 16

snpe-dlc-quantize --input_dlc whisper_encoder_fp32.dlc --input_list list.txt --output_dlc whisper_tiny_encoder_w8a8.dlc --weights_bitwidth 8 --act_bitwidth 8

Hope this solves your issue; please let me know.

Thanks

quic-rneti commented 3 months ago

Waiting for the user to confirm whether the issue is solved or not.

cfasana commented 3 months ago

Hello @SahinMjks, it seems like the issue was indeed a cached version of the model. Using the commands above, the logits are no longer zeroed for both the W8A8 and W8A16 models, and the transcription works fine. Thanks for the support on this!

However, the inference time seems to be around 1000 ms. Is that correct? According to the latest release of AI Hub, Whisper Tiny should run in less than 100 ms. Is the DSP actually being used, or is there a part of the picture I'm not understanding?

To measure the inference time, I added a couple of lines to MainActivity.java, starting a timer before the comment line "Running the Encoder Model on DSP" and stopping it after the comment "Inferencing the TFLite Model" (see the sketch below). Is this way of measuring the inference time accurate enough?
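For reference, a minimal sketch of that kind of measurement (the exact lines from the screenshot are not preserved, so the variable names and log tag below are assumptions, not the original code):

```java
// Hypothetical reconstruction of the timing described above.
long inferenceStart = System.nanoTime();  // before "Running the Encoder Model on DSP"

// ... SNPE encoder inference on the DSP ...
// ... TFLite decoder inference loop ...

// after "Inferencing the TFLite Model"
long elapsedMs = (System.nanoTime() - inferenceStart) / 1_000_000;
android.util.Log.d("WhisperTiming", "End-to-end inference: " + elapsedMs + " ms");
```

System.nanoTime() is monotonic, so this gives a reasonable wall-clock measurement of the whole encoder-plus-decoder pass; note, though, that the first run also includes creation of the cached model mentioned above, so it is worth discarding a warm-up inference before timing.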

super100pig commented 3 months ago

To my understanding, the 53 ms shown in your measurement is only the Whisper Tiny encoder time. The total inference time should be encoder_time + decoder_time * token_count.
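For illustration, assuming a transcription of about 30 tokens (an assumption; the real count depends on the audio clip): with a 53 ms encoder pass, a ~1000 ms total would imply roughly (1000 - 53) / 30 ≈ 32 ms per decoder step, so the per-token decoder loop, not the encoder, dominates the end-to-end latency.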

quic-rneti commented 4 weeks ago

Please post any AI Hub questions in the AI Hub forums. AI Hub just runs the model on the HTP; here, in the Android application, a few optimizations are still needed. We have a separate repository, ai-hub-apps, in progress for an optimal demonstration of AI Hub models.