Closed escorciav closed 6 months ago
Hello Victor Escorcia Castillo -
Does QNN support variable input length tensors?
The current QNN release does not support variable input length tensors. We are making one end-to-end application with the Whisper ASR model. It will be published shortly. If you would like to see examples for any other model (or architecture), let us know. We will try our best to make it happen.
It'd be great to have access to all the demos showcased during the Snapdragon summit of 2023 :smile:
Feel free to tag me when you release them. Happy to cross-post in my social media to raise awareness of cool on-device ML resources :blush:
@quic-rneti Hi :)
We are making one End-to-End application with Whisper ASR model. It will be published shortly.
Any ETA on this? I'm quite interested, as I'm trying to convert Whisper to a usable .dlc file. I've tried to convert the ONNX, TFLite & TorchScript models using the SNPE scripts, without success, as they all either lack specific ops or don't support dynamic layer sizes.
Have a great day
We are working on this. You can expect the tutorials to be available sometime in the coming month (2 to 4 weeks lead time).
Hi @quic-rneti, thanks for the response :) Could you by any chance share a Whisper script which can be converted to .dlc before releasing the fully fledged app? Or the .dlc file alone?
Have a great day!
Hello @quic-rneti, Is the Whisper model published on Qualcomm's AI hub models repository the demo you were talking about?
From what I've seen, the provided code can only work on Android (as it's using TfLite, not QNN) and I was wondering:
Have a nice day :)
PS: @escorciav did you find another way to get Whisper working on QNN?
@ZodiacFRA I haven't tested Whisper, but please tag me (here on GH, or Twitter) if you do it.
BTW, if they provide TFLite models & you plan to use QNN, just follow the tutorial :wink: . Preferably, follow the docs of the QNN version that you're using. In my case, it's 2.19.0.x
=> opt/qcom/aistack/qnn/2.19.0.240124/docs/QNN/general/tutorial2.html#id1
Disclaimer: I haven't personally used TFLite as an ML frontend, but some colleagues have. Personal opinions/judgements; take them as food for thought.
@quic-rneti Is there any relevant example from the Qualcomm AI Hub related to the topic of the issue?
@escorciav Thanks for the quick response :)
About the tutorial: it looks like the conversion step still requires fixed input lengths (like all the snpe-[tflite|onnx|pytorch]-to-dlc scripts). This prevents me from converting Whisper, as one of the self-attention matrix dimensions is the number of generated tokens and thus cannot be fixed.
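For readers hitting the same wall: a common workaround when a converter mandates static shapes is to pad the token sequence to a fixed maximum length and mask out the padding in attention, so every inference call presents the same tensor shape. The sketch below is illustrative only (the names and the max length are made up, and nothing here is part of any Qualcomm SDK):

```python
# Sketch: padding a growing decoder sequence to a fixed length so every
# call to the converted graph sees the same static shape.
# MAX_TOKENS and PAD_ID are illustrative values, not SDK constants.

MAX_TOKENS = 8   # fixed length baked into the converted graph
PAD_ID = 0

def pad_to_fixed(tokens, max_len=MAX_TOKENS, pad_id=PAD_ID):
    """Return (padded_tokens, mask): mask is 1 for real tokens, 0 for padding."""
    if len(tokens) > max_len:
        raise ValueError("sequence exceeds the static length the graph was built for")
    padded = tokens + [pad_id] * (max_len - len(tokens))
    mask = [1] * len(tokens) + [0] * (max_len - len(tokens))
    return padded, mask

# The decoder is always invoked with shape [MAX_TOKENS], regardless of how
# many tokens have actually been generated so far.
generated = [50258, 50359, 50363]          # 3 tokens generated so far
padded, mask = pad_to_fixed(generated)
assert len(padded) == MAX_TOKENS and len(mask) == MAX_TOKENS
```

The cost of this trick is wasted compute on padding positions and a hard cap on sequence length, which is why true dynamic-shape support is preferable when the toolchain offers it.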
My understanding is that they published Whisper using the TensorFlow Lite Hexagon delegate to work around the SNPE fixed input length limitation, but that delegate only works on Android.
I really don't have any preference as to which software stack I use; my goal is to run Whisper on an SM8550 (Snapdragon 8 Gen 2) running Linux. From what I've understood, QNN is the only way to do that?
I see. Sorry that I can't help. Did you post your question in the Slack forum associated with Qualcomm AI Hub? Perhaps consider giving the provided Whisper (encoder | decoder) a shot (deploying it) and making a post about it :shrug:
Re-opening issue to address latest comments. Do consider raising a new issue.
Dear team,
After taking a look at the NLP solutions (QA and sentiment analysis), none of them supports variable input lengths.
Does QNN support variable input length tensors? Newer versions do digest graphs with opset >= 9, and opset >= 11 supports variable-length axes. Thus, it's up to your tooling :)
Moreover, inference with recent NLP models, e.g., autoregressive Transformer decoders, often relies on caching keys & values. Could you please provide any guidance on how to better support such an inference pipeline?
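To make the KV-caching request concrete: at each decoding step the new token's key and value are appended to a cache, and attention is computed over the whole cache, so each step costs O(T) instead of recomputing all T×T attention scores. The toy single-head sketch below (pure Python, no projection weights; all names are illustrative and no SDK API is implied) shows the shape of the pipeline an accelerator runtime would need to support:

```python
import math

# Toy single-head autoregressive decoder step with a key/value cache.
# Real models would project x with learned W_k / W_v matrices; here the
# token vector stands in for its own key and value to keep things short.

def attend(q, k_cache, v_cache):
    """Softmax attention of one query vector over all cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_cache]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, v_cache)) for j in range(d)]

def decode_step(x, k_cache, v_cache):
    """Append this step's key/value, then attend over the whole cache."""
    k_cache.append(x)   # stand-in for W_k @ x
    v_cache.append(x)   # stand-in for W_v @ x
    return attend(x, k_cache, v_cache)

k_cache, v_cache = [], []
for t, x in enumerate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]):
    out = decode_step(x, k_cache, v_cache)
    assert len(k_cache) == t + 1            # cache grows by one entry per token
```

The growing cache is exactly the variable-length tensor the thread is asking about, which is why fixed-shape converters struggle with this inference pattern unless the cache is pre-allocated at a static maximum length.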
Thanks in advance & kudos to all the team for providing examples of AI models using Qualcomm hardware :)