nigelzzz opened this issue 1 month ago
Hi @nigelzzz, that op is updating the kv_cache so that we avoid redundant computation in the transformer model's internal matrix multiplications. Here's a good article that explains it: https://medium.com/@joaolages/kv-caching-explained-276520203249 (it also covers scaled dot product attention).
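For intuition, here is a minimal sketch of the caching idea in plain PyTorch (illustrative only; `attend_with_cache` is a hypothetical helper, not the ai-edge-torch implementation):

```python
# Hypothetical sketch of the KV-cache idea, NOT ai-edge-torch's actual code.
# On each decode step, only the newest token's K/V are computed; the cached
# rows from earlier steps are reused instead of being recomputed.
import torch

def attend_with_cache(q_new, k_new, v_new, k_cache, v_cache):
    """q_new/k_new/v_new: (1, d) projections for the newest token only."""
    k_cache = torch.cat([k_cache, k_new], dim=0)  # cache grows by one row: (t, d)
    v_cache = torch.cat([v_cache, v_new], dim=0)
    scores = q_new @ k_cache.T / k_cache.shape[-1] ** 0.5  # (1, t)
    out = torch.softmax(scores, dim=-1) @ v_cache          # (1, d)
    return out, k_cache, v_cache
```

Each decode step extends the cache by one row rather than recomputing K/V for the whole prefix, which is where the savings come from.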
@pkgoogle,
I got it, thanks.
If scaled_dot_product_attention is not decomposed into primitive ops (matmul, softmax), how can the calculation be sped up? (Maybe by using the NPU's own ops, such as matmul?)
Hi @pkgoogle , "In my understanding, scale_dot_product_attention is equivalent to a combination of matmul, softmax, mask, and scale operations. If I want to break down scale_dot_product_attention into these four operations, how can I modify the code?"
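For concreteness, here is a minimal sketch of that decomposition in plain PyTorch (illustrative only, checked against PyTorch's fused op rather than ai-edge-torch's layers):

```python
# Hedged sketch: rebuilding scaled_dot_product_attention from the four ops
# named above (matmul, scale, mask, softmax), then checking the result
# against PyTorch's fused implementation.
import math
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(1, 4, 8, 16) for _ in range(3))  # (batch, heads, seq, dim)
mask = torch.tril(torch.ones(8, 8, dtype=torch.bool))   # causal mask

scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # matmul + scale
scores = scores.masked_fill(~mask, float("-inf"))          # mask
manual = torch.softmax(scores, dim=-1) @ v                 # softmax + matmul

fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
assert torch.allclose(manual, fused, atol=1e-5)
```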
Hi @nigelzzz, I would read through the Generative API: https://github.com/google-ai-edge/ai-edge-torch/tree/main/ai_edge_torch/generative. For now, do one example where you just reauthor the basic transformer block rather than changing anything, and make sure that works first. Review the toy model: https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/test_models/toy_model.py. Its forward method shows how the model is defined on the forward pass; you'll see it has a Transformer block, which is defined here: https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/layers/attention.py. Attention comes in numerous forms, so it may not look exactly like the original paper, but keep digging and you'll find something similar.
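For reference, a minimal convert/export sketch in the spirit of the project README (the `TinyAttention` module below is a hypothetical stand-in, not the repo's toy_model; it uses the manual matmul/scale/softmax decomposition instead of the fused SDPA op):

```python
# Hedged sketch of the ai-edge-torch convert/export flow from the README.
# TinyAttention is a made-up stand-in for illustration; a real reauthoring
# would use the Generative API's transformer block instead.
import math
import torch
import ai_edge_torch

class TinyAttention(torch.nn.Module):
    def forward(self, q, k, v):
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        return torch.softmax(scores, dim=-1) @ v

sample = tuple(torch.randn(1, 4, 8, 16) for _ in range(3))
edge_model = ai_edge_torch.convert(TinyAttention().eval(), sample)
edge_model.export("tiny_attention.tflite")
```

Inspecting the exported file in Netron should then show the individual primitive ops rather than one fused attention op.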
Marking this issue as stale since it has been open for 7 days with no activity. This issue will be closed if no further activity occurs.
Description of the bug:
Hi, I have converted TinyLlama to the TFLite format, but when I open it in https://netron.app/, it shows a custom op. Can I find out how this op is used in TensorFlow, and why it isn't expanded into primitive ops (e.g., mul, add, softmax...)?
Actual vs expected behavior:
No response
Any other information you'd like to share?
No response