nigelzzz opened this issue 1 month ago
Hi @nigelzzz, that op is updating the kv_cache so that we avoid redundant computation in the transformer model's internal matrix multiplications. Here's a good article that explains it: https://medium.com/@joaolages/kv-caching-explained-276520203249 (it also covers scaled dot product attention).
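For intuition, here is a minimal sketch of the caching idea in plain PyTorch (illustrative only; `attend_with_cache` is a hypothetical helper, not the ai-edge-torch implementation):

```python
# Hypothetical sketch of the KV-cache idea, NOT ai-edge-torch's actual code.
# On each decode step, only the newest token's K/V are computed; the cached
# rows from earlier steps are reused instead of being recomputed.
import torch

def attend_with_cache(q_new, k_new, v_new, k_cache, v_cache):
    """q_new/k_new/v_new: (1, d) projections for the newest token only."""
    k_cache = torch.cat([k_cache, k_new], dim=0)  # cache grows by one row: (t, d)
    v_cache = torch.cat([v_cache, v_new], dim=0)
    scores = q_new @ k_cache.T / k_cache.shape[-1] ** 0.5  # (1, t)
    out = torch.softmax(scores, dim=-1) @ v_cache          # (1, d)
    return out, k_cache, v_cache
```

Each decode step extends the cache by one row rather than recomputing K/V for the whole prefix, which is where the savings come from.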
@pkgoogle,
I got it, thanks.
If scaled_dot_product_attention is not decomposed into primitive ops (matmul, softmax), how can the calculation be sped up? (Maybe by using the NPU's own ops, such as matmul?)
Hi @pkgoogle , "In my understanding, scale_dot_product_attention is equivalent to a combination of matmul, softmax, mask, and scale operations. If I want to break down scale_dot_product_attention into these four operations, how can I modify the code?"
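For concreteness, here is a minimal sketch of that decomposition in plain PyTorch (illustrative only, checked against PyTorch's fused op rather than ai-edge-torch's layers):

```python
# Hedged sketch: rebuilding scaled_dot_product_attention from the four ops
# named above (matmul, scale, mask, softmax), then checking the result
# against PyTorch's fused implementation.
import math
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(1, 4, 8, 16) for _ in range(3))  # (batch, heads, seq, dim)
mask = torch.tril(torch.ones(8, 8, dtype=torch.bool))   # causal mask

scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # matmul + scale
scores = scores.masked_fill(~mask, float("-inf"))          # mask
manual = torch.softmax(scores, dim=-1) @ v                 # softmax + matmul

fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
assert torch.allclose(manual, fused, atol=1e-5)
```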
Hi @nigelzzz, I would read through the Generative API: https://github.com/google-ai-edge/ai-edge-torch/tree/main/ai_edge_torch/generative. For now, do one example where you just reauthor the basic transformer block rather than changing anything, and make sure that works first. Review the toy model: https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/test_models/toy_model.py. Its forward method shows how the model is defined on the forward pass; you'll see it has a Transformer block, which is defined here: https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/layers/attention.py. Attention comes in numerous forms, so it may not look exactly like the original paper, but keep digging and you'll find something similar.
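For reference, a minimal convert/export sketch in the spirit of the project README (the `TinyAttention` module below is a hypothetical stand-in, not the repo's toy_model; it uses the manual matmul/scale/softmax decomposition instead of the fused SDPA op):

```python
# Hedged sketch of the ai-edge-torch convert/export flow from the README.
# TinyAttention is a made-up stand-in for illustration; a real reauthoring
# would use the Generative API's transformer block instead.
import math
import torch
import ai_edge_torch

class TinyAttention(torch.nn.Module):
    def forward(self, q, k, v):
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        return torch.softmax(scores, dim=-1) @ v

sample = tuple(torch.randn(1, 4, 8, 16) for _ in range(3))
edge_model = ai_edge_torch.convert(TinyAttention().eval(), sample)
edge_model.export("tiny_attention.tflite")
```

Inspecting the exported file in Netron should then show the individual primitive ops rather than one fused attention op.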
Marking this issue as stale since it has been open for 7 days with no activity. This issue will be closed if no further activity occurs.
Description of the bug:
Hi, I have converted TinyLlama to the TFLite format, but when I open it in https://netron.app/, it shows a custom op. Can I find out how this op is used in TensorFlow, and why it isn't expanded into primitive ops (e.g., mul, add, softmax...)?
Actual vs expected behavior:
No response
Any other information you'd like to share?
No response