This question is better asked in the flashinfer repo, but I'd love to clarify here:
Ragged attention APIs accept keys/values stored in ragged tensors: each request may have a different number of tokens, but each request's KV-cache storage is still contiguous. Paged attention APIs accept keys/values stored in sparse tensors (a page table is a special form of sparse tensor), which incurs some overhead from sparse data loading inside the kernels. The gap between the ragged attention API and the paged attention API is around 10% or more, depending on the hardware.
We encourage using the paged attention API for keys/values already stored in a page table, and the ragged attention API for contiguous keys/values (for example, in the pure prefill stage, key/value tensors are contiguous before we append them to the paged KV-cache).
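For intuition, here is a minimal PyTorch sketch of the two layouts (the tensor names and sizes are made up for illustration, not FlashInfer's internals): the ragged layout reads one request's KV with a plain contiguous slice, while the paged layout gathers pages through an extra level of indirection, which is where the in-kernel overhead comes from.

```python
import torch

num_heads, head_dim, page_size = 8, 128, 16

# Ragged layout: the KV of all requests is packed into one contiguous
# tensor; kv_indptr marks request boundaries, so reading request i's
# KV is a simple contiguous slice kv[kv_indptr[i]:kv_indptr[i+1]].
kv_indptr = torch.tensor([0, 5, 12, 20])          # 3 requests, 20 tokens total
k_ragged = torch.randn(20, num_heads, head_dim)
req1_k = k_ragged[kv_indptr[1]:kv_indptr[2]]      # contiguous read

# Paged layout: KV lives in fixed-size pages; a page table maps each
# request to (possibly non-adjacent) page indices, so the kernel must
# follow the indirection and gather scattered pages.
k_pages = torch.randn(64, page_size, num_heads, head_dim)
page_table = torch.tensor([7, 2, 40])             # one request's page ids
req_k = k_pages[page_table].reshape(-1, num_heads, head_dim)  # gather
```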
Thank you for your quick reply! I have made some comparisons using nvbench in this issue. I will go ahead and close this issue.
Hi,
Thank you for your excellent work. I noticed that using FlashInfer in the prefill stage involves two wrappers and a heuristic based on token count. Specifically, I found this in the code:
https://github.com/sgl-project/sglang/blob/4af3f889fc6f406c0fc3b7a310e3ad7220b01ff6/python/sglang/srt/layers/attention/flashinfer_backend.py#L138
I am curious about the differences between these two wrappers and how much they affect performance. Could someone help me understand this? Thank you very much!
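For readers following along, here is a minimal sketch of what a token-count dispatch between the two wrappers might look like. The wrapper classes are FlashInfer's real ones (constructor arguments may differ across versions), but the threshold name and value are hypothetical, not sglang's actual logic; see the linked source for that.

```python
import torch
import flashinfer

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# Both prefill wrappers exist in FlashInfer; "NHD" is the KV layout.
ragged_wrapper = flashinfer.BatchPrefillWithRaggedKVCacheWrapper(workspace, "NHD")
paged_wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")

RAGGED_TOKEN_THRESHOLD = 4096  # hypothetical cutoff, not sglang's actual value

def pick_prefill_wrapper(num_extend_tokens: int):
    # Large contiguous prefills favor the ragged path (no page-table
    # indirection inside the kernel); otherwise use the paged path,
    # which reads KV already stored in the paged cache.
    if num_extend_tokens >= RAGGED_TOKEN_THRESHOLD:
        return ragged_wrapper
    return paged_wrapper
```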