sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

Question about ragged wrapper #2172

Closed · ZhongYingMatrix closed this 15 hours ago

ZhongYingMatrix commented 17 hours ago

Hi,

Thank you for your excellent work. I noticed that using FlashInfer in the prefill stage involves two wrappers and a heuristic based on token count. Specifically, I found this in the code:

https://github.com/sgl-project/sglang/blob/4af3f889fc6f406c0fc3b7a310e3ad7220b01ff6/python/sglang/srt/layers/attention/flashinfer_backend.py#L138

I am curious about the differences between these two wrappers and how much they affect performance. Could someone help me understand this? Thank you very much!
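For context, here is a minimal sketch of the kind of token-count heuristic the question refers to. The function name, arguments, and threshold are all illustrative, not SGLang's actual identifiers:

```python
# Hypothetical sketch: prefer the ragged wrapper only for large, pure-prefill
# batches whose keys/values are still contiguous; otherwise use the paged
# wrapper that reads from the KV-cache page table.
def choose_prefill_wrapper(total_extend_tokens: int, has_prefix_cache: bool,
                           threshold: int = 4096) -> str:
    use_ragged = (not has_prefix_cache) and total_extend_tokens >= threshold
    return "ragged" if use_ragged else "paged"
```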

yzh119 commented 17 hours ago

This question is better asked in the flashinfer repo, but I'd be glad to clarify here:

Ragged attention APIs accept keys/values stored in ragged tensors: each request may have a different number of tokens, but the KV storage for each request is still contiguous. Paged attention APIs accept keys/values stored in sparse tensors (a page table is a special form of sparse tensor), which incurs some overhead from sparse data loading inside the kernels. The gap between the ragged and paged attention APIs is around 10+%, depending on the hardware.
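To make the two layouts concrete, here is a small sketch; the shapes and index values are illustrative:

```python
import torch

num_kv_heads, head_dim, page_size = 4, 128, 4

# Ragged layout: keys/values for all requests packed back-to-back in one
# contiguous tensor; request i owns rows kv_indptr[i]:kv_indptr[i+1].
kv_indptr = torch.tensor([0, 5, 12, 20])            # 3 requests: 5, 7, 8 tokens
k_ragged = torch.randn(20, num_kv_heads, head_dim)  # contiguous storage

# Paged layout: the same 20 tokens scattered over fixed-size pages and
# addressed through a page table, i.e. a special form of sparse tensor.
paged_kv_indices = torch.tensor([3, 9, 1, 7, 0, 4])  # physical page ids
paged_kv_indptr = torch.tensor([0, 2, 4, 6])         # request i uses pages indptr[i]:indptr[i+1]
paged_kv_last_page_len = torch.tensor([1, 3, 4])     # valid tokens in each request's last page
```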

We encourage using the paged attention API for keys/values already stored in the page table, and the ragged attention API for contiguous keys/values (for example, in the pure prefill stage, the key/value tensors are contiguous before we append them to the paged KV-cache).
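A minimal end-to-end sketch of the two prefill wrappers, assuming a recent FlashInfer release: the wrapper class names are FlashInfer's, but the `plan`/`run` methods replaced the older `begin_forward`/`forward`, exact signatures vary by version, and all shapes and index values below are made up:

```python
import torch
import flashinfer

device, dtype = "cuda", torch.float16
num_qo_heads, num_kv_heads, head_dim, page_size = 32, 32, 128, 16
# One workspace buffer, reused sequentially by the two wrappers.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device=device)

# Pure prefill: queries and keys/values are contiguous, so the ragged
# wrapper applies; kv_indptr equals qo_indptr since nothing is cached yet.
qo_indptr = torch.tensor([0, 33, 44, 61], dtype=torch.int32, device=device)
q = torch.randn(61, num_qo_heads, head_dim, dtype=dtype, device=device)
k = torch.randn(61, num_kv_heads, head_dim, dtype=dtype, device=device)
v = torch.randn_like(k)

ragged = flashinfer.BatchPrefillWithRaggedKVCacheWrapper(workspace, "NHD")
ragged.plan(qo_indptr, qo_indptr, num_qo_heads, num_kv_heads, head_dim,
            causal=True)
out_ragged = ragged.run(q, k, v)

# Keys/values already appended to the paged KV-cache: use the paged wrapper,
# which walks the page table (indices / indptr / last_page_len as above).
max_num_pages = 8
kv_cache = torch.randn(max_num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=dtype, device=device)
paged_kv_indptr = torch.tensor([0, 3, 6, 8], dtype=torch.int32, device=device)
paged_kv_indices = torch.arange(8, dtype=torch.int32, device=device)
paged_kv_last_page_len = torch.tensor([1, 12, 13], dtype=torch.int32,
                                      device=device)

paged = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")
paged.plan(qo_indptr, paged_kv_indptr, paged_kv_indices,
           paged_kv_last_page_len, num_qo_heads, num_kv_heads, head_dim,
           page_size, causal=True)
out_paged = paged.run(q, kv_cache)
```

The heuristic in `flashinfer_backend.py` essentially decides, per batch, which of these two plan/run paths to take.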

ZhongYingMatrix commented 15 hours ago

> This question is better asked in the flashinfer repo, but I'd be glad to clarify here:
>
> Ragged attention APIs accept keys/values stored in ragged tensors: each request may have a different number of tokens, but the KV storage for each request is still contiguous. Paged attention APIs accept keys/values stored in sparse tensors (a page table is a special form of sparse tensor), which incurs some overhead from sparse data loading inside the kernels. The gap between the ragged and paged attention APIs is around 10+%, depending on the hardware.
>
> We encourage using the paged attention API for keys/values already stored in the page table, and the ragged attention API for contiguous keys/values (for example, in the pure prefill stage, the key/value tensors are contiguous before we append them to the paged KV-cache).

Thank you for your quick reply! I have made some comparisons using nvbench in this issue. I will go ahead and close this one.