sgl-project / sglang

SGLang is yet another fast serving framework for large language models and vision language models.

flushing cache effect on throughput #268

Open amirarsalan90 opened 4 months ago

amirarsalan90 commented 4 months ago

When running a model with --model-mode flashinfer (I have tested mistralai/Mistral-7B-Instruct-v0.2) on a large batch (e.g., 50,000 text inputs), I usually see that throughput is high for the first few minutes and then starts degrading.

Would calling flush_cache every few iterations (say, every 256 iterations) improve anything? Does it make sense to split the inputs into batches of 256 and hit http://0.0.0.0:80000/flush_cache before sending the requests for each batch?
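Roughly what I have in mind (a minimal sketch, not tested; the /generate payload shape and sampling parameters are assumptions on my part, and a real run would send requests concurrently rather than one at a time):

```python
import requests

SERVER = "http://0.0.0.0:80000"  # same address/port as above
BATCH_SIZE = 256

# Placeholder inputs: a shared system prompt followed by diverse user text.
system_prompt = "You are a helpful assistant.\n"
prompts = [f"{system_prompt}Question {i}: ..." for i in range(50_000)]

def generate_batch(batch):
    """Send one /generate request per prompt and collect the responses."""
    outputs = []
    for prompt in batch:
        resp = requests.post(
            f"{SERVER}/generate",
            json={
                "text": prompt,
                "sampling_params": {"max_new_tokens": 128, "temperature": 0},
            },
        )
        outputs.append(resp.json())
    return outputs

results = []
for i in range(0, len(prompts), BATCH_SIZE):
    batch = prompts[i : i + BATCH_SIZE]
    results.extend(generate_batch(batch))
    # Flush the radix cache between batches to see whether eviction
    # overhead is what is degrading throughput.
    requests.get(f"{SERVER}/flush_cache")
```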

comaniac commented 4 months ago

That depends on your use case. If you only get high throughput for the first few requests, I imagine your requests share few common prefixes, so you barely benefit from RadixAttention but still pay the cache eviction overhead. In that case, flushing the cache may help throughput.
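To illustrate what I mean by prefix sharing: RadixAttention can only reuse the KV cache for the longest common token prefix between a new request and what is already cached, so if only a short prefix is shared, most of each request is a cache miss. A toy sketch (not the actual radix tree implementation, just the idea):

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the common token prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Toy token sequences: a short shared "system prompt" followed by
# long, diverse user content.
system = list(range(20))                 # 20 shared tokens
req_a = system + [101, 102, 103] * 200   # 620 tokens total
req_b = system + [201, 202, 203] * 200

reusable = shared_prefix_len(req_a, req_b)
print(f"reusable prefix: {reusable} / {len(req_b)} tokens "
      f"({reusable / len(req_b):.1%})")  # ~3% of the KV cache is reusable
```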

amirarsalan90 commented 4 months ago

Thanks! My requests all have a common prefix (sys prompt), but they are diverse after the prefix.

comaniac commented 4 months ago

That should still benefit from RadixAttention, so you shouldn't see throughput dropping after a while unless your system prompt is very short. Anyway, you can first try flushing the cache every N requests to see if that helps.

Qubitium commented 4 months ago

@comaniac This thread is fascinating. My naive question is: how can cache evictions be so costly for RadixAttention that they cause throughput slowdowns? I assume the caches are just mapped blocks of GPU memory (non-contiguous), and on cache eviction one may need to delete them and/or do some segmented memory merges? Doesn't CUDA offer async memory ops? Again, I ask out of curiosity as to how such ops can become a bottleneck.

comaniac commented 4 months ago

I agree with your point. Since I don't have any more details about this case, this is just my guess. The real bottleneck behind this throughput drop could be anywhere, and we'll need detailed logs or a reproducible example to dive into it.