sgl-project / sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
Apache License 2.0

Clarification for wait_for_new_request_delay changes #541

Closed: Qubitium closed this issue 2 weeks ago

Qubitium commented 2 weeks ago

@Ying1123 @merrymercy Tagging the last two people who modified wait_for_new_request_delay.

I have noticed the delay config being tweaked multiple times. Since this variable is sensitive to the deployment environment (network conditions) and GPU inference speed, I would like to know what hardware environment 0.0006 was optimized for. I have some thoughts on how to refactor this away, but I need a better understanding of how sglang uses it to fill batches and maximize throughput, and why 0.0006 is the default. Thanks.

https://github.com/sgl-project/sglang/blob/fb9296f0ed07f4b9fd41f5bd9c670d5a607ae46a/python/sglang/global_config.py#L30

merrymercy commented 2 weeks ago

Yes, this variable is very sensitive and we do not have a good heuristic for setting it.

We tested 0.0006 on some NVIDIA A10G/A100/H100 setups with different CPUs and found it to be a good value. As you said, it depends on many factors, and we will do more testing later when we start optimizing for extreme latency.
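
To make the trade-off in this thread concrete, here is a toy simulation of a scheduler loop that runs one step on whatever requests are queued and then waits a fixed delay before checking the queue again. It is only a sketch under stated assumptions: `simulate_batches`, `step_time`, and `wait_delay` are hypothetical names, the time units are arbitrary, and the loop is not SGLang's actual scheduling code. It only shows why a longer wait tends to produce fewer, larger batches, while a shorter wait reacts sooner with smaller ones.

```python
from typing import List


def simulate_batches(
    arrivals: List[float], step_time: float, wait_delay: float
) -> List[List[float]]:
    """Toy model: group request arrival times into the batches a polling loop would form.

    Assumptions (hypothetical, for illustration only):
    - every forward step takes a fixed `step_time`, regardless of batch size;
    - after each poll the loop sleeps for `wait_delay` before polling again.
    """
    if wait_delay <= 0:
        # With no delay this toy loop would spin forever while the queue is empty.
        raise ValueError("wait_delay must be positive in this toy model")

    pending = sorted(arrivals)
    batches: List[List[float]] = []
    now = 0.0
    i = 0
    while i < len(pending):
        # Collect every request that has arrived by the current time.
        batch = []
        while i < len(pending) and pending[i] <= now:
            batch.append(pending[i])
            i += 1
        if batch:
            batches.append(batch)
            now += step_time  # time spent running the step on this batch
        now += wait_delay  # sleep so that new requests can accumulate
    return batches


if __name__ == "__main__":
    arrivals = [0, 1, 2, 3]
    print(simulate_batches(arrivals, step_time=1, wait_delay=1))  # [[0], [1, 2], [3]]
    print(simulate_batches(arrivals, step_time=1, wait_delay=3))  # [[0], [1, 2, 3]]
```

In this model the best `wait_delay` depends on both the arrival pattern (network conditions) and `step_time` (GPU speed), which is consistent with the point above that a single default is hard to justify across deployments.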