Clarification for wait_for_new_request_delay changes

sgl-project / sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.

Apache License 2.0


@Ying1123 @merrymercy Tagging the last two people who modified wait_for_new_request_delay.

I have noticed that the delay config has been tweaked several times. Since this variable is sensitive to both the deployment environment (network conditions) and GPU inference speed, I would like to know which hardware environment the value 0.0006 was optimized for. I have some thoughts on how to refactor this away, but I need a better understanding of how sglang uses it to fill batches and maximize throughput, and why 0.0006 is the default. Thanks.

https://github.com/sgl-project/sglang/blob/fb9296f0ed07f4b9fd41f5bd9c670d5a607ae46a/python/sglang/global_config.py#L30
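To make the concern concrete, here is a toy simulation of why a fixed delay is environment-sensitive. It is only a sketch under my own assumptions and not sglang's actual scheduler: the function `fill_batch`, the arrival times, the stop rule, and treating the delay as seconds are all hypothetical.

```python
from typing import List, Tuple


def fill_batch(arrivals: List[float], delay: float, max_batch: int) -> Tuple[int, float]:
    """Toy model (hypothetical, not sglang code): after the first request,
    keep waiting `delay` seconds at a time for as long as each wait window
    brings in at least one new request.

    arrivals: sorted request arrival times, assumed to be in seconds.
    Returns (batch_size, time_waited_after_first_request).
    """
    if not arrivals:
        return 0, 0.0
    now = arrivals[0]
    taken = 1
    while taken < max_batch:
        deadline = now + delay
        # Count not-yet-batched requests that arrive within this wait window.
        new = sum(1 for t in arrivals[taken:] if t <= deadline)
        if new == 0:
            break  # nothing arrived in time, so run with what we have
        taken = min(max_batch, taken + new)
        now = deadline
    return taken, now - arrivals[0]


# Requests arriving 1 ms apart, e.g. over a slower network link (made-up numbers).
arrivals = [0.000, 0.001, 0.002, 0.003]
small = fill_batch(arrivals, delay=0.0006, max_batch=8)
large = fill_batch(arrivals, delay=0.0025, max_batch=8)
print("delay=0.0006 ->", small)
print("delay=0.0025 ->", large)
```

In this toy model the shorter delay gives up before the second request arrives, while the longer one fills the batch at the cost of extra waiting, which is why I suspect the best value depends on the network and GPU.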


Clarification for wait_for_new_request_delay changes #541