Qubitium closed this issue 2 weeks ago
Yes, this variable is very sensitive and we do not have a good heuristic for setting it. We tested `0.0006` on some NVIDIA A10G/A100/H100 setups with different CPUs and found it to be a good value. As you said, it depends on many factors, and we will do more testing later when we start optimizing for extreme latency.
@Ying1123 @merrymercy Tagging the last two people who modified `wait_for_new_request_delay`.

I have noticed the delay config getting tweaked multiple times. Since this variable is sensitive to the deployment environment (network conditions) and to GPU inference speed, I would like to know which hardware environment `0.0006` was optimized for. I have some thoughts on how to refactor this away, but I need a better understanding of how sglang uses it to maximize batch fill/throughput and why `0.0006` is used as the default. Thanks.

https://github.com/sgl-project/sglang/blob/fb9296f0ed07f4b9fd41f5bd9c670d5a607ae46a/python/sglang/global_config.py#L30
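
For readers unfamiliar with why a delay like this affects batch fill, here is a minimal, generic sketch of the trade-off. It is not sglang's scheduler code, and I have not confirmed how sglang actually consumes `wait_for_new_request_delay`; the function `collect_batch` and the parameters `delay_s` and `max_batch_size` are hypothetical names used only for illustration. The idea is that the server waits up to a small window for more requests to arrive before running a batch: a longer window fills batches better but adds latency to every request.

```python
import queue
import time


def collect_batch(request_queue, delay_s=0.0006, max_batch_size=8):
    """Gather requests for up to `delay_s` seconds, or until the batch is full.

    Hypothetical illustration only; not taken from the sglang code base.
    """
    batch = []
    deadline = time.monotonic() + delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        try:
            if remaining > 0:
                # Still inside the wait window: block briefly for a new request.
                item = request_queue.get(timeout=remaining)
            else:
                # Window elapsed: take only what is already queued.
                item = request_queue.get_nowait()
        except queue.Empty:
            break
        batch.append(item)
    return batch


if __name__ == "__main__":
    q = queue.Queue()
    for i in range(3):
        q.put(f"req-{i}")
    print(collect_batch(q))
```

In a sketch like this, the best `delay_s` depends on how quickly requests arrive (network conditions) relative to how long one batch takes on the GPU, which is why a single fixed default is hard to justify across deployments.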