sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

test select concurrency #2165

Open qeternity opened 1 day ago

qeternity commented 1 day ago

This is further to some discussion in Slack: `select` is very unstable under even moderate concurrency.

We discovered this while investigating some other issues we've experienced in recent versions of sglang.

I'm not sure where this test best fits in the test suite, so I'm happy to move it. A minimal repro is sketched below.
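
For reference, a minimal sketch of the kind of test in question, assuming a local SGLang server on `http://localhost:30000`; the prompt, choice set, and concurrency level are illustrative, not the exact values from the submitted test:

```python
import sglang as sgl

@sgl.function
def pick(s, question):
    s += "Question: " + question + "\nAnswer: "
    s += sgl.select("answer", choices=["yes", "no"])

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Fire the identical program many times in parallel; with a fixed prompt,
# the selected choice should come back identical across all runs.
states = pick.run_batch([{"question": "Is the sky blue?"}] * 64, num_threads=32)
answers = {s["answer"] for s in states}
assert len(answers) == 1, f"select was unstable under concurrency: {answers}"
```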

merrymercy commented 13 hours ago

Thanks for contributing the test case. This is a known problem: https://sgl-project.github.io/references/faq.html#the-results-are-not-deterministic-even-with-a-temperature-of-0. If you are interested, please help us add a padded batching mode.

https://github.com/sgl-project/sglang/blob/538fa0ae135c4e7ef70c65439359eff7bec2b616/docs/references/faq.md?plain=1#L11
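
For the record, a conceptual sketch of what a padded batching mode could do; this is not SGLang's internals, and `PAD_ID` and the function name are illustrative. The idea is to pad every sequence in a batch to a common length so the batched kernels always see the same shapes, removing one source of batch-dependent numerics:

```python
from typing import List

PAD_ID = 0  # assumed pad token id, for illustration

def pad_batch(input_ids: List[List[int]]) -> List[List[int]]:
    # Left-pad all sequences to the length of the longest one, so tensor
    # shapes are stable regardless of how requests happen to be batched.
    max_len = max(len(ids) for ids in input_ids)
    return [[PAD_ID] * (max_len - len(ids)) + ids for ids in input_ids]
```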

qeternity commented 13 hours ago

Hi @merrymercy - this is not a determinism bug. You can generate the same text with top_k=1 or with a regex, at much higher concurrency, and it passes every time. The issue is specific to `select`.
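
To illustrate, a sketch of the constrained-`gen` path that stays stable for us, under the same assumed server as above (greedy decoding via `temperature=0.0` stands in for top_k=1 here):

```python
import sglang as sgl

@sgl.function
def pick_via_regex(s, question):
    s += "Question: " + question + "\nAnswer: "
    # Constrain normal generation to the same choice set and decode greedily.
    s += sgl.gen("answer", max_tokens=2, temperature=0.0, regex=r"(yes|no)")
```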

qeternity commented 12 hours ago

I added a regular `gen` test at much greater concurrency to illustrate the above. As you can see, the test still fails only on the `select` invocation. The test is configured so that the two paths should be net equivalent, even given select's different behavior (at least I think this is correct). Further, this applies to all `choices` sampling methods.
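
Roughly how the two paths line up, building on the two sketches above; since both programs are constrained to the same choice set, their outputs should agree prompt-for-prompt even at high concurrency:

```python
prompts = [{"question": "Is the sky blue?"}] * 256

via_select = pick.run_batch(prompts, num_threads=64)
via_regex = pick_via_regex.run_batch(prompts, num_threads=64)

for a, b in zip(via_select, via_regex):
    # In practice the regex path agrees with itself every time;
    # only the select path flakes.
    assert a["answer"] == b["answer"]
```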

merrymercy commented 12 hours ago

I see. I think the real cause is still non-determinism, in this case of the input logprobs, because select depends on input logprobs. Can you use regex / normal decoding for your current use cases? We will probably not fix this issue if it is not a regression. We will revisit it later with a more fundamental solution.
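
Roughly, select is scored along these lines (a simplified sketch, not the actual implementation; `prompt_logprobs` is a stand-in for whatever the backend returns):

```python
from typing import Callable, List

def score_choices(
    prompt: str,
    choices: List[str],
    prompt_logprobs: Callable[[str, str], List[float]],
) -> str:
    # Append each choice to the prompt and compare the length-normalized
    # sum of the choice tokens' input logprobs; small numeric jitter in
    # those logprobs can flip the argmax between closely scored choices.
    scores = [sum(lps) / len(lps)
              for lps in (prompt_logprobs(prompt, c) for c in choices)]
    return choices[max(range(len(choices)), key=scores.__getitem__)]
```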

qeternity commented 12 hours ago

Yes, we can. But this line of investigation actually started because we were seeing very flaky JSON generation, and unfortunately that triggers easily at the traffic levels we serve in prod (sketch below).

I fully appreciate that batching and kernel non-determinism exist, but this feels like a deeper issue.
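
For context, a sketch of the kind of JSON-constrained generation that was flaking for us; the schema regex is a toy stand-in for the real one:

```python
import sglang as sgl

@sgl.function
def make_json(s, subject):
    s += "Describe " + subject + " as JSON.\n"
    s += sgl.gen(
        "json_output",
        max_tokens=64,
        temperature=0.0,
        regex=r'\{"name": "[a-zA-Z ]+", "color": "[a-z]+"\}',
    )
```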