Open · FredericOdermatt opened this issue 1 month ago
Also able to confirm this; I also get it with FlashInfer on vLLM.
@FredericOdermatt @jonzhep This is very helpful. We will take a close look this week and hopefully fix it soon.
This is what I got when running your example commands (Normal server start) on 8xH100 with the current main (87a7cfa080cec3f123618c1429)
It basically reproduces what you described, although not as bad as what you showed. I will start investigating. May I know the hardware you are using? You can also get that by running `python3 -m sglang.check_env`.
I was running this on either 8× RTX A6000s or 4× A100s. The plot above is from the RTX A6000s.
This has been one of the biggest issues we've known about for a while. In short, I believe that dynamic batching introduces these variances because different batch sizes dispatch different kernels. We checked the engine implementation and did not find any noticeable bugs (e.g., incorrect caching). We will continue investigating and may introduce a "deterministic mode" as a short-term solution. This mode will use additional padding to increase determinism, although it will run more slowly.
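To illustrate the kernel-dispatch point, here is a minimal, self-contained PyTorch sketch (illustrative only, not SGLang internals): the same hidden state multiplied by the same weight is not guaranteed to be bitwise identical when it is computed alone versus padded into a larger batch, because different batch shapes can select different GEMM kernels and reduction orders, and tiny per-layer differences can compound across a deep model and occasionally flip a sampled token.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
torch.manual_seed(0)

x = torch.randn(1, 4096, dtype=dtype, device=device)      # one request's hidden state
w = torch.randn(4096, 4096, dtype=dtype, device=device)   # a layer weight

# Same row computed in a batch of 1 ...
out_alone = x @ w
# ... and inside a batch of 64, which is what dynamic batching can do to the same request.
batch = torch.cat([x, torch.randn(63, 4096, dtype=dtype, device=device)])
out_batched = (batch @ w)[:1]

# On GPU this max difference is often non-zero: different shapes can pick
# different GEMM kernels and reduction orders.
print((out_alone - out_batched).abs().max().item())
```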
This is very helpful! I raised a similar issue in vLLM (https://github.com/vllm-project/vllm/issues/10074), and I think it has the same root cause.
BTW, I believe chunked prefill may increase the likelihood of the variance, as I've observed in my case with vLLM. The default strategy in vLLM, which uses first-come-first-serve and prioritizes prefill requests, tends to mask this variance (the batch size is more likely to be consistent between two prefill executions in separate runs).
Checklist
Describe the bug
Background
This bug might be related to #1316.
When asking the model a block of questions it should answer with `yes`, followed by a block of questions that should be answered by `no`, a degradation in quality can be observed for some runs when running the same data many times.

Standard
lmsysorg/sglang:v0.3.3.post1-cu121-srt
Asking the same 40 yes and 40 no questions 200 times and recording logit averages. Blue: questions that should be answered yes, average yes logit (post-softmax). Orange: questions that should be answered no, average yes logit (post-softmax). (Please check the minimal reproducible sample here.)
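For reference, a rough sketch of this kind of measurement loop against the OpenAI-compatible endpoint that `sglang.launch_server` exposes. The example questions, the concurrency setup, and the assumption that the server returns OpenAI-style `logprobs`/`top_logprobs` on chat completions are placeholders, so treat this as an outline rather than the linked reproduction script.

```python
import math
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# Points at the OpenAI-compatible API of the server launched above.
client = OpenAI(base_url="http://localhost:30001/v1", api_key="EMPTY")
MODEL = "mistralai/Mixtral-8x22B-Instruct-v0.1"

# Placeholders: the real runs use 40 "yes" and 40 "no" questions.
yes_questions = ["Is Paris the capital of France? Answer with yes or no."]
no_questions = ["Is the Moon larger than the Earth? Answer with yes or no."]

def prob_of_yes(question: str) -> float:
    # Greedy single-token answer; assumes the server honors OpenAI-style
    # logprobs/top_logprobs on chat completions (support may vary by version).
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        max_tokens=1,
        temperature=0.0,
        logprobs=True,
        top_logprobs=5,
    )
    for cand in resp.choices[0].logprobs.content[0].top_logprobs:
        if cand.token.strip().lower() == "yes":
            return math.exp(cand.logprob)  # post-softmax probability of "yes"
    return 0.0

for run in range(200):
    # Send each block concurrently so the server actually batches the requests.
    with ThreadPoolExecutor(max_workers=16) as pool:
        yes_avg = sum(pool.map(prob_of_yes, yes_questions)) / len(yes_questions)
        no_avg = sum(pool.map(prob_of_yes, no_questions)) / len(no_questions)
    print(f"run {run}: avg P(yes | yes-questions)={yes_avg:.4f}, "
          f"avg P(yes | no-questions)={no_avg:.4f}")
```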
Restricted
lmsysorg/sglang:v0.3.3.post1-cu121-srt
Adding the following flags (see the restricted server start command under Reproduction below) and running 100 times:
Observations
- The model answers with the `yes` token for questions that should be answered with yes when set up correctly.
- With v0.2.6 the questions are equally answered `yes` (simply commenting out the 40 questions that should be answered with `no`). This observation makes me suspect a caching mechanism.

Further notes
Reproduction
Current minimal reproducible example here
Normal server start
python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x22B-Instruct-v0.1 --random-seed 42 --tp-size 8 --dp-size 1 --host 0.0.0.0 --port 30001
Restricted server start
python3 -m sglang.launch_server --model-path mistralai/Mixtral-8x22B-Instruct-v0.1 --attention-backend triton --sampling-backend pytorch --disable-radix-cache --disable-regex-jump-forward --disable-cuda-graph --disable-cuda-graph-padding --disable-disk-cache --disable-custom-all-reduce --disable-mla --random-seed 42 --tp-size 8 --dp-size 1 --host 0.0.0.0 --port 30001
Environment
Environment for problematic runs
lmsysorg/sglang:v0.3.3.post1-cu121-srt