zhengwei-gao opened 4 months ago
This seems to be the same issue as #4091, which I posted yesterday, and #1182, which was previously closed. Any advice or fixes to make results deterministic when doing concurrent API calls would be appreciated!
I am also hitting this issue with the latest version.
Also seeing this issue.
Can I also add that this is a fundamentally breaking problem.
Not being able to send more than one request to vLLM at a time renders it pretty much unusable in almost all circumstances. This is a massive bug and should be a high priority to diagnose and fix.
The whole strength of vLLM is its ability to handle multiple requests; that is what makes it pretty damn awesome. But here we are in a position where we can't send it multiple requests.
This is expected behaviour. The effective ordering of floating point operations varies when particular sequences are batched differently: internally, different algorithms are used for the matmul operations depending on the batch size, etc. Though mathematically equivalent, these produce different results because of the limited precision, since floating point arithmetic is not associative. Differences become more likely the more output tokens you generate (because the "errors" accumulate), and once a different token is chosen the outputs are almost certain to diverge.
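To see the non-associativity concretely, here is a minimal pure-Python illustration (nothing vLLM-specific is assumed):

```python
# Floating point addition is not associative: regrouping changes the result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)  # 0.6000000000000001
print(a + (b + c))  # 0.6
# A kernel that reduces the same values in a different order can therefore
# yield slightly different logits for the very same input sequence.
```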
You can reduce the variability by using float16 rather than bfloat16 (pass --dtype=float16). You could also try float32, which should be even more stable, but it will require double the memory and performance might be worse.
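For example, a minimal sketch using the offline LLM API (the model path is the one from the report below, and dtype="float16" is the programmatic equivalent of --dtype=float16):

```python
from vllm import LLM, SamplingParams

# float16 instead of the default bfloat16 reduces the accumulated error.
llm = LLM(model="/models/Qwen/Qwen1.5-14B-Chat", dtype="float16")
params = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["Hello, world"], params)[0].outputs[0].text)
```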
That doesn't make sense.
I can send the same thing over and over again and get exactly the same result, if I do it one at a time.
Why would concurrency change the values when sending requests at different times doesn't?
If there were a discrepancy built into the floating point arithmetic itself, you would expect non-determinism whenever you made calls; you would just expect some of them to be different.
We have run hundreds and they are always 100% the same unless we do it concurrently.
Edit:
Even if you are right, and the answer is that the hundreds of runs we did individually were insufficient to see the floating point discrepancy, but it is there and we would just need thousands of individual runs...
Great, but why then is it immediately visible when doing concurrency?
If it is 1 in 1,000 individually, why is it 1 in 2 for concurrency?
There must be something else going on.
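The two observations are actually consistent: a request that runs alone is always batched the same way, so the floating point reduction order is identical from run to run and the output reproduces exactly; under concurrency the request lands in a batch whose size and composition vary between runs, so a different kernel path (each deterministic on its own) can be taken each time. Below is a minimal PyTorch sketch of the batch-size effect, assuming a CUDA GPU; it illustrates the mechanism and is not vLLM's actual kernels.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

alone   = x[:1] @ w     # row 0 computed in a batch of 1
batched = (x @ w)[:1]   # the same row computed in a batch of 8

# Re-running either line reproduces its own result exactly, yet the two
# variants can dispatch to different kernels / reduction orders, so the
# results for the identical row can differ bit-wise.
print((alone - batched).abs().max().item())
```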
Voting for this issue; it's very important. In my case the input is very long (say 1000 tokens) and only one token is output; different results may still be returned for identical requests when concurrency is high.
Your current environment
Using the latest Docker image, started with:
docker run --runtime nvidia --gpus all -v /mnt_1/models:/models -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model /models/Qwen/Qwen1.5-14B-Chat --served-model-name Qwen --gpu-memory-utilization 0.99 --max-model-len 8192 --tensor-parallel-size 1 --seed 42
🐛 Describe the bug
I generated 10 results concurrently with the following code, but I got different results with temperature=0, even though most of them are the same.
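A minimal sketch of an equivalent reproducer against the OpenAI-compatible endpoint started above (the prompt, port, and max_tokens are assumptions, not the original script):

```python
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"  # server from the docker command above

def ask(_):
    resp = requests.post(URL, json={
        "model": "Qwen",  # matches --served-model-name above
        "messages": [{"role": "user", "content": "Explain quantization briefly."}],
        "temperature": 0,
        "max_tokens": 128,
    })
    return resp.json()["choices"][0]["message"]["content"]

# Fire 10 identical greedy requests concurrently.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(ask, range(10)))

# With temperature=0 one would expect 10 identical strings,
# but under concurrent batching some of them can differ.
print(len(set(results)), "distinct outputs")
```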
The following are the printed results: