beam-me-up-scotchy opened this issue 7 months ago
Your current environment
🐛 Describe the bug
When querying vLLM as a server, running synchronous requests one at a time results in deterministic output. However, when running concurrent requests, outputs become non-deterministic.
This seems to be the same issue mentioned here: #1182
How can we ensure deterministic outputs when running concurrent requests?
Please see the synchronous and asynchronous scripts below, which run a dummy translation task, to replicate (remember to replace the IP address with the address of the endpoint machine):
Starting the server on the endpoint machine:
python -m vllm.entrypoints.openai.api_server --model /data/huggingface/Mistral-7B-Instruct-v0.2 --tensor-parallel-size=4 --disable-log-stats --disable-log-requests
test_chat_completion.py
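(The original script is not reproduced above. The following is a minimal sketch of what such a synchronous determinism check might look like; the endpoint URL placeholder, port, prompt, and request parameters are assumptions rather than the original values, and the real script presumably loads its prompt from public_test_prompt.txt.)

```python
# Hypothetical sketch of a synchronous determinism check (not the original
# test_chat_completion.py). Sends an identical request N times, one at a
# time, against the OpenAI-compatible vLLM server and compares the outputs.
import requests

# Replace with the address of the endpoint machine (assumption: default port 8000).
URL = "http://<endpoint-ip>:8000/v1/chat/completions"

# Illustrative prompt; the original script presumably reads public_test_prompt.txt.
PAYLOAD = {
    "model": "/data/huggingface/Mistral-7B-Instruct-v0.2",
    "messages": [
        {"role": "user", "content": "Translate to French: The weather is nice today."}
    ],
    "temperature": 0.0,   # greedy decoding, so repeated runs should match
    "max_tokens": 256,
}

outputs = []
for _ in range(20):
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    outputs.append(resp.json()["choices"][0]["message"]["content"])

print("Is every output the same?", len(set(outputs)) == 1)
```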
Output: "Is every output the same? True"
test_async_public.py
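(Again a sketch, not the original test_async_public.py: the only intended difference from the synchronous check above is that the requests are issued concurrently, so the server batches them together. The same placeholder URL, prompt, and parameters are assumed.)

```python
# Hypothetical sketch of the concurrent variant (not the original
# test_async_public.py). Fires the same request many times at once and
# compares the completions.
import asyncio
import httpx

# Replace with the address of the endpoint machine (assumption: default port 8000).
URL = "http://<endpoint-ip>:8000/v1/chat/completions"

# Illustrative prompt; the original script presumably reads public_test_prompt.txt.
PAYLOAD = {
    "model": "/data/huggingface/Mistral-7B-Instruct-v0.2",
    "messages": [
        {"role": "user", "content": "Translate to French: The weather is nice today."}
    ],
    "temperature": 0.0,
    "max_tokens": 256,
}

async def one_request(client: httpx.AsyncClient) -> str:
    resp = await client.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

async def main() -> None:
    async with httpx.AsyncClient() as client:
        # 20 requests in flight at once, so vLLM schedules them in shared batches
        outputs = await asyncio.gather(*(one_request(client) for _ in range(20)))
    print("Is every output the same?", len(set(outputs)) == 1)

asyncio.run(main())
```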
Output: "Is every output the same? False"
public_test_prompt.txt