vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: VLLM's output is unstable version==0.5.1 #6328

Open ffxmm opened 2 months ago

ffxmm commented 2 months ago

Your current environment

Using version==0.5.1 Docker images with model qwen2-GPTQ-Int4. Command:

docker run -it --rm --gpus '"device=0,7,8,9"' -p 8090:8090 \
  -e NCCL_P2P_DISABLE=1 -e NCCL_SHM_DISABLE=1 \
  -v /nfs2:/nfs2 -v /var:/var -v /nfs3:/nfs3 -v /nfs5:/nfs5 \
  --shm-size 20g XXXXXXXXX:XXXXXXX \
  python3 -m vllm.entrypoints.openai.api_server \
    --host=0.0.0.0 --port=8090 \
    --model=XXXX/source/deps --served-model-name=qwen2-GPTQ-Int4 \
    --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --seed 42

🐛 Describe the bug

I generated 50 results by calling http://XXXX/v1/chat/completions with params like:

   "model": "qwen2-GPTQ-Int4",
    "temperature": 0,
    "n": 1,
    "best_of":1,
    "presence_penalty":0.0,
    "frequency_penalty":0.0,
    "repetition_penalty":1.0,
    "top_p": 1.0,
    "top_k":1.0,
    "min_p":0.0

But I got different results, even though most of them are the same (>90% match). Results like:

['P41T33', 'P76T139', 'P76T140', 'P77T142', 'P111T257', 'P111T260', 'P111T261']
['P41T33', 'P76T139', 'P76T140', 'P77T142', 'P111T257', 'P111T260', 'P111T261']
['P41T33', 'P76T139', 'P76T140', 'P77T142', 'P111T257', 'P111T260', 'P111T261']
['P41T33', 'P76T139', 'P77T142', 'P111T257', 'P111T260', 'P111T261']

The instability increases with inputs larger than 8K. When I use 30K inputs, less than 70% of the results are stable.

I'm not sure what's causing this. Is it due to the quantized version, or something else? Could you please help check whether there's an issue with the parameters? If stable output is desired, what adjustments can I make?
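Not part of the original report, but to put a number on the instability described above, here is a small hypothetical helper (all names are my own, not from vLLM) that takes the outputs of repeated greedy (temperature=0) runs and reports what fraction agree with the most common result:

```python
from collections import Counter

def stability(outputs):
    """Return the fraction of runs whose output matches the most
    common output (1.0 means the runs were fully deterministic)."""
    if not outputs:
        return 0.0
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)

# The four runs quoted in the issue, serialized to strings for comparison.
runs = [
    "P41T33,P76T139,P76T140,P77T142,P111T257,P111T260,P111T261",
    "P41T33,P76T139,P76T140,P77T142,P111T257,P111T260,P111T261",
    "P41T33,P76T139,P76T140,P77T142,P111T257,P111T260,P111T261",
    "P41T33,P76T139,P77T142,P111T257,P111T260,P111T261",
]
print(stability(runs))  # 3 of 4 runs agree -> 0.75
```

Collecting `outputs` is left to the caller (e.g. 50 POSTs to /v1/chat/completions with the params above); comparing the serialized completions this way makes the ">90% are the same" observation measurable.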

ShangmingCai commented 2 months ago

I think this is a duplicate of #5404. Also, it has nothing to do with version 0.5.1; this bug has been discussed many times across previous versions.

In your case, I believe quantization is the problem: the hidden_state inputs to the logits_processor are unstable for identical prompts. Maybe you can test with an unquantized version and see whether the bug can still be reproduced.

akai-shuuichi commented 2 months ago

Many of the efficient GPU kernel implementations for GPTQ-quantized models are non-deterministic; for example, both the AutoGPTQ and exllama implementations use atomicAdd. Different orders of floating-point accumulation can lead to result fluctuations due to rounding error. AWQ can be considered as a way to avoid this, and our testing shows that it is feasible.
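To make the accumulation-order point concrete (an illustrative sketch, not the actual GPTQ kernel): floating-point addition is not associative, so a reduction whose partial sums land via atomicAdd in a nondeterministic order can produce slightly different logits for the same input. The classic double-precision example:

```python
# Floating-point addition is not associative: the grouping (and
# therefore the order in which atomicAdd happens to execute)
# changes the rounded result, even with identical operands.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)  # False: same operands, different order
```

In a greedy decode these tiny logit differences are usually invisible, but when two candidate tokens have nearly equal logits the argmax can flip, which matches the occasional divergent result in the lists above.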