lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

When I call the '/v1/chat/completions' API of the API server, it returns incomplete results, but vLLM's API returns complete results #3074

Open coreyho opened 5 months ago

coreyho commented 5 months ago

When I call the '/v1/chat/completions' API of the API server to access the vllm_worker, it returns incomplete results, but vLLM's own API returns complete results, and the model_worker server also returns complete results.

Details

When calling the '/v1/chat/completions' API of the API server, it returns an incomplete result: "Hello! How can I assis"

curl --location --request POST 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data-raw '{
    "model": "Qwen1.5-72b-chat",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ]
}'
{
    "id": "chatcmpl-BpWqkwhWbXyXbDYM2kqgdp",
    "object": "chat.completion",
    "created": 1708576975,
    "model": "Qwen1.5-72b-chat",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! How can I assist "
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 20,
        "total_tokens": 30,
        "completion_tokens": 10
    }
}
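
For reference, the same request can be reproduced with the openai Python client. This is a minimal sketch, assuming the openai>=1.0 package and FastChat's convention of accepting any API key (e.g. "EMPTY"); it shows the truncated content the server returns.

from openai import OpenAI

# FastChat's OpenAI-compatible server, as in the curl call above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen1.5-72b-chat",
    messages=[{"role": "user", "content": "hello"}],
)
# With the bug described here, this prints the truncated text,
# e.g. "Hello! How can I assist "
print(resp.choices[0].message.content)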

When calling the worker_generate_stream API directly, the streamed chunks arrive out of order, and the complete output is not the last one.

The last chunk is an incomplete result: "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assis".

The third-to-last chunk is the complete result: "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?\n",

curl --location --request POST 'http://70.182.56.16:21004/worker_generate_stream' \
--header 'Content-Type: application/json' \
--data-raw '{
    "prompt": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n"
}'
    {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 1,
            "total_tokens": 20
        },
        "cumulative_logprob": [-0.5442918539047241],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello!",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 2,
            "total_tokens": 21
        },
        "cumulative_logprob": [-0.6264679655432701],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 3,
            "total_tokens": 22
        },
        "cumulative_logprob": [-0.63395517738536],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 4,
            "total_tokens": 23
        },
        "cumulative_logprob": [-0.7231398862786591],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 5,
            "total_tokens": 24
        },
        "cumulative_logprob": [-0.7231670656274218],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 6,
            "total_tokens": 25
        },
        "cumulative_logprob": [-1.6991723814080615],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 7,
            "total_tokens": 26
        },
        "cumulative_logprob": [-1.6991882361180615],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 8,
            "total_tokens": 27
        },
        "cumulative_logprob": [-1.710257318773074],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 9,
            "total_tokens": 28
        },
        "cumulative_logprob": [-1.7102782993879373],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 10,
            "total_tokens": 29
        },
        "cumulative_logprob": [-1.8846533119049127],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?\n",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 11,
            "total_tokens": 30
        },
        "cumulative_logprob": [-1.8846546232061883],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assis",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 12,
            "total_tokens": 31
        },
        "cumulative_logprob": [-2.045250159438183],
        "finish_reason": null
    }, {
        "text": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\nHello! How can I assis",
        "error_code": 0,
        "usage": {
            "prompt_tokens": 19,
            "completion_tokens": 12,
            "total_tokens": 31
        },
        "cumulative_logprob": [-2.045250159438183],
        "finish_reason": "stop"
    }
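
The out-of-order arrival matters because each chunk carries the cumulative text so far, so a client that keeps only the final chunk gets a truncated answer. Below is a sketch of consuming /worker_generate_stream with Python requests, assuming FastChat's convention of null-byte-delimited JSON chunks; as a workaround it keeps the longest text seen instead of trusting the last chunk.

import json

import requests

payload = {
    "prompt": "<|im_start|>system\nyou are a helpful assistant<|im_end|>\n"
              "<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n"
}

best = ""
with requests.post(
    "http://70.182.56.16:21004/worker_generate_stream",
    json=payload,
    stream=True,
) as resp:
    # FastChat workers separate JSON chunks with a null byte.
    for chunk in resp.iter_lines(delimiter=b"\0"):
        if not chunk:
            continue
        data = json.loads(chunk.decode())
        # Each chunk holds the cumulative text; since chunks can arrive
        # out of order, track the longest one rather than the last one.
        if len(data["text"]) > len(best):
            best = data["text"]

print(best)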

When using vLLM's OpenAI-compatible API server, it returns complete results. The server was started with:

python3.9 -m vllm.entrypoints.openai.api_server --model /Qwen/Qwen1.5-72B-Chat --port 21002 --gpu-memory-utilization 0.98 --tensor-parallel-size 4

{
    "id": "cmpl-a10de9e3ba27482296eecc572b5623d6",
    "object": "chat.completion",
    "created": 443291,
    "model": "/Qwen/Qwen1.5-72B-Chat",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! How can I assist you today?\n"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 9,
        "total_tokens": 21,
        "completion_tokens": 12
    }
}
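
For comparison, the same client sketch pointed at the vLLM endpoint returns the full answer. This assumes the server runs locally; the port and model path come from the launch command above.

from openai import OpenAI

# vLLM's own OpenAI-compatible server (port 21002 per the command above).
client = OpenAI(base_url="http://localhost:21002/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/Qwen/Qwen1.5-72B-Chat",
    messages=[{"role": "user", "content": "hello"}],
)
# Reported result: "Hello! How can I assist you today?\n"
print(resp.choices[0].message.content)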

When using model_worker, it also returns complete results.

MCZ777 commented 5 months ago

Same for me: after deploying with vllm_worker, the response is sometimes incomplete, but running the same request a few times eventually produces a complete response.

zhb1991nm commented 5 months ago

Same situation here, but inference directly through vLLM produces the complete output.

uuuidada commented 5 months ago

Same for me, but streaming mode is fine.
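
For anyone hitting this, the streaming path can serve as a workaround. A minimal sketch, assuming the openai>=1.0 client against the FastChat server on port 8000 as above; with stream=True the deltas reportedly arrive intact.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen1.5-72b-chat",
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
)
# Print each delta as it arrives; streaming is reported to return
# the full answer even when the non-streaming path truncates it.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()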

kingdomad commented 5 months ago

Same situation; the answers are incomplete.

uuuidada commented 5 months ago

It does work!

https://github.com/vllm-project/vllm/pull/3016