vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: outputs were empty #8773

Open wangwensuo opened 2 months ago

wangwensuo commented 2 months ago

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I fine-tuned a model based on llama3-8b and 70b. When testing the model, I used vLLM for inference. Out of 700 requests, 10 outputs were empty. What could be the reason?

import requests
import os
import openai
import logging
import argparse
import re
from openai import OpenAI
from utils.contant import *

class ChipexpertGenerateServer:
    """First define the initialization, then build the conversation context."""

    def __init__(self, muti_turn=5):
        # number of multi-turn rounds to keep
        self.max_tokens = 1500
        self.temperature = 0.2
        self.top_p = 0.9
        self.n = 1
        self.muti_turn = muti_turn
        openai_api_base = 'http://10.1.12.91:8111' + '/v1'
        openai_api_key = "EMPTY"
        # determine the model name served by the endpoint
        self.client = OpenAI(
            api_key=openai_api_key,
            base_url=openai_api_base,
        )
        response = requests.get(openai_api_base + '/models')
        self.model_name = response.json()['data'][0]['id']
        print(self.model_name)

        self.messages = [
            {"role": "system", "content": "A chat between a curious user and an artificial intelligence assistant.The assistant gives helpful, detailed, and polite answers to the user's questions."}
        ]

    def predict(self, query):
        # trim the oldest turns once the history exceeds the multi-turn limit
        if len(self.messages) > 2 * self.muti_turn:
            self.messages.pop(1)
            self.messages.pop(2)

        self.messages.append({"role": "user", "content": query})

        chat_response = self.client.chat.completions.create(
            model=self.model_name,
            messages=self.messages,
            max_tokens=self.max_tokens,
            temperature=0.2,
            top_p=0.9,
            n=self.n,
            stream=True,
            # extra_body={
            #     "top_k": 5,
            #     "top_p": 0.9,
            #     "repetition_penalty": 1.1
            # }
        )
        assistant = ''
        for chunk in chat_response:
            chunk_message = chunk.choices[0].delta.content
            if chunk_message:
                assistant += chunk_message
                yield chunk_message
        self.messages.append({"role": "assistant", "content": assistant})

if name == "main":

# 
b = ChipexpertGenerateServer()

query = 'A few basic parameters of the antenna'
# # query = 'How is the overall gain of multiple amplifiers connected in cascade calculated, and why does it result in a significant increase in gain compared to individual stages?'
# query = "Assume µnCox = 100 µA/V2 and supply current is 5mA, what should be the aspect ratio so that a 50 Ω load can be used to give a voltage gain of .25 in C.D. configuration? A 32.6 B 50 C 40 D 41"
for item in b.predict(query):
    print(item)

print(b.messages)

# for item in b.predict('please explan it'):
#     print(item)

# print(b.messages)

output: [{'role': 'system', 'content': "A chat between a curious user and an artificial intelligence assistant.The assistant gives helpful, detailed, and polite answers to the user's questions."}, {'role': 'user', 'content': 'A few basic parameters of the antenna'}, {'role': 'assistant', 'content': '\u200b'}]
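For context, U+200B is a zero-width space, so the assistant message above prints as visually empty. A minimal sketch (not from the original report; the `responses` list is illustrative) for flagging such effectively-empty outputs among the 700 test requests:

    # Treat a response as "effectively empty" if nothing remains after
    # removing zero-width characters and ordinary whitespace.
    ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

    def is_effectively_empty(text: str) -> bool:
        cleaned = "".join(ch for ch in text if ch not in ZERO_WIDTH)
        return cleaned.strip() == ""

    responses = ["\u200b", "A dipole antenna has ...", "   "]  # illustrative data
    empty = [i for i, r in enumerate(responses) if is_effectively_empty(r)]
    print(f"{len(empty)} of {len(responses)} responses are effectively empty: {empty}")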


hmellor commented 2 months ago

You are yielding out of `predict` with the first token in the response (see the snippet below).

This is an issue with the way you are using the OpenAI Python client, not vLLM:

    assistant = ''
    for chunk in chat_response:
        chunk_message = chunk.choices[0].delta.content
        if chunk_message:
            assistant += chunk_message
            yield chunk_message
    self.messages.append({"role": "assistant", "content": assistant})
wangwensuo commented 2 months ago
    chat_response = self.client.chat.completions.create(
        model=self.model_name,
        messages=self.messages,
        max_tokens=200,
        temperature=0.2,
        top_p=0.9,
        n=self.n,
        stream=False,
        extra_body={
            "ignore_eos" : True
        }
    )
    response_text = chat_response.choices[0].message.content
    self.messages.append({"role": "assistant", "content": response_text})
    return response_text
    # print(response_text)
    # assistant = ''
    # for chunk in chat_response:
    #     chunk_message = chunk.choices[0].delta.content
    #     if chunk_message:
    #         assistant += chunk_message
    #         yield chunk_message
    # self.messages.append({"role": "assistant", "content": assistant})

if name == "main":

# 
b = ChipexpertGenerateServer()

query = 'A few basic parameters of the antenna'
print(b.predict(query))
# # query = 'How is the overall gain of multiple amplifiers connected in cascade calculated, and why does it result in a significant increase in gain compared to individual stages?'
# query = "Assume µnCox = 100 µA/V2 and supply current is 5mA, what should be the aspect ratio so that a 50 Ω load can be used to give a voltage gain of .25 in C.D. configuration? A 32.6 B 50 C 40 D 41"
# for item in b.predict(query):
#     print(item)

print(b.messages)

This is not streaming, but the problem remains: the output is \u200b.

[{'role': 'system', 'content': "A chat between a curious user and an artificial intelligence assistant.The assistant gives helpful, detailed, and polite answers to the user's questions."}, {'role': 'user', 'content': 'A few basic parameters of the antenna'}, {'role': 'assistant', 'content': "\u200b<|eot_id|><|eot_id|>\nassistant\nSure, here are a few basic parameters of an antenna:\n\n1. Gain: The ratio of the power radiated by the antenna in a particular direction to the power radiated by a hypothetical isotropic antenna in the same direction. Gain is usually expressed in decibels (dB).\n2. Directivity: The ratio of the power radiated by the antenna in a particular direction to the power radiated by the antenna averaged over all directions. Directivity is also expressed in decibels (dB).\n3. Radiation pattern: A graphical representation of the antenna's radiation properties as a function of direction. The radiation pattern shows the relative strength of the radiated signal in different directions.\n4. Bandwidth: The range of frequencies over which the antenna operates effectively. Bandwidth is usually expressed as a percentage of the center frequency.\n5. Impedance: The ratio of the voltage to the current at the antenna's input terminals. Impedance is usually expressed in"}]

INFO 09-24 13:08:14 async_llm_engine.py:173] Added request chat-85ee29268e154638ad4cd631b26657c1.

INFO 09-24 13:08:14 logger.py:36] Received request chat-85ee29268e154638ad4cd631b26657c1: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nA chat between a curious user and an artificial intelligence assistant.The assistant gives helpful, detailed, and polite answers to the user's questions.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\nA few basic parameters of the antenna<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n", params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.2, top_p=0.9, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 198, 32, 6369, 1990, 264, 22999, 1217, 323, 459, 21075, 11478, 18328, 11829, 18328, 6835, 11190, 11, 11944, 11, 323, 48887, 11503, 311, 279, 1217, 596, 4860, 13, 128009, 198, 128006, 882, 128007, 198, 32, 2478, 6913, 5137, 315, 279, 41032, 128009, 198, 128006, 78191, 128007, 198], lora_request: None, prompt_adapter_request: None.

INFO: 10.10.10.166:58088 - "GET /v1/models HTTP/1.1" 200 OK
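One way to dig into why the first generated token is U+200B is to ask the server for token log-probabilities. This is a debugging sketch rather than something from the thread; it assumes the same endpoint as above and that the vLLM version in use supports `logprobs`/`top_logprobs` on chat completions:

    from openai import OpenAI

    # Endpoint and prompt taken from the report above.
    client = OpenAI(api_key="EMPTY", base_url="http://10.1.12.91:8111/v1")
    model_name = client.models.list().data[0].id

    resp = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "A chat between a curious user and an artificial intelligence assistant."},
            {"role": "user", "content": "A few basic parameters of the antenna"},
        ],
        max_tokens=20,
        temperature=0.2,
        logprobs=True,      # log-probability of each sampled token
        top_logprobs=5,     # plus the 5 most likely alternatives per position
    )

    # Print the first few generated tokens and their competitors; this shows
    # whether U+200B is sampled narrowly ahead of a normal first token.
    for pos in resp.choices[0].logprobs.content[:5]:
        alts = {alt.token: round(alt.logprob, 2) for alt in pos.top_logprobs}
        print(repr(pos.token), "->", alts)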

hmellor commented 2 months ago

I fine-tuned a model based on llama3-8b and 70b. When testing the model, I used vLLM for inference. Out of 700 requests, 10 outputs were empty.

Does this happen when testing the model using something like HF pipelines?

Since you have trained the model yourself, it's possible that sometimes the model is actually doing this.

hmellor commented 2 months ago

For reference, U+200B looks like:

[image: screenshot of how the character renders]

This appears to be a loading wheel.

Is it possible that the model was fine-tuned on data that wasn't fully loaded when it was saved/scraped? Hence the loading wheel?
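One way to test that hypothesis is to scan the fine-tuning data for zero-width characters before training. A minimal sketch, assuming a JSONL dataset of chat-style records; the path and field names are placeholders:

    import json

    ZERO_WIDTH = ("\u200b", "\u200c", "\u200d", "\ufeff")

    hits = 0
    with open("train.jsonl", encoding="utf-8") as f:          # hypothetical path
        for line_no, line in enumerate(f, 1):
            record = json.loads(line)
            for message in record.get("messages", []):        # hypothetical schema
                content = message.get("content", "")
                if any(ch in content for ch in ZERO_WIDTH):
                    hits += 1
                    print(f"line {line_no}: zero-width character in {message.get('role')} message")
    print(f"{hits} messages contain zero-width characters")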

wangwensuo commented 2 months ago

I don't quite understand what you mean. Are you saying that similar problems might exist in how our training set was saved?

hmellor commented 2 months ago

Yes, I mean that your model might have been fine-tuned to generate a loading wheel sometimes because of an issue in the data pipeline during training or data saving.
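If the behaviour does turn out to be baked into the fine-tuned weights, one possible serving-side stop-gap is to penalise the U+200B token at sampling time via the OpenAI `logit_bias` field, which vLLM's OpenAI-compatible server accepts. A hedged sketch; the tokenizer path is an assumption, and note that U+200B may map to more than one token:

    from openai import OpenAI
    from transformers import AutoTokenizer

    # Look up which token id(s) encode the zero-width space (assumed local model path).
    tokenizer = AutoTokenizer.from_pretrained("/path/to/finetuned-llama3")
    zwsp_ids = tokenizer.encode("\u200b", add_special_tokens=False)
    print("U+200B token ids:", zwsp_ids)

    client = OpenAI(api_key="EMPTY", base_url="http://10.1.12.91:8111/v1")
    model_name = client.models.list().data[0].id

    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": "A few basic parameters of the antenna"}],
        max_tokens=200,
        temperature=0.2,
        # Strongly discourage the zero-width-space token(s).
        logit_bias={str(tid): -100 for tid in zwsp_ids},
    )
    print(resp.choices[0].message.content)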

shudct commented 1 month ago

I'm running into this problem too. When I use GLM4-9B-Chat without fine-tuning, the same thing happens. How can I solve it? Thanks.