Open wangwensuo opened 1 month ago
You are yielding out of predict with the first token in the response, which means this line:
self.messages.append({"role": "assistant", "content": assistant})
is never called. This is an issue with the way you are using the OpenAI Python client, not vLLM.
assistant = ''
for chunk in chat_response:
    chunk_message = chunk.choices[0].delta.content
    if chunk_message:
        assistant += chunk_message
        yield chunk_message
self.messages.append({"role": "assistant", "content": assistant})
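For the append after the loop to run at all, the caller has to drive the generator to completion. A minimal sketch of the consumer side, assuming predict is the streaming generator above (variable names are only illustrative):

server = ChipexpertGenerateServer()

# Iterate the generator to the end so that the line after the for-loop
# (the assistant-message append) actually executes.
for token in server.predict("A few basic parameters of the antenna"):
    print(token, end="", flush=True)
print()

# Only after full consumption does self.messages contain the assistant turn.
print(server.messages[-1]["role"])  # expected: "assistant"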
chat_response = self.client.chat.completions.create(
    model=self.model_name,
    messages=self.messages,
    max_tokens=200,
    temperature=0.2,
    top_p=0.9,
    n=self.n,
    stream=False,
    extra_body={
        "ignore_eos": True
    }
)
response_text = chat_response.choices[0].message.content
self.messages.append({"role": "assistant", "content": response_text})
return response_text
# print(response_text)
# assistant = ''
# for chunk in chat_response:
#     chunk_message = chunk.choices[0].delta.content
#     if chunk_message:
#         assistant += chunk_message
#         yield chunk_message
# self.messages.append({"role": "assistant", "content": assistant})
if __name__ == "__main__":
    b = ChipexpertGenerateServer()
    query = 'A few basic parameters of the antenna'
    print(b.predict(query))
    # query = 'How is the overall gain of multiple amplifiers connected in cascade calculated, and why does it result in a significant increase in gain compared to individual stages?'
    # query = "Assume µnCox = 100 µA/V2 and supply current is 5mA, what should be the aspect ratio so that a 50 Ω load can be used to give a voltage gain of .25 in C.D. configuration? A 32.6 B 50 C 40 D 41"
    # for item in b.predict(query):
    #     print(item)
    print(b.messages)
This is not streaming, but the problem remains: the output is \u200b.
[{'role': 'system', 'content': "A chat between a curious user and an artificial intelligence assistant.The assistant gives helpful, detailed, and polite answers to the user's questions."}, {'role': 'user', 'content': 'A few basic parameters of the antenna'}, {'role': 'assistant', 'content': "\u200b<|eot_id|><|eot_id|>\nassistant\nSure, here are a few basic parameters of an antenna:\n\n1. Gain: The ratio of the power radiated by the antenna in a particular direction to the power radiated by a hypothetical isotropic antenna in the same direction. Gain is usually expressed in decibels (dB).\n2. Directivity: The ratio of the power radiated by the antenna in a particular direction to the power radiated by the antenna averaged over all directions. Directivity is also expressed in decibels (dB).\n3. Radiation pattern: A graphical representation of the antenna's radiation properties as a function of direction. The radiation pattern shows the relative strength of the radiated signal in different directions.\n4. Bandwidth: The range of frequencies over which the antenna operates effectively. Bandwidth is usually expressed as a percentage of the center frequency.\n5. Impedance: The ratio of the voltage to the current at the antenna's input terminals. Impedance is usually expressed in"}]
INFO 09-24 13:08:14 async_llm_engine.py:173] Added request chat-85ee29268e154638ad4cd631b26657c1.
INFO 09-24 13:08:14 logger.py:36] Received request chat-85ee29268e154638ad4cd631b26657c1: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nA chat between a curious user and an artificial intelligence assistant.The assistant gives helpful, detailed, and polite answers to the user's questions.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\nA few basic parameters of the antenna<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n", params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.2, top_p=0.9, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=200, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 198, 32, 6369, 1990, 264, 22999, 1217, 323, 459, 21075, 11478, 18328, 11829, 18328, 6835, 11190, 11, 11944, 11, 323, 48887, 11503, 311, 279, 1217, 596, 4860, 13, 128009, 198, 128006, 882, 128007, 198, 32, 2478, 6913, 5137, 315, 279, 41032, 128009, 198, 128006, 78191, 128007, 198], lora_request: None, prompt_adapter_request: None.
INFO: 10.10.10.166:58088 - "GET /v1/models HTTP/1.1" 200 OK
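Purely as an illustrative check (not from the original report), you can confirm the returned content is a zero-width space rather than an empty string by printing the Unicode name of each character:

import unicodedata

response_text = "\u200b"  # the content returned for the problematic request

for ch in response_text:
    print(hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"))
# prints: 0x200b ZERO WIDTH SPACE

# Optional guard for downstream code: treat zero-width-only responses as empty.
cleaned = response_text.replace("\u200b", "").strip()
print("empty after cleaning:", cleaned == "")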
I used llama3-8b and 70b to fine-tune a model. When testing the model, I used vllm for inference. Among 700 requests, 10 outputs were empty.
Does this happen when testing the model using something like HF pipelines?
Since you have trained the model yourself, it's possible that sometimes the model is actually doing this.
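For comparison, a rough sketch of such a check with the transformers text-generation pipeline; the checkpoint path and sampling settings below are assumptions, not taken from this thread:

from transformers import pipeline

# Hypothetical path to the fine-tuned checkpoint; replace with your own.
pipe = pipeline("text-generation", model="/path/to/finetuned-llama3-8b", device_map="auto")

prompt = "A few basic parameters of the antenna"
out = pipe(prompt, max_new_tokens=200, do_sample=True, temperature=0.2, top_p=0.9)
# repr() makes invisible characters such as \u200b visible in the printout.
print(repr(out[0]["generated_text"]))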
For reference, U+200B looks like this: [image]
This appears to be a loading wheel.
Is it possible that the model was fine-tuned on data that wasn't fully loaded when it was saved/scraped? Hence the loading wheel?
I don't quite understand what you mean. Do you mean that there are similar problems in how our training set was saved?
Yes, I mean that your model might be fine-tuned to generate a loading wheel sometimes because of an issue in the data pipeline during {training/data saving}.
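One way to test that hypothesis is to scan the fine-tuning data for the character before training. This is only a sketch; the file name and JSONL layout are assumptions:

import json

# Hypothetical JSONL training file with {"prompt": ..., "response": ...} records.
suspect = 0
with open("train.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, 1):
        record = json.loads(line)
        text = (record.get("prompt") or "") + (record.get("response") or "")
        if "\u200b" in text:
            suspect += 1
            print(f"zero-width space in record {line_no}")
print(f"{suspect} records contain U+200B")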
I've run into this problem too. When I use GLM4-9B-Chat without fine-tuning, the same thing still happens. How can I solve it? Thanks.
Your current environment
How would you like to use vllm
I used llama3-8b and 70b to fine-tune a model. When testing the model, I used vllm for inference. Among 700 requests, 10 outputs were empty. What is the reason?
import requests
import os
import openai
import logging
from openai import OpenAI
import argparse
import re
from utils.contant import *
class ChipexpertGenerateServer:
    """
    First define the initialization, then build the context.
    """
    def __init__(self, muti_turn=5):
if __name__ == "__main__":
output: [{'role': 'system', 'content': "A chat between a curious user and an artificial intelligence assistant.The assistant gives helpful, detailed, and polite answers to the user's questions."}, {'role': 'user', 'content': 'A few basic parameters of the antenna'}, {'role': 'assistant', 'content': '\u200b'}]
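To reproduce the "10 empty out of 700" figure systematically, a tally loop along these lines could help; the endpoint, model name, and query list are placeholders rather than details from the issue:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local vLLM server

queries = ["A few basic parameters of the antenna"] * 700  # placeholder batch

empty = 0
for q in queries:
    resp = client.chat.completions.create(
        model="finetuned-llama3-8b",  # hypothetical served model name
        messages=[{"role": "user", "content": q}],
        max_tokens=200,
        temperature=0.2,
        top_p=0.9,
    )
    text = resp.choices[0].message.content or ""
    if not text.replace("\u200b", "").strip():
        empty += 1
print(f"{empty} of {len(queries)} responses were empty or zero-width only")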