Open · samos123 opened this issue 7 months ago
Attaching the request body that was used to reproduce the issue: request-body-sanitized.json
Looks like some issues with outlines's FSM copying and initialization. For this case, using lm-format-enforcer might be better: https://github.com/vllm-project/vllm/pull/3868
There seems to be a bug with:
```python
response_format={
    "type": "json_object"
},
```
```python
import os

from openai import OpenAI


def prompt_json_completion(messages):
    base_url = os.getenv("BASE_URL", "http://localhost:8000/v1")
    api_key = os.getenv("API_KEY", "EMPTY")
    max_tokens = int(os.getenv("MAX_TOKENS", 100))
    client = OpenAI(api_key=api_key, base_url=base_url)
    completion = client.chat.completions.create(
        model=client.models.list().data[0].id,
        # response_format={
        #     "type": "json_object"
        # },
        messages=messages,
        max_tokens=max_tokens,
    )
    # print(completion)
    print(completion.choices[0].message.content)


if __name__ == "__main__":
    user_prompt = "Generate example JSON data of a student in an SIS"
    messages = [{"role": "user", "content": user_prompt}]
    prompt_json_completion(messages=messages)
```
I am getting all whitespace if I uncomment `response_format`
I have the same error with json_object. Did anyone encounter this error with a previous version?
Same problem when setting "response_format" to {"type": "json_object"}: text generation only stops when reaching the max model length. When setting "response_format" to {"type": "text"}, everything goes well. Model: Mistral-7B-Instruct-v0.2-Function-Calling, vllm: 0.4.1
Outlines has made several improvements to its JSON output, and vLLM was previously pinned to outlines==0.0.34.
These issues might have been fixed with the nightly: https://github.com/vllm-project/vllm/blob/abe855d63774c44e69048dfd188f0333db581d4b/requirements-common.txt#L20
I think PR #4109, which was merged into main, fixes this issue. (@br3no)
@maxdebayser I recently tried v0.5.0.post1, and vLLM + outlines still exhibits the issue of producing `\t` and `\n` repeatedly until max_length when specifying {"type": "json_object"}.
Yes, I am also having the same problem with these versions: vllm 0.4.2, vllm-nccl-cu12 2.18.1.0.4.0
Curious what the cause of this issue is, and whether there is any workaround? Will definitely appreciate any pointers.
Same as you. I had to give up on response_format.
Hi guys! This probably isn't an outlines or lm-format-enforcer issue, but an issue with guided decoding. Copypasta from my response to #8020:
Important to note: when you're using json_object or json_schema in response_format, you must instruct the model to produce JSON if you want good results; and you will get the best results if you tell it to produce JSON and give it an example of what you want.
This guidance is from the OpenAI docs, but it applies to vLLM as well.
Basically, if you try to force the model to generate JSON while it is trying to generate natural text, it may produce just whitespace if whitespace tokens are more likely than a { token in the logprobs, since valid JSON can include whitespace before or after the object. This is very likely if you use json_object or json_schema without telling the model that you want JSON output, and what you want that output to look like, either in the system prompt or in a user message.
Hope that helps!
It may be worth adding something about this to the vLLM docs, since it seems to be a point of confusion; I have been discussing this with the Nous team too.
Same issue when using nvidia/NVLM-D-72B and llava-hf/llava-1.5-7b-hf. Not passing response_format seems to work for now, but that does not seem like a proper solution.
Your current environment
🐛 Describe the bug
vLLM gets into a corrupted state and only responds with garbage after receiving a specific response_format = json request. On the first request, vLLM is able to produce a somewhat reasonable response, but once you repeat the same request it only responds with `\n\t\t\t\t...`, where `\t` repeats until max_tokens is reached.

Steps to reproduce:
1. Deploy vLLM v0.4.0-post1 with the OpenAI-compatible API endpoint and Mistral v0.2 Instruct from HF. Model ID: mistralai/Mistral-7B-Instruct-v0.2. The following config was used; this is running on a single L4 GPU.
2. Send the following request twice (see the illustrative sketch below).
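The sanitized request body is attached above; purely as an illustration (this is not the attached request-body-sanitized.json, and the prompt text and max_tokens are placeholders), the repro boils down to sending the same json_object request twice:

```python
# Illustrative repro sketch: issue the identical json_object request twice
# and compare the outputs. Prompt text and max_tokens are placeholders.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

for attempt in range(2):
    completion = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": "Generate example JSON data of a student in an SIS"}],
        max_tokens=100,
    )
    # On the second attempt the content degrades to whitespace (\n\t\t...).
    print(f"attempt {attempt + 1}: {completion.choices[0].message.content!r}")
```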
Current results:
Either `\n` repeats until max_tokens is hit, or the response is `\n\t\t\t\t...` where `\t` repeats until max_tokens is hit. Occasionally vLLM gets into a bad state where all requests return errors as well, but I can't consistently get into that state. The following errors were seen when that happens:
This issue was originally reported in Lingo: https://github.com/substratusai/lingo/issues/96 but it seems to be an issue with vLLM itself.
Expected results
vLLM should not get into a broken state where subsequent requests fail to return any useful results when response_format = json is used.