vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: High throughput is not achieved in the decoding stage when using JSON-format output #7778

Open Liucd0520 opened 3 weeks ago

Liucd0520 commented 3 weeks ago

πŸš€ The feature, motivation and pitch

I launched an LLM service with vLLM, and I use the AsyncOpenAI client for high-throughput output, like this:

```python
import json

from openai import AsyncOpenAI


async def async_llm_infer_sampling(prompt, api_key, base_url, model_name, response_json):
    client = AsyncOpenAI(api_key=api_key, base_url=base_url)
    try:
        chat_response = await client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "user", "content": prompt},
            ],
            temperature=0.5,
        )
        return chat_response.choices[0].message.content
    except Exception:
        # Fallback payload; "ζ— ζ³•εˆ†ζž" means "unable to analyze".
        return json.dumps({"analysis": "ζ— ζ³•εˆ†ζž", "score": -1}, ensure_ascii=False)
```
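For context, the throughput comes from fanning this coroutine out over many prompts at once; a minimal sketch of the driver (assuming `asyncio.gather` and placeholder prompts/parameters, not the exact production code) looks like this:

```python
import asyncio

# Sketch of the concurrent driver (assumed): submit many requests at once so
# the server can batch them, then collect the results in order.
async def run_batch(prompts, api_key, base_url, model_name, response_json):
    tasks = [
        async_llm_infer_sampling(p, api_key, base_url, model_name, response_json)
        for p in prompts
    ]
    return await asyncio.gather(*tasks)

# Example usage with placeholder values:
# results = asyncio.run(
#     run_batch(prompts, "EMPTY", "http://localhost:8000/v1", "my-model", schema)
# )
```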

With this, I successfully speed up inference: both prompt throughput and generation throughput are high.

However, when I add guided_json like this:

```python
chat_response = await client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={
        "guided_json": response_json,
        "response_format": {"type": "json_object"},
    },
    temperature=0.5,
)
```

The prompt throughput is still high, but the generation throughput is as low as if only a single request had been sent.
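For reference, `response_json` here is a JSON schema; a minimal example of the shape I mean (field names assumed from the fallback payload above, not my exact schema) would be:

```python
# Illustrative schema only (assumed shape, matching the fallback payload above).
response_json = {
    "type": "object",
    "properties": {
        "analysis": {"type": "string"},
        "score": {"type": "integer"},
    },
    "required": ["analysis", "score"],
}
```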

I guess this problem is caused by outlines. How can it be solved?
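One way to test this guess, assuming the running vLLM version supports the per-request `guided_decoding_backend` override (an assumption, not verified against this exact version), would be to switch the guided-decoding backend for the same request and compare throughput:

```python
# Sketch: same call as above, but requesting a different guided-decoding
# backend. "guided_decoding_backend" and the "lm-format-enforcer" value are
# assumed to be supported by the server version in use.
chat_response = await client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": prompt}],
    extra_body={
        "guided_json": response_json,
        "guided_decoding_backend": "lm-format-enforcer",
    },
    temperature=0.5,
)
```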

BTW, asynchronous inference with JSON mode is very important to me; I need help.

Alternatives

No response

Additional context

No response

AlphaINF commented 2 weeks ago

same problem! @simon-mo