vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: High throughput is not achieved in the decoding stage when using JSON-format output #7778

Open Liucd0520 opened 3 weeks ago

Liucd0520 commented 3 weeks ago

πŸš€ The feature, motivation and pitch

I launched an LLM service with vLLM, and I use the AsyncOpenAI client for high-throughput output, like this:

```python
import json

from openai import AsyncOpenAI


async def async_llm_infer_sampling(prompt, api_key, base_url, model_name, response_json):
    client = AsyncOpenAI(api_key=api_key, base_url=base_url)
    try:
        chat_response = await client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "user", "content": prompt},
            ],
            temperature=0.5,
        )
        return chat_response.choices[0].message.content
    except Exception:
        # Fallback payload; "ζ— ζ³•εˆ†ζž" means "unable to analyze".
        return json.dumps({"analysis": "ζ— ζ³•εˆ†ζž", "score": -1}, ensure_ascii=False)
```
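For context, the throughput comes from fanning this coroutine out over many prompts at once; a minimal sketch of the driver (assuming `asyncio.gather` and placeholder prompts/parameters, not the exact production code) looks like this:

```python
import asyncio

# Sketch of the concurrent driver (assumed): submit many requests at once so
# the server can batch them, then collect the results in order.
async def run_batch(prompts, api_key, base_url, model_name, response_json):
    tasks = [
        async_llm_infer_sampling(p, api_key, base_url, model_name, response_json)
        for p in prompts
    ]
    return await asyncio.gather(*tasks)

# Example usage with placeholder values:
# results = asyncio.run(
#     run_batch(prompts, "EMPTY", "http://localhost:8000/v1", "my-model", schema)
# )
```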

With this, I successfully speed up inference: both prompt throughput and generation throughput are high.

However, when I add guided_json like this:

```python
chat_response = await client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={
        "guided_json": response_json,
        "response_format": {"type": "json_object"},
    },
    temperature=0.5,
)
```

The prompt throughput is still high, but the generation throughput is as low as if only a single request had been sent.
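For reference, `response_json` here is a JSON schema; a minimal example of the shape I mean (field names assumed from the fallback payload above, not my exact schema) would be:

```python
# Illustrative schema only (assumed shape, matching the fallback payload above).
response_json = {
    "type": "object",
    "properties": {
        "analysis": {"type": "string"},
        "score": {"type": "integer"},
    },
    "required": ["analysis", "score"],
}
```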

I guess this problem is caused by outlines. How can it be solved?
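One way to test this guess, assuming the running vLLM version supports the per-request `guided_decoding_backend` override (an assumption, not verified against this exact version), would be to switch the guided-decoding backend for the same request and compare throughput:

```python
# Sketch: same call as above, but requesting a different guided-decoding
# backend. "guided_decoding_backend" and the "lm-format-enforcer" value are
# assumed to be supported by the server version in use.
chat_response = await client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": prompt}],
    extra_body={
        "guided_json": response_json,
        "guided_decoding_backend": "lm-format-enforcer",
    },
    temperature=0.5,
)
```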

BTW, asynchronous inference with JSON mode is very important to me; I need help.

Alternatives

No response

Additional context

No response

AlphaINF commented 2 weeks ago

same problem! @simon-mo