🚀 The feature, motivation and pitch
I launched an LLM service with vLLM, and I use the AsyncOpenAI client for high-throughput output, like this:
import json

from openai import AsyncOpenAI


async def async_llm_infer_sampling(prompt, api_key, base_url, model_name, response_json):
    client = AsyncOpenAI(api_key=api_key, base_url=base_url)
    try:
        chat_response = await client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "user", "content": prompt},
            ],
            temperature=0.5,
        )
        return chat_response.choices[0].message.content
    except Exception:
        # Fallback when the request fails ("无法分析" = "unable to analyze")
        return json.dumps({"analysis": "无法分析", "score": -1}, ensure_ascii=False)
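I fan the prompts out concurrently, roughly like this (a minimal sketch; the run_batch wrapper and the prompt list are just illustrative of how I drive the helper above):

import asyncio


async def run_batch(prompts, api_key, base_url, model_name, response_json):
    # Dispatch all prompts at once so vLLM can batch them server-side
    tasks = [
        async_llm_infer_sampling(p, api_key, base_url, model_name, response_json)
        for p in prompts
    ]
    return await asyncio.gather(*tasks)

# results = asyncio.run(run_batch(prompts, api_key, base_url, model_name, response_json))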
This successfully speeds up inference, with both high prompt throughput and high generation throughput.
However, when I add guided_json like this:
chat_response = await client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={
        "guided_json": response_json,
        "response_format": {"type": "json_object"},
    },
    temperature=0.5,
)
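For reference, response_json is a JSON schema along these lines (the exact fields below are only an illustration matching the fallback output above; my real schema has the same shape):

# Illustrative JSON schema passed as guided_json
response_json = {
    "type": "object",
    "properties": {
        "analysis": {"type": "string"},
        "score": {"type": "integer"},
    },
    "required": ["analysis", "score"],
}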
With this change, the prompt throughput is still high, but the generation throughput drops to roughly what I see when only a single request is sent. I suspect this is caused by outlines. How can I solve it?
BTW, asynchronous inference with JSON mode is very important to me, so I would appreciate any help.
Alternatives
No response
Additional context
No response