🚀 The feature, motivation and pitch

Currently, there is a `parallel_tool_calls` field in the `ChatCompletionRequest` Pydantic class. However, this field exists only for compatibility with OpenAI's API. In other words, it is not used at all, according to both the documentation and the code:
```python
# NOTE this will be ignored by VLLM -- the model determines the behavior
parallel_tool_calls: Optional[bool] = False
```
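For context, this is roughly how a client exercises the field today; the flag is accepted but has no effect. A minimal sketch using the official OpenAI Python client (the base_url, API key, model name, and tool definition are placeholders for illustration):

```python
from openai import OpenAI

# Point the OpenAI client at a local vLLM server (placeholder URL/key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Weather in Paris, Rome, and Madrid?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    parallel_tool_calls=False,  # accepted by vLLM, but currently ignored
)

# Despite parallel_tool_calls=False, the model may still emit several
# tool calls in one assistant message (three, in my case).
print(len(response.choices[0].message.tool_calls or []))
```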
Would it be possible to implement the logic behind this field for the different model families? For instance, with llama3.1-8b-instruct, tool calling works, but the model ends up returning three tool calls at once instead of one at a time. This breaks compatibility with frameworks like LangGraph.
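To make the ask concrete, the False case could be as simple as a post-processing step on the parsed output. This is purely an illustrative sketch, not vLLM's actual code, and the helper name is invented:

```python
from typing import Any, Optional

def enforce_parallel_tool_calls(
    tool_calls: Optional[list[Any]],
    parallel_tool_calls: Optional[bool],
) -> Optional[list[Any]]:
    """Hypothetical server-side step: if the client requested
    parallel_tool_calls=False but the model emitted several tool
    calls, keep only the first and drop the rest."""
    if not tool_calls or parallel_tool_calls:
        return tool_calls
    return tool_calls[:1]
```

Truncating is the crudest option; constraining generation itself per model family would be better, but any enforcement of the flag would restore compatibility.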
Here's an example request and response:
Request
Response
Even if I try to send the three tool calls back to the model in a subsequent call, it fails with:

```
BadRequestError: Error code: 400 - {'object': 'error', 'message': 'This model only supports single tool-calls at once!', 'type': 'BadRequestError', 'param': None, 'code': 400}
```

which comes from the llama3_json chat template.
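In the meantime, the only client-side workaround I can think of is to split an assistant message that carries several tool calls into single-tool-call turns before sending tool results back, so the template never sees more than one call at once. A rough sketch (message dicts follow the OpenAI chat format; the helper, and whether this satisfies the template, are my own assumptions):

```python
def split_into_single_tool_call_turns(
    assistant_msg: dict, results: dict[str, str]
) -> list[dict]:
    """Rewrite one assistant message holding N tool calls into N
    (assistant, tool) message pairs, each with exactly one tool call.
    `results` maps tool_call_id -> tool output string."""
    turns = []
    for call in assistant_msg.get("tool_calls", []):
        turns.append({"role": "assistant", "content": None, "tool_calls": [call]})
        turns.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": results[call["id"]],
        })
    return turns
```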
Thank you, team!
Alternatives
No response
Additional context
No response
Before submitting a new issue...

[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.