vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Consider parallel_tool_calls parameter at the API level #9451

Open lucasalvarezlacasa opened 2 weeks ago

lucasalvarezlacasa commented 2 weeks ago

šŸš€ The feature, motivation and pitch

Currently, there is a parallel_tool_calls field in the ChatCompletionRequest pydantic class. However, this field exists only for compatibility with OpenAI's API.

In other words, according to both the documentation and the code, it is not used at all:

# NOTE this will be ignored by VLLM -- the model determines the behavior
parallel_tool_calls: Optional[bool] = False

Would it be possible to consider implementing the logic behind this field for different model families? For instance, in the case of llama3.1-8b-instruct, tool calling works, but the model ends up returning three tool calls at once instead of one at a time. This breaks compatibility with frameworks like LangGraph.
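
For reference, one conceivable server-side behavior would be to truncate the parsed tool calls to a single entry whenever the flag is false. This is purely a hypothetical sketch of the requested feature, not existing vLLM code; the function name and signature are made up for illustration:

from typing import Optional

def enforce_parallel_tool_calls(tool_calls: list, parallel_tool_calls: Optional[bool]) -> list:
    # Hypothetical post-processing step: if the request disabled parallel
    # tool calls, keep only the first tool call the model produced.
    if parallel_tool_calls is False and len(tool_calls) > 1:
        return tool_calls[:1]
    return tool_calls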

Here's an example request and response:

Request

{
  "messages": [
    {
      "content": "You are a helpful assistant tasked with performing arithmetic on a set of inputs.",
      "role": "system"
    },
    {
      "content": "Add 3 and 4. Multiply the output by 2. Divide the output by 5",
      "role": "user"
    }
  ],
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "stream": false,
  "n": 1,
  "temperature": 0.0,
  "max_tokens": 256,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "add",
        "description": "Adds a and b.",
        "parameters": {
          "properties": {
            "a": {
              "description": "first int",
              "type": "integer"
            },
            "b": {
              "description": "second int",
              "type": "integer"
            }
          },
          "required": ["a", "b"],
          "type": "object"
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "multiply",
        "description": "Multiply a and b.",
        "parameters": {
          "properties": {
            "a": {
              "description": "first int",
              "type": "integer"
            },
            "b": {
              "description": "second int",
              "type": "integer"
            }
          },
          "required": ["a", "b"],
          "type": "object"
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "divide",
        "description": "Divide a and b.",
        "parameters": {
          "properties": {
            "a": {
              "description": "first int",
              "type": "integer"
            },
            "b": {
              "description": "second int",
              "type": "integer"
            }
          },
          "required": ["a", "b"],
          "type": "object"
        }
      }
    }
  ],
  "parallel_tool_calls": false
}

Response

{
  "ChatCompletion": {
    "id": "chat-32cb47446c5b471eba5c91be1755811e",
    "choices": [
      {
        "finish_reason": "tool_calls",
        "index": 0,
        "logprobs": null,
        "message": {
          "content": null,
          "refusal": null,
          "role": "assistant",
          "function_call": null,
          "tool_calls": [
            {
              "id": "chatcmpl-tool-f8c832f4a42445f899a229063004cae9",
              "function": {
                "arguments": '{"a": 3, "b": 4}',
                "name": "add"
              },
              "type": "function"
            },
            {
              "id": "chatcmpl-tool-4b44f70f7dde47d0820f8a3b9018b897",
              "function": {
                "arguments": '{"a": 7, "b": 2}',
                "name": "multiply"
              },
              "type": "function"
            },
            {
              "id": "chatcmpl-tool-d897bd7ecb4b44e59eb718aff21cbfa8",
              "function": {
                "arguments": '{"a": 14, "b": 5}',
                "name": "divide"
              },
              "type": "function"
            }
          ]
        },
        "stop_reason": 128008
      }
    ],
    "created": 1729149431,
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "object": "chat.completion",
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
      "completion_tokens": 67,
      "prompt_tokens": 466,
      "total_tokens": 533,
      "completion_tokens_details": null,
      "prompt_tokens_details": null
    },
    "prompt_logprobs": null
  }
}
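
For reference, the request above can be reproduced with the openai Python client pointed at a vLLM OpenAI-compatible server. This is a sketch: the base_url, and the server having been launched with --enable-auto-tool-choice and --tool-call-parser llama3_json, are assumptions about the deployment:

from openai import OpenAI

# Assumes a local vLLM server started with
# --enable-auto-tool-choice --tool-call-parser llama3_json
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [...]  # the add/multiply/divide definitions from the request above

messages = [
    {"role": "system", "content": "You are a helpful assistant tasked with performing arithmetic on a set of inputs."},
    {"role": "user", "content": "Add 3 and 4. Multiply the output by 2. Divide the output by 5"},
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=messages,
    tools=tools,
    temperature=0.0,
    max_tokens=256,
    parallel_tool_calls=False,  # accepted, but currently ignored by vLLM
)
print(response.choices[0].message.tool_calls)  # three calls come back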

Even if I wanted to make a follow-up call to the model passing the three tool calls at the same time, it complains with the following error:

BadRequestError: Error code: 400 - {'object': 'error', 'message': 'This model only supports single tool-calls at once!', 'type': 'BadRequestError', 'param': None, 'code': 400}

This error comes from the llama3_json chat template.
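
As a client-side workaround until the flag is honored, the tool calls can be replayed one at a time: keep only the first tool call from each assistant turn, execute it locally, append the result as a tool message, and call the model again. A minimal sketch reusing client, tools, and messages from the snippet above; execute_tool is an assumed helper dispatching to local add/multiply/divide implementations:

# Drive the conversation one tool call at a time so the llama3_json
# template's single-tool-call restriction is never violated.
while True:
    message = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=messages,
        tools=tools,
        temperature=0.0,
    ).choices[0].message
    if not message.tool_calls:
        break  # the model produced a final answer
    call = message.tool_calls[0]  # drop any extra parallel calls
    messages.append({"role": "assistant", "tool_calls": [call.model_dump()]})
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": str(execute_tool(call)),  # assumed local dispatcher
    })
print(message.content)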

Thank you, team!

Alternatives

No response

Additional context

No response

Ugo06 commented 2 weeks ago

+1

frei-x commented 3 days ago

+2

Ritesh2910 commented 1 day ago

+3