ml-explore / mlx-examples


Server: support function calling #1003

Closed · madroidmaq closed this 2 months ago

madroidmaq commented 2 months ago

Fixes #784. Related PR: #995.

Some examples of model requests:

#### meta-llama/Llama-3.2-1B-Instruct

- curl

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [
      { "role": "user", "content": "What is the weather in San Francisco?" }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and country, eg. San Francisco, USA"
              },
              "format": { "type": "string", "enum": ["celsius", "fahrenheit"] }
            },
            "required": ["location", "format"]
          }
        }
      }
    ]
  }'
```

- response

```json
{
  "id": "chatcmpl-ce03f09d-fdaf-4a77-b421-146cef104968",
  "system_fingerprint": "fp_0f3acf0e-bc4e-45bc-b2a9-f0d5e6e500ea",
  "object": "chat.completions",
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "created": 1727596395,
  "choices": [{
    "index": 0,
    "logprobs": {
      "token_logprobs": [-0.125, 0.0, -0.625, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      "top_logprobs": [],
      "tokens": [128010, 5018, 1337, 794, 330, 1723, 498, 330, 1723, 794, 330, 456, 11327, 70464, 498, 330, 14105, 794, 5324, 2588, 794, 330, 24661, 13175, 11, 7427, 498, 330, 2293, 794, 330, 66, 41347, 32075, 128008, 128006, 78191, 128007, 271, 128010, 5018, 1337, 794, 330, 1723, 498, 330, 1723, 794, 330, 456, 11327, 70464, 498, 330, 14105, 794, 5324, 2588, 794, 330, 24661, 13175, 11, 7427, 498, 330, 2293, 794, 330, 69, 49010, 32075, 128008, 128006, 78191, 128007, 271, 128010, 5018, 1337, 794, 330, 1723, 498, 330, 1723, 794, 330, 456, 11327, 70464, 498, 330, 14105, 794, 5324, 2588, 794, 330]
    },
    "finish_reason": "length",
    "message": {
      "role": "assistant",
      "content": "<|python_tag|>{\"type\": \"function\", \"function\": \"get_current_weather\", \"parameters\": {\"location\": \"San Francisco, USA\", \"format\": \"celsius\"}}<|eom_id|><|start_header_id|>assistant<|end_header_id|>\n\n<|python_tag|>{\"type\": \"function\", \"function\": \"get_current_weather\", \"parameters\": {\"location\": \"San Francisco, USA\", \"format\": \"fahrenheit\"}}<|eom_id|><|start_header_id|>assistant<|end_header_id|>\n\n<|python_tag|>{\"type\": \"function\", \"function\": \"get_current_weather\", \"parameters\": {\"location\": \""
    }
  }],
  "usage": { "prompt_tokens": 232, "completion_tokens": 100, "total_tokens": 332 }
}
```

#### Qwen/Qwen2.5-0.5B-Instruct

- curl

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      { "role": "user", "content": "What is the weather in San Francisco?" }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and country, eg. San Francisco, USA"
              },
              "format": { "type": "string", "enum": ["celsius", "fahrenheit"] }
            },
            "required": ["location", "format"]
          }
        }
      }
    ]
  }'
```

- response

```json
{
  "id": "chatcmpl-a704f0ce-16dd-45ef-8620-b7dce4b6e8ed",
  "system_fingerprint": "fp_2e665298-ed11-4424-b23f-812fda71d9d4",
  "object": "chat.completions",
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "created": 1727594741,
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "\n{\"name\": \"get_current_weather\", \"arguments\": {\"location\": \"San Francisco, USA\"}}\n"
      }
    }
  ],
  "usage": { "prompt_tokens": 209, "completion_tokens": 24, "total_tokens": 233 }
}
```

#### mistralai/Mistral-7B-Instruct-v0.3

- curl

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
      { "role": "user", "content": "What is the weather in San Francisco?" }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and country, eg. San Francisco, USA"
              },
              "format": { "type": "string", "enum": ["celsius", "fahrenheit"] }
            },
            "required": ["location", "format"]
          }
        }
      }
    ]
  }'
```

- response

```json
{
  "id": "chatcmpl-c724d955-63d3-404d-8af5-e775dfecc27e",
  "system_fingerprint": "fp_bcee6a65-1462-44ec-9272-275e3333063c",
  "object": "chat.completions",
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "created": 1727597917,
  "choices": [{
    "index": 0,
    "logprobs": {
      "token_logprobs": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.625],
      "top_logprobs": [],
      "tokens": [5, 1501, 7567, 1629, 2032, 1113, 1295, 29498, 3790, 29498, 1537, 1991, 1316, 1113, 17452, 2032, 10598, 3501, 2032, 1113, 18672, 10454, 29493, 7803, 1316, 1113, 4530, 2032, 1113, 29485, 1958, 3938, 29507, 1743, 29561, 2]
    },
    "finish_reason": "stop",
    "message": {
      "role": "assistant",
      "content": "[TOOL_CALLS] [{\"name\": \"get_current_weather\", \"arguments\": {\"location\": \"San Francisco, USA\", \"format\": \"celsius\"}}]"
    }
  }],
  "usage": { "prompt_tokens": 114, "completion_tokens": 36, "total_tokens": 150 }
}
```

Limitation

Currently, chat_template is used to encode and process the input, but there is no unified method for handling the output yet, so each model's output must be parsed manually. For details, see: Tool Use, Unified.
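A rough sketch of what that per-model parsing could look like, based only on the three responses above (the marker strings come from those outputs; the function itself is illustrative and not part of this PR):

```python
import json
import re


def parse_tool_calls(model: str, content: str) -> list:
    """Best-effort extraction of tool calls from raw model output.

    Each model family wraps its function-call JSON differently, so the
    text has to be matched against per-model conventions.
    """
    if "Llama-3.2" in model:
        # Llama 3.2 emits <|python_tag|>{...}<|eom_id|> segments.
        chunks = re.findall(r"<\|python_tag\|>(.*?)<\|eom_id\|>", content, re.DOTALL)
    elif "Mistral" in model:
        # Mistral v0.3 emits "[TOOL_CALLS] [{...}, ...]".
        match = re.search(r"\[TOOL_CALLS\]\s*(\[.*\])", content, re.DOTALL)
        return json.loads(match.group(1)) if match else []
    else:
        # Qwen 2.5 returns a bare JSON object in the content.
        chunks = [content.strip()]

    calls = []
    for chunk in chunks:
        try:
            calls.append(json.loads(chunk))
        except json.JSONDecodeError:
            pass  # not a well-formed call; treat as plain text
    return calls
```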

vlbosch commented 2 months ago

I tried this after installing your other PR with function calling support. For example, Qwen consistently replies with the tool/function call and then stops: `{"name": "get_web_search_results", "arguments": {"keyword": "OpenAI function calling format"}}`.

Shouldn't the "finish_reason" be "tool_calls" so that it's OpenAI-compatible, as per https://platform.openai.com/docs/guides/function-calling? I think that's how apps like TypingMind and other front ends know there's a tool to call.

madroidmaq commented 2 months ago

> I tried this after installing your other PR with function calling support. For example, Qwen consistently replies with the tool/function call and then stops: `{"name": "get_web_search_results", "arguments": {"keyword": "OpenAI function calling format"}}`.
>
> Shouldn't the "finish_reason" be "tool_calls" so that it's OpenAI-compatible, as per https://platform.openai.com/docs/guides/function-calling? I think that's how apps like TypingMind and other front ends know there's a tool to call.

You are right, "finish_reason" should be "tool_calls". As mentioned in the link above, function-call results are still parsed manually, so there is currently no reliable way to determine whether the generation ended because of a tool call. This PR is not a complete implementation of the feature; perhaps once the tokenizer supports something like a decode_function_call function, we will be able to fully support function calling.
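To illustrate the idea: with such a hook, the server could stay model-agnostic. This is purely hypothetical; decode_function_call does not exist in transformers today:

```python
# Hypothetical API sketch: `decode_function_call` is the imagined
# tokenizer method from the comment above, not a real transformers API.
tool_calls = tokenizer.decode_function_call(generated_token_ids)
if tool_calls:
    finish_reason = "tool_calls"
else:
    finish_reason = "stop" if stopped_on_eos else "length"
```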

awni commented 2 months ago

> You are right, "finish_reason" should be "tool_calls".

Does it make sense to do something like the following (sketched below)?

1. Check if tools is in the request.
2. If it is and the stop condition is met, set the "finish_reason" to "tool_calls".
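A minimal sketch of that heuristic, with assumed variable names (the actual handler in mlx_lm.server may differ):

```python
# Sketch of the proposed heuristic; `request_body` is the parsed JSON
# request and `stopped` is True when generation ended on a stop/EOS
# token rather than on the max-token limit.
if request_body.get("tools") and stopped:
    finish_reason = "tool_calls"
else:
    finish_reason = "stop" if stopped else "length"
```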
madroidmaq commented 2 months ago

> You are right, "finish_reason" should be "tool_calls".
>
> Does it make sense to do something like the following?
>
> 1. Check if tools is in the request.
> 2. If it is and the stop condition is met, set the "finish_reason" to "tool_calls".

The situation may not be that simple, because including the tools parameter does not guarantee a function call. For example, the following request includes tools but returns a plain-text answer.

#### Qwen/Qwen2.5-0.5B-Instruct

- curl

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      { "role": "user", "content": "What is the weather in San Francisco?" }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The city and country, eg. San Francisco, USA"
              },
              "format": { "type": "string", "enum": ["celsius", "fahrenheit"] }
            },
            "required": ["location", "format"]
          }
        }
      }
    ]
  }'
```

- response

```json
{
  "id": "chatcmpl-59ec564f-b853-4ae5-a3ee-d3ad23da1548",
  "system_fingerprint": "fp_f3f9b5a0-d055-4953-aeb6-71d43470e62e",
  "object": "chat.completions",
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "created": 1727789727,
  "choices": [{
    "index": 0,
    "logprobs": {
      "token_logprobs": [-1.875, -0.125, 0.0, 0.0, 0.0, -1.125, -0.125, -2.5, -0.5, 0.0, -0.25, -0.625, 0.0, -0.375, -2.25, -5.125, 0.0, -0.125, -3.125, -0.875, -0.375, -6.375, 0.0, 0.0, -0.125, -4.875, -0.75, -4.0, -1.625, -3.875, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0, -0.375, -1.0, -0.25, -0.75, 0.0, -1.625, 0.0, -0.25, -0.625, -2.125, -0.5, -1.25, -0.125, 0.0, -1.625, -1.375, -0.875, -0.75, -1.125, -0.75, -1.0, -0.125, -0.25, -1.875, -3.0, -0.75, -0.375, -0.375, -1.625, -1.875, -4.375, -0.875, -4.375, -2.0, -4.5, -2.625, -0.25, 0.0, -1.25, -0.5, -0.375, 0.0, -0.375, -2.125, -2.5, -0.125, -2.0, -2.125, -2.0, -0.625, -3.375, -0.875, -1.0, -0.125, 0.0, -0.25, 0.0, -2.6875, -5.1875, -0.25, 0.0],
      "top_logprobs": [],
      "tokens": [785, 9104, 304, 5836, 12879, 646, 13289, 11941, 11649, 389, 279, 882, 315, 1042, 11, 20849, 11, 323, 2205, 9977, 4682, 2878, 279, 3283, 13, 16246, 429, 498, 4588, 330, 3838, 374, 279, 9104, 304, 5836, 12879, 30, 3670, 358, 646, 3410, 498, 448, 4586, 1995, 911, 279, 2205, 9977, 304, 5836, 12879, 11, 714, 358, 4157, 15440, 279, 1482, 9104, 4682, 476, 7023, 3853, 9104, 12624, 13, 1752, 23560, 9104, 4682, 323, 803, 11682, 1995, 11, 432, 594, 1850, 311, 1779, 279, 2205, 9104, 2473, 11, 279, 3946, 3033, 3910, 315, 5836, 12879, 11, 476, 279, 7043, 5887, 315]
    },
    "finish_reason": "length",
    "message": {
      "role": "assistant",
      "content": "The weather in San Francisco can vary significantly depending on the time of year, latitude, and local climate conditions within the city. Given that you asked \"What is the weather in San Francisco? \", I can provide you with general information about the local climate in San Francisco, but I cannot guarantee the current weather conditions or predict future weather patterns. For precise weather conditions and more detailed information, it's best to check the local weather service, the official government website of San Francisco, or the California Department of"
    }
  }],
  "usage": { "prompt_tokens": 37, "completion_tokens": 100, "total_tokens": 137 }
}
```

The approach I currently recommend (see the sketch after this list) is:

1. Check whether "finish_reason" is "stop";
2. If so, manually parse the output to see whether it matches each model's function-call rules.
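A client-side sketch of that recommendation, assuming the server from this PR is running on localhost:8080 and using Qwen's bare-JSON convention from the examples above:

```python
import json

import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "format": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "format"],
        },
    },
}]

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
        "tools": tools,
    },
).json()

choice = resp["choices"][0]
if choice["finish_reason"] == "stop":
    # Step 2: try to interpret the content under the model's own
    # function-call convention (Qwen emits a bare JSON object).
    try:
        call = json.loads(choice["message"]["content"].strip())
        print("tool call:", call["name"], call["arguments"])
    except (json.JSONDecodeError, KeyError):
        print("plain text:", choice["message"]["content"])
```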

Of course, as mentioned above, this submission does not fully support the function calling feature. Full support will depend on the Hugging Face transformers library providing similar functionality.

alelordelo commented 1 month ago

@awni, @madroidmaq,

How can we check whether "finish_reason" is "stop" in an MLXLLM response using Llama 3.2?

I am testing Llama 3.2 tool calls using: https://github.com/mainframecomputer/fullmoon-ios

I did a first test with the prompt mentioned in the example: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/

The MLXLLM response using Llama 3.2 is below, but I don't see any "finish_reason" of "stop":

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Questions: Can you retrieve the details for the user with the ID 7890, who has black as their special request?
Here is a list of functions in JSON format that you can invoke:
[
    {
        "name": "get_user_info",
        "description": "Retrieve details for a specific user by their unique identifier. Note that the provided function is in Python 3 syntax.",
        "parameters": {
            "type": "dict",
            "required": [
                "user_id"
            ],
            "properties": {
                "user_id": {
                "type": "integer",
                "description": "The unique identifier of the user. It is used to fetch the specific user details from the database."
            },
            "special": {
                "type": "string",
                "description": "Any special information or parameters that need to be considered while fetching user details.",
                "default": "none"
                }
            }
        }
    }
]
Should you decide to return the function call(s), Put it in the format of [func1(params_name=params_value, params_name2=params_value2...), func2(params)]
NO other text MUST be included.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```