ShervK opened this issue 2 weeks ago
I would prefer a middleware implementation so it can easily be used with different providers such as Ollama, llama.cpp (future), OpenAI compat
Just to clarify - are you suggesting this feature should be implemented as middleware instead and added to the SDK?
If not, would you like me to contribute some documentation around doing this with middleware? Figure it might help others who are trying to use the SDK with their own hosted models and want an example for handling tool calls; it could include an example for Llama or Qwen 2.5.
I'm quite intrigued by this feature. The idea is that the middleware injects the tool definitions into the system prompt for models that don't support tools natively, and when a special token like `<tool_call>` shows up in the output, the middleware parses it back into a structured tool call.
We can build a POC of something that isn't currently feasible in the AI SDK. If it's successful and the performance is good, it could be included in a library such as ai/middleware.
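Roughly, I imagine the generate-side half looking something like this (untested sketch; the exact `wrapLanguageModel` / `LanguageModelV1Middleware` exports may be `experimental_`-prefixed depending on the SDK version, and the prompt-injection half via `transformParams` is omitted here):

```ts
// Sketch only: parse <tool_call>{...}</tool_call> blocks emitted by a
// Hermes-style prompted model back into SDK tool calls.
import { wrapLanguageModel, type LanguageModelV1Middleware } from "ai";

const TOOL_CALL_RE = /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g;

export const hermesToolMiddleware: LanguageModelV1Middleware = {
  wrapGenerate: async ({ doGenerate }) => {
    const result = await doGenerate();
    const text = result.text ?? "";

    // Collect every <tool_call> block and convert it into the SDK's
    // function-tool-call shape ({ toolCallType, toolCallId, toolName, args }).
    const toolCalls = [...text.matchAll(TOOL_CALL_RE)].map((match, i) => {
      const parsed = JSON.parse(match[1]); // expected: { name, arguments }
      return {
        toolCallType: "function" as const,
        toolCallId: `call_${Date.now()}_${i}`, // placeholder id scheme
        toolName: parsed.name,
        args: JSON.stringify(parsed.arguments ?? {}),
      };
    });

    if (toolCalls.length === 0) return result;

    return {
      ...result,
      // Strip the raw tags from the visible text and surface the parsed calls.
      text: text.replace(TOOL_CALL_RE, "").trim(),
      toolCalls,
      finishReason: "tool-calls",
    };
  },
};

// Usage (yourModel is any LanguageModelV1 instance):
// const model = wrapLanguageModel({ model: yourModel, middleware: hermesToolMiddleware });
```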
@ShervK For a quick, full POC example, please refer to the repo below: https://github.com/minpeter/ai-sdk-preview/tree/tool-call-middleware/packages/ai-config
I implemented prompt-guided tool calling in the Hermes 3 style and got decent performance. I tested on Qwen-based 32B and 72B models and successfully ran the multi-step example. However, when tested on smaller models, it often hallucinated or output incorrect JSON.
Also, I currently have the schema hardcoded and haven't handled parallel tool calls. Neither is a big problem and both will be fixed soon.
Additionally, we (that is, FriendliAI) are preparing to offer tool calls exclusively for custom models on dedicated endpoints. It includes a cool feature that suppresses hallucinations in tool calls on smaller models. If you're interested, drop me an email at minpeter@friendli.ai. (You can also try out Llama 3.1 8B, which already has this feature, on our serverless endpoints.)
@minpeter Thanks for the example! We're actually using middleware in our project for other features already but not for tool parsing.
In our case, we ended up making a vLLM provider with built-in tool parsers (similar to my first example) since we're running multiple models and wanted to keep the parsing logic close to each model's implementation. Made it way easier to test different models and kept our code cleaner by not having to juggle multiple middleware functions, or one big middleware.
That said, your POC with Hermes 3 is super interesting. Let me know if you need help with the parallel tool support you mentioned; we did something similar with Llama 3.1 and got parallel tool calling working.
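Roughly, the normalization we ended up with looks like this (simplified sketch, not our exact code; Llama's JSON format uses `parameters` where Hermes-style uses `arguments`):

```ts
// A single assistant turn may contain one JSON object or a JSON array of
// tool calls; normalize both to a list.
type RawToolCall = { name: string; arguments?: unknown; parameters?: unknown };

export function parseToolCallPayload(payload: string): RawToolCall[] {
  const parsed = JSON.parse(payload);
  return Array.isArray(parsed) ? parsed : [parsed];
}

// Normalize the argument field name across formats.
export function toolArgs(call: RawToolCall): string {
  return JSON.stringify(call.arguments ?? call.parameters ?? {});
}
```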
@ShervK That's interesting! The tool call template types I currently know of are Llama, Hermes, and Mistral v3 Tekken. Are there any other tool call formats you know of that the parser would need to handle?
It would be great if you could define and use this kind of middleware somewhere in the ai sdk library.
@minpeter Your implementation is great! With some slight changes it can be used with Qwen or Mistral models, so it's pretty well set up. However, it might cause some problems for users who try to hook it up to Llama 3.1 or 3.2, due to Llama's prompt format and how "chatty" it can be.
One problem that kept coming up for us is that Llama's format for its code interpreter doesn't really have an end tag. It does have the <|eom_id|> token as the "end tag", but then you have to ask for the bos tokens on each request and filter them out, which we didn't like.
This also means you have to juggle when Llama is using the code interpreter vs. when it's using a built-in tool.
You can try to force Llama to adhere to your own format for some of this, but I've found that it reduces the reliability of its tool responses, hence why there are so many finetunes for using tools with Llama. I only mention all this because of how popular Llama 3 is.
Another thing is that Llama is quite chatty at temperatures higher than 0.5, so sometimes it'll respond like:

> Sure! Let me do that for you?
> {tool call}
> and I'll check this too
> {tool call}

So our parser also handles these interleaved tool calls. We didn't want to discourage it, since our users liked seeing responses like this; they said it makes the chat feel more interactive.
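Our splitting logic is roughly like this (simplified sketch with a naive regex, not our production code):

```ts
// Split a chatty response into ordered text and tool-call parts so the
// conversational bits can still be shown to users. The regex only handles
// shallow JSON; nested objects would need a real scanner.
const TOOL_JSON_RE =
  /\{"name"\s*:\s*"[^"]+"\s*,\s*"(?:parameters|arguments)"\s*:\s*\{[\s\S]*?\}\}/g;

type Part =
  | { type: "text"; text: string }
  | { type: "tool-call"; name: string; args: string };

export function splitInterleaved(response: string): Part[] {
  const parts: Part[] = [];
  let cursor = 0;
  for (const match of response.matchAll(TOOL_JSON_RE)) {
    const before = response.slice(cursor, match.index).trim();
    if (before) parts.push({ type: "text", text: before });
    const call = JSON.parse(match[0]);
    parts.push({
      type: "tool-call",
      name: call.name,
      args: JSON.stringify(call.parameters ?? call.arguments ?? {}),
    });
    cursor = (match.index ?? 0) + match[0].length;
  }
  const rest = response.slice(cursor).trim();
  if (rest) parts.push({ type: "text", text: rest });
  return parts;
}
```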
Lastly, the smaller Llama models (1B/3B) can be inconsistent with their tool calling:

- Sometimes they prepend <|python_tag|> even if it's a JSON response.
- Sometimes they use ; instead of , inside the JSON.
- Sometimes they wrap the JSON tool call in a ```json ... ``` markdown code block.

That last point shows up more often with smaller models in general, seen it with Qwen2.5 0.5B-7B and Ministral 3B. So it might help to include a replace:
````js
// .replace() returns a new string, so collect the results with .map()
const cleanedToolCalls = toolCallString.map((toolCall) =>
  toolCall.replace("```json", "").replace("```", "").trim()
);
````
Anyway, it might be beneficial to include your example for regular JSON-based tool formats as well as one for Llama 3. I think it would help a lot of people to include either an example or a note on some of this stuff so they don't run into the same problems I did. I'm happy to help and contribute to this, whether it's writing docs or code examples.
@ShervK It would be nice to see some guidance added for small models. If there was an option to "correct" all possible mistakes, it would improve tool calling performance.
To make any arbitrary model good at tool use, two challenges need to be solved: (1) prompting the model so that it emits tool calls in a format we can reliably parse, and (2) representing tool calls and tool results when the interface has no tool role.
Assuming we cannot modify the model's base chat template, the prompt content is the only thing we can control (and we assume every base chat template is a constrained template that only supports rendering system, user, and assistant messages).
Here are the structured output formats that each model I know of does well:
Llama [Built-in Tools (Brave, Wolfram)]

```
# for Search
<|python_tag|>
brave_search.call(query="...")
<|eom_id|>

# for Wolfram
<|python_tag|>
wolfram_alpha.call(query="...")
<|eom_id|>
```

Llama [JSON-based tool calling]

```
{"name": "get_current_conditions", "parameters": {"location": "San Francisco, CA", "unit": "Fahrenheit"}}<|eot_id|>
```

Llama [User-defined custom tool calling]

```
<function=spotify_trending_songs>{"n": "5"}</function><|eom_id|>
```

Qwen, Hermes

```
<tool_call>
{"arguments": <args-dict>, "name": <function-name>}
</tool_call>
```

Mistral v3 Tekken

```
[TOOL_CALLS] [{"name": "get_current_weather", "arguments": {"location": "Paris, France", "format": "celsius"}, "id": "VvvODy9mT"}]</s>
```
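Since these formats mostly differ in their wrapper tokens, the parser could probably be selected per model family, something like this (regexes are rough approximations, not tested against every template):

```ts
// Per-family extraction patterns; feed each match group into JSON.parse
// (or a small function-call parser for the Llama custom format).
const extractors: Record<string, RegExp> = {
  hermes: /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g,            // Qwen, Hermes
  "llama-custom": /<function=([\w-]+)>([\s\S]*?)<\/function>/g,   // Llama user-defined
  mistral: /\[TOOL_CALLS\]\s*(\[[\s\S]*?\])/g,                    // Mistral v3 Tekken
};
```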
First, we should investigate which of the three formats supported by the Llama model works best for custom tools.
@minpeter Just to make sure I understand - are you suggesting we implement a middleware that adds tool calling support to models that don't natively support it?
My original feature request was actually focused on parsing tool calls from models that already support tools but are being served through APIs that don't expose the tool parameter.
While standardizing tool formats across models would be interesting, I think it might be risky since models not trained for tool use could generate unreliable responses.
First of all, it's true that this cannot be used with models that haven't learned how to use tools. If a specific API blocks the `tools` field, we cannot insert tool-related content into that model's chat template, so we have to approach it assuming that only the system, user, and assistant roles are available. This doesn't mean a model that never learned tool calling can suddenly do it; it's just a description of how to bypass the interface to access the model.
For example, let's say you're running the Mistral v0.3 7B model on a vLLM endpoint without adding the --enable-auto-tool-choice --tool-call-parser mistral options.
We know that the Mistral chat template contains rendering logic for the 'tool' role and for the assistant making a tool_call, but since vLLM doesn't expose it here, we need to decorate the tool call as a conversation between the assistant and the user.
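A rough sketch of that decoration step using the middleware's `transformParams` hook (message shapes follow the LanguageModelV1 prompt spec as I understand it, so double-check them against your SDK version; the matching rewrite of assistant tool-call parts into text is omitted):

```ts
import type { LanguageModelV1Middleware } from "ai";

export const toolRoleRewriteMiddleware: LanguageModelV1Middleware = {
  transformParams: async ({ params }) => {
    const prompt = params.prompt.map((message) => {
      if (message.role !== "tool") return message;
      // Re-encode tool results as a user message wrapped in a tag the model
      // was prompted to expect, since the endpoint only accepts
      // system/user/assistant turns.
      return {
        role: "user" as const,
        content: message.content.map((part) => ({
          type: "text" as const,
          text: `<tool_response>${JSON.stringify({
            name: part.toolName,
            result: part.result,
          })}</tool_response>`,
        })),
      };
    });
    return { ...params, prompt };
  },
};
```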
This is about challenge number 2 mentioned above.
@minpeter Ohhh my mistake, I was getting lost there for a second. I submitted a draft PR against your POC for Llama 3 tool parsing to show how we've been doing it; wanted to get opinions on it. I opted for JSON-based tool calling since even the reference llama-stack implementation uses the JSON format as the default, but I still added support for Llama's built-in tools. More details in the PR description.
It seems like we can achieve the desired behavior without modifying core functionality. I'll take the time to continue exploring it.
Closing this issue since it's decided that middleware is the way to go, with a link to an example thanks to @minpeter.
Keeping it open in case we want to integrate the middleware into the SDK.
Feature Description
When using LLM serving frameworks such as vLLM or MLC-LLM, or services that host open-source models like DeepInfra, Fireworks, or OpenRouter, you sometimes run into an issue where the model being served doesn't have a dedicated tool parser even though the model itself supports tool use. This usually means the `tools` parameter in their OpenAI-compatible API either doesn't work or causes an error, and you have to manually parse the chat completion for tool calls after each request.

While creating a custom provider can address this, it needs ongoing maintenance and may mean missing out on new provider features unless they're implemented manually.
To address this, I suggest adding a setting for a custom tool parser that can be passed to the OpenAI provider when in `compatible` mode. This feature would allow you to define a function that processes either the response message when using `generateText`, or the stream when using `streamText`, to determine whether the response includes a tool call. This way, you keep all the benefits of the SDK's tool features while serving your own models or using an open-source model hosting service.

Example usage of a basic parser
Use Case
Additional context
For my team’s project, we host several open-source models and switch between them based on the situation or context—Llama 3.1 for general conversations, Mistral for RAG use cases, Qwen for coding, etc. This has led to a lot of iteration on custom providers to support tool use across models, so having this level of customization natively in the SDK would be great. It would let us use the AI SDK for our internal LLM tooling as well (benchmarks, RAG arenas).
I'm not married to the example I showed above, we can discuss a different implementation. I’d be more than happy to work on this and submit a PR if that’s helpful.