vercel / ai

Build AI-powered applications with React, Svelte, Vue, and Solid
https://sdk.vercel.ai/docs

Custom Tool Parser for Open Source Models #3521

Open ShervK opened 2 weeks ago

ShervK commented 2 weeks ago

Feature Description

When using LLM serving frameworks such as vLLM or MLC-LLM, or services that host open-source models like DeepInfra, Fireworks, or OpenRouter, you sometimes run into the issue that the model being served doesn't have a dedicated tool parser even though the model itself supports tool use. This usually means the tools parameter in their OpenAI-compatible API either doesn't work or causes an error, and you have to manually parse the chat completion for tool calls after the request.

While creating a custom provider can address this, it'll need ongoing maintenance and may lead to missing out on new provider features unless manually implemented.

To address this, I suggest adding a setting for a custom tool parser that can be passed to the OpenAI provider when in compatible mode. This feature would allow you to define a function that processes either the response message when using generateText or a stream when using streamText to determine if the response includes a tool call. This way, you can still keep all the benefits of the tool features of the SDK while serving your own models or using an open source model hosting service.

Example usage of a basic parser

import { createOpenAI } from '@ai-sdk/openai';
import { isParsableJson } from "@ai-sdk/provider-utils";
import type { LanguageModelV1StreamPart } from "@ai-sdk/provider";

const llama = createOpenAI({
  // other settings
  compatibility: 'compatible',
  // processes the full completion when using generateText
  textToolParser: (response: string) => {
    if (!response.startsWith("<|python_tag|>")) return [];
    response = response.replace("<|python_tag|>", "");
    if (!isParsableJson(response)) {
      return [];
    }
    const parsed: Array<{ name: string; arguments: Record<string, unknown> }> =
      JSON.parse(response);
    return parsed;
  },
  // processes stream parts when using streamText
  streamToolParser: (chunk: LanguageModelV1StreamPart) => {
    if (chunk.type !== "text-delta") return;
    if (chunk.textDelta.startsWith("<|python_tag|>")) {
      // rest of the implementation
    }
  },
});

Use Case

Additional context

For my team’s project, we host several open-source models and switch between them based on the situation or context—Llama 3.1 for general conversations, Mistral for RAG use cases, Qwen for coding, etc. This has led to a lot of iteration on custom providers to support tool use across models, so having this level of customization natively in the SDK would be great. It would let us use the AI SDK for our internal LLM tooling as well (benchmarks, RAG arenas).

I'm not married to the example I showed above; we can discuss a different implementation. I'd be more than happy to work on this and submit a PR if that's helpful.

lgrammel commented 2 weeks ago

I would prefer a middleware implementation so it can easily be used with different providers such as Ollama, llama.cpp (future), and OpenAI-compatible APIs.
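For context, a minimal sketch of what that could look like with the SDK's language-model middleware (exported as experimental_wrapLanguageModel / Experimental_LanguageModelV1Middleware in recent releases; names may differ by version). The parseToolCallsFromText helper, the <|python_tag|> convention, the baseURL, and the model id are placeholders, not a definitive implementation:

import { experimental_wrapLanguageModel as wrapLanguageModel } from 'ai';
import type { Experimental_LanguageModelV1Middleware as LanguageModelV1Middleware } from 'ai';
import type { LanguageModelV1FunctionToolCall } from '@ai-sdk/provider';
import { createOpenAI } from '@ai-sdk/openai';

// Hypothetical parser: turns raw text such as
// <|python_tag|>{"name": "...", "arguments": {...}} into tool-call objects.
function parseToolCallsFromText(text: string): LanguageModelV1FunctionToolCall[] {
  if (!text.startsWith('<|python_tag|>')) return [];
  try {
    const parsed = JSON.parse(text.replace('<|python_tag|>', ''));
    const calls = Array.isArray(parsed) ? parsed : [parsed];
    return calls.map((call, index) => ({
      toolCallType: 'function' as const,
      toolCallId: `call_${index}`,
      toolName: call.name,
      args: JSON.stringify(call.arguments ?? {}),
    }));
  } catch {
    return []; // not valid JSON, treat as plain text
  }
}

const toolParserMiddleware: LanguageModelV1Middleware = {
  wrapGenerate: async ({ doGenerate }) => {
    const result = await doGenerate();
    const toolCalls = result.text ? parseToolCallsFromText(result.text) : [];
    // Only rewrite the result when the text actually contained a tool call.
    return toolCalls.length > 0
      ? { ...result, text: undefined, toolCalls, finishReason: 'tool-calls' as const }
      : result;
  },
};

// Wrap any OpenAI-compatible endpoint that serves the model.
const llama = wrapLanguageModel({
  model: createOpenAI({
    baseURL: 'http://localhost:8000/v1', // placeholder self-hosted endpoint
    compatibility: 'compatible',
  })('hosted-llama-3.1-70b'), // placeholder model id
  middleware: toolParserMiddleware,
});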

ShervK commented 2 weeks ago

Just to clarify - are you suggesting this feature should be implemented as middleware instead and added to the SDK?

If not, would you like me to contribute some documentation around this using middleware? I figure it might help others who are trying to use the SDK with their own hosted models and want an example of handling tool calls; I could include an example for Llama or Qwen 2.5.

minpeter commented 1 week ago

I'm quite intrigued by this feature. The idea is that, for models that don't support tools, your middleware can define a system prompt containing the tool definitions, and if a special token is returned, the middleware can invoke a tool_call parser to call the tool.

minpeter commented 1 week ago

We can build a POC of something that isn't currently feasible in the AI SDK. If it's successful and the performance is good, it could be included in a library such as ai/middleware.

minpeter commented 1 week ago

@ShervK For a quick, full POC example, please refer to the repo below: https://github.com/minpeter/ai-sdk-preview/tree/tool-call-middleware/packages/ai-config

I implemented Hermes 3-style tool calling through prompting and got decent performance. I tested on Qwen-based 32B and 72B models and successfully ran the multi-step example. However, when tested on smaller models, it often caused hallucinations or output incorrect JSON.

Also, I currently have the schema hardcoded and haven't handled parallel tool calls. That's not a big problem and will be fixed soon.

Additionally, we (that is, FriendliAI) are preparing to offer tool calls for custom models exclusively on dedicated endpoints. It includes a cool feature that suppresses hallucinations in tool calls on smaller models. If you're interested, drop me an email at minpeter@friendli.ai. (You can also try out Llama 3.1 8B, which already has this feature, on serverless endpoints.)

ShervK commented 1 week ago

@minpeter Thanks for the example! We're actually using middleware in our project for other features already but not for tool parsing.

In our case, we ended up making a vLLM provider with built-in tool parsers (similar to my first example), since we're running multiple models and wanted to keep the parsing logic close to each model's implementation. It made it way easier to test different models and kept our code cleaner, since we didn't have to juggle multiple middleware functions or one big middleware.

That said, your POC with Hermes 3 is super interesting. Let me know if you need help with the parallel tools support you mentioned; we did something similar with Llama 3.1 and got parallel tool calling working.

minpeter commented 1 week ago

@ShervK That's interesting! The tool call templates I currently know of include Llama, Hermes, and Mistral v3 Tekken. Do you know of any other tool call parsers that would need to be implemented?

It would be great if this kind of middleware could be defined and made available somewhere in the AI SDK library.

ShervK commented 1 week ago

@minpeter Your implementation is great! With some slight changes it can be used with Qwen or Mistral models, so it's pretty well set up. However, it might cause some problems for users who try to hook it up to Llama 3.1 or 3.2, due to Llama's prompt format and how "chatty" it can be.

One problem that kept coming up for us is that Llama's format for its code interpreter doesn't really have an end tag. They do have the <|eom_id|> token as the "end tag", but then you have to ask for the bos tokens on each request and filter them out, which we didn't like. This also means you have to juggle when Llama is using the code interpreter vs. when it's using a built-in tool.

You can try to force Llama to adhere to your own format for some of this, but I've found that it reduces the reliability of its tool responses, hence the many finetunes for tool use with Llama. I only mention all this because of how popular Llama 3 is.

Another thing is that Llama is quite chatty at temperatures higher than 0.5, so sometimes it'll respond with something like

Sure! Let me do that for you?

{tool call}

and I'll check this too

{tool call}

So our parser also handles these interleaved tool calls. We didn't want to discourage it, since our users liked seeing responses like this; they said it makes the model feel more interactive.
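For what it's worth, here is a rough sketch of the kind of extraction we do for those interleaved responses, assuming the flat JSON object format from the examples above; the helper name and return shape are just illustrative:

// Hypothetical helper: scan a chatty response for JSON objects that look like
// tool calls ({"name": ..., "parameters": {...}} or {"name": ..., "arguments": {...}})
// interleaved with normal prose.
function extractInterleavedToolCalls(text: string) {
  const toolCalls: Array<{ name: string; args: Record<string, unknown> }> = [];

  // Matches JSON objects with at most one level of nesting, which is enough
  // for flat tool-call payloads.
  const candidates = text.match(/\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}/g) ?? [];

  for (const candidate of candidates) {
    try {
      const parsed = JSON.parse(candidate);
      if (typeof parsed.name === 'string') {
        toolCalls.push({
          name: parsed.name,
          args: parsed.parameters ?? parsed.arguments ?? {},
        });
      }
    } catch {
      // Not valid JSON -- just prose that happened to contain braces.
    }
  }

  return toolCalls;
}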

Lastly, the smaller Llama models (1B/3B) can be inconsistent with their tool calling.

That last point shows up more often with smaller models as well; I've seen it with Qwen2.5 0.5B-7B and Ministral 3B. So it might help to include a replace step that strips stray markdown fences, for example:

// String.replace returns a new string, so map into a cleaned copy
// instead of calling replace inside forEach and discarding the result.
const cleanedToolCalls = toolCallStrings.map((toolCall) =>
  toolCall
    .replace("```json", "")
    .replace("```", "")
    // ...rest of the cleanup
    .trim()
);

Anyway, it might be beneficial to include your example for regular JSON based tool formats as well as one for Llama 3, I think it would help a lot of people to include either an example or a note on some of this stuff so they don't go through the same problems I did. I'm happy to help and contribute for this, whether it's writing docs or showing code examples.

minpeter commented 1 week ago

@ShervK It would be nice to see some guidance added for small models. If there were an option to "correct" all possible mistakes, it would improve tool calling performance.

To make any random model good for tool use, two challenges need to be solved.

  1. Based on the tool information included in the system prompt, get the model to produce structured output in a parsable format (the structured output format that performs well may vary by model); see the prompt sketch after this list.
  2. Execute the tool_call successfully and feed the results back in a way the model can understand, without conflicting with the existing templates.
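As an illustration of challenge 1, a prompt builder along these lines could render the tool definitions into the system prompt; the wording, helper name, and requested output format are just one possible choice, not a fixed design:

// Hypothetical prompt builder for challenge 1: render tool definitions
// (name, description, JSON schema) into a system prompt the model can follow.
interface ToolDefinition {
  name: string;
  description?: string;
  parameters: Record<string, unknown>; // JSON schema for the arguments
}

function buildToolSystemPrompt(tools: ToolDefinition[]): string {
  const toolList = tools
    .map((tool) =>
      JSON.stringify({
        name: tool.name,
        description: tool.description,
        parameters: tool.parameters,
      }),
    )
    .join('\n');

  return [
    'You have access to the following tools:',
    toolList,
    'To call a tool, respond with a single JSON object on its own line:',
    '{"name": "<tool name>", "arguments": {<arguments matching the schema>}}',
  ].join('\n\n');
}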
minpeter commented 1 week ago

Assuming that we cannot modify the model's base template, there is only one thing we can control: the content of the system, user, and assistant messages (assuming that all base chat templates are constrained templates that only support rendering system-user-assistant turns).

Here are the structured output formats that each model I know of handles well (a parsing sketch for the Qwen/Hermes format follows this list):

Llama [Built-in Tools (Brave, Wolfram)]

# for Search
<|python_tag|>
brave_search.call(query="...")
<|eom_id|>

# for Wolfram
<|python_tag|>
wolfram_alpha.call(query="...")
<|eom_id|>

Llama [JSON base]

{"name": "get_current_conditions", "parameters": {"location": "San Francisco, CA", "unit": "Fahrenheit"}}<|eot_id|>

Llama [User-defined custom tool calling]

<function=spotify_trending_songs>{"n": "5"}</function><|eom_id|>

Qwen, Hermes

<tool_call>
{"arguments": <args-dict>, "name": <function-name>}
</tool_call>

Mistral v3 Tekken

[TOOL_CALLS] [{"name": "get_current_weather", "arguments": {"location": "Paris, France", "format": "celsius"}, "id": "VvvODy9mT"}]</s>
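As an example of how the Qwen/Hermes format above could be parsed, a minimal extraction might look like this (the helper name and return shape are just for illustration):

// Hypothetical parser for the Hermes/Qwen-style format:
// <tool_call>{"arguments": {...}, "name": "..."}</tool_call>
function parseHermesToolCalls(text: string) {
  const toolCalls: Array<{ name: string; args: Record<string, unknown> }> = [];
  const matches = text.matchAll(/<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g);

  for (const match of matches) {
    try {
      const parsed = JSON.parse(match[1]);
      toolCalls.push({ name: parsed.name, args: parsed.arguments ?? {} });
    } catch {
      // Malformed JSON inside the tags -- skip this candidate.
    }
  }

  return toolCalls;
}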
minpeter commented 1 week ago

First, we should investigate which of the three formats supported by the Llama model works best with custom tools.

ShervK commented 1 week ago

@minpeter Just to make sure I understand - are you suggesting we implement a middleware that adds tool calling support to models that don't natively support it?

My original feature request was actually focused on parsing tool calls from models that already support tools but are being served through APIs that don't expose the tool parameter.

While standardizing tool formats across models would be interesting, I think it might be risky since models not trained for tool use could generate unreliable responses.

minpeter commented 1 week ago

First of all, it's true that this can't be used with models that haven't been trained to use tools. If a specific API blocks the tools: field, we can't insert tool-related content into that model's chat template, so we have to approach it assuming that only system-user-assistant roles are available. This doesn't mean a model that never learned tool calls can suddenly do this; it's just a description of how to work around the interface to access the model.

minpeter commented 1 week ago

For example, let's say you're running the Mistral v0.3 7B model on a vLLM endpoint without adding the --enable-auto-tool-choice --tool-call-parser mistral options.

We know that the Mistral model's chat template contains rendering logic for the 'tool' role and for the assistant to make a tool_call, but since vLLM doesn't expose it, we need to dress the tool call up as a conversation between the assistant and the user.

This relates to challenge 2 mentioned above.
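A sketch of that "decoration" step might look like the following; the <tool_response> wrapper is just one convention (borrowed from Hermes-style prompts) and the message shape is simplified:

// Hypothetical rewrite for challenge 2: if the endpoint only accepts
// system/user/assistant roles, re-encode tool results as user messages.
type ChatMessage = {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
  name?: string; // tool name for tool-result messages
};

function downgradeToolMessages(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((message) =>
    message.role === 'tool'
      ? {
          role: 'user' as const,
          content: `<tool_response>\n${JSON.stringify({
            name: message.name,
            content: message.content,
          })}\n</tool_response>`,
        }
      : message,
  );
}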

ShervK commented 1 week ago

@minpeter Ohhh my mistake, I was getting lost there for a second. I submitted a draft PR to your POC for Llama 3 tool parsing to show how we've been doing it; I wanted to get opinions on it. I opted for JSON-based tool calling, since even the reference llama-stack implementation uses the JSON format as the default, but I still added support for Llama's built-in tools. More details are in the PR description.

minpeter commented 1 week ago

It seems like we can achieve the desired functionality without modifying core functionality. I'll take the time to continue exploring it.

ShervK commented 4 days ago

Closing this issue since it's been decided that middleware is the way to go, with a link to an example thanks to @minpeter.

lgrammel commented 4 days ago

Keeping it open in case we want to integrate the middleware into the SDK.