nath1295 / MLX-Textgen

A Python package for serving LLMs on OpenAI-compatible API endpoints with prompt caching, using MLX.
MIT License

Feature request: tool/function calling #1

Closed · vlbosch closed this 2 weeks ago

vlbosch commented 1 month ago

Recently MLX got support for function calling. The output must still be parsed manually, so it's not OpenAI-compliant. Supporting the OpenAI HTTP server specification could be a feature that sets this project apart from the rest, making it a drop-in replacement for the OpenAI API with local models. I would be happy to help you implement it.
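
Roughly, the manual flow today looks something like the sketch below; the model name and the `<tool_call>` tag format are assumptions here (they vary between chat templates), not this project's API:

```python
# Sketch of the current "manual" flow (not OpenAI-compliant): the tool schema is
# injected via the chat template, and the raw completion has to be parsed by hand.
import json
import re

from mlx_lm import load, generate

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Model name is an assumption for illustration.
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What's the weather in Amsterdam?"}],
    tools=tools,
    add_generation_prompt=True,
    tokenize=False,
)
raw = generate(model, tokenizer, prompt=prompt, max_tokens=256)

# The tool call comes back as plain text and must be extracted manually;
# the <tool_call> tag is one common convention, not a universal one.
match = re.search(r"<tool_call>(.*?)</tool_call>", raw, re.DOTALL)
tool_call = json.loads(match.group(1)) if match else None
print(tool_call)
```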

nath1295 commented 1 month ago

I saw that Outlines supports guided decoding with MLX. I think the next step is to make guided decoding available in this server engine first (similar to how vLLM works), and then function calling should not be too hard to implement. In my opinion it is safer to do it with guided decoding than to let the model generate freely. I would be happy to collaborate on that.
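
For reference, guided decoding with Outlines on top of mlx-lm looks roughly like this; the `mlxlm` loader and the model name are assumptions based on the Outlines docs, not this server's code:

```python
# Rough sketch of schema-constrained decoding with Outlines over mlx-lm.
# The point is that the output is forced to match a schema instead of being
# generated freely, which is what makes tool calling reliable.
import outlines
from pydantic import BaseModel


class WeatherQuery(BaseModel):
    city: str
    unit: str


model = outlines.models.mlxlm("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
generator = outlines.generate.json(model, WeatherQuery)

result = generator("Give me a weather query for Amsterdam in celsius as JSON.")
print(result)  # e.g. WeatherQuery(city='Amsterdam', unit='celsius')
```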

vlbosch commented 3 weeks ago

@nath1295 I was working on Outlines support, but I saw you beat me to it and have already merged it. Great work! I was wondering whether you are already busy with tool/function calling as well, before I start working on that myself. I don't mind if you are, or are about to start on it, because my time is limited and I'm still getting to know your codebase.

nath1295 commented 3 weeks ago

@vlbosch I still haven’t started on tool calling, as I am still thinking about how to make it work with different prompt templates. If you have any questions about the codebase, I am happy to answer them! I would really appreciate it if you already have ideas on making tool calling work, and if you have started working on it, I would like to see your PR or code snippets as well.

I think the current problem with integrating tool calling is that the code might not support arbitrary chat template formats. To solve that, I think we have two options:

  1. Implement tool calling without using the original chat templates. Use Outlines to force the LLM to answer the question “Do you need a tool to respond?” with the options “Yes” or “No”. If yes, generate the tool call object with a JSON schema (a rough sketch of this flow follows at the end of this comment).
  2. Implement our own versions of the main mainstream templates and force the model to use one of them, ignoring the original chat template in the tokenizer. I think covering ChatML, Mistral (Llama 2), Gemma, Phi, Llama 3, and Alpaca should be enough.

Of course, if you have any better ideas, please let me know. Thanks for taking a deep dive into my code!
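
As a rough sketch of option 1 (the prompt wording, model loading, and the `GetWeatherCall` schema are placeholders for illustration, not the project's API):

```python
# Option 1 in miniature: first a constrained Yes/No routing step, then a
# schema-constrained generation of the tool call object.
from typing import Literal

import outlines
from pydantic import BaseModel


class WeatherArgs(BaseModel):
    city: str


class GetWeatherCall(BaseModel):
    name: Literal["get_weather"]
    arguments: WeatherArgs


# Model name is an assumption for illustration.
model = outlines.models.mlxlm("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")


def maybe_call_tool(user_message: str, tool_descriptions: str):
    # Step 1: routing question; the model can only answer "Yes" or "No".
    router = outlines.generate.choice(model, ["Yes", "No"])
    needs_tool = router(
        f"Tools available:\n{tool_descriptions}\n\n"
        f"User: {user_message}\nDo you need a tool to respond? Answer Yes or No."
    )
    if needs_tool == "No":
        return None
    # Step 2: the tool call object, constrained to the JSON schema above.
    caller = outlines.generate.json(model, GetWeatherCall)
    return caller(
        f"Tools available:\n{tool_descriptions}\n\n"
        f"User: {user_message}\nProduce the tool call as JSON."
    )
```

The upside of this approach is that it works with any chat template, since the routing and the tool call are both generated outside the template's own tool-calling syntax.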

nath1295 commented 2 weeks ago

Tool calling is now supported as of the latest update, v0.1.0, along with batch inference. Closing this issue.
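
For anyone landing here, client-side usage should look roughly like the sketch below once the server is running; the base URL, port, and model name are assumptions, so check the README for the actual serve command and defaults:

```python
# Hypothetical client-side call against a locally running MLX-Textgen server,
# using the standard OpenAI Python client with the tools parameter.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    messages=[{"role": "user", "content": "What's the weather in Amsterdam?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```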