puppetm4st3r commented 9 months ago

System Info

Hello, first of all, congratulate you for the great work you have done. I have tested the tool selection feature with the official open ai client and docker image 1.4.3, and I have noticed that the tool selection result does not talk to the open ai api result specification.

Information

[X] Docker
[ ] The CLI directly

Tasks

[X] An officially supported command
[ ] My own modifications

Reproduction

send a request to the tgi 1.4.3 docker

tools = [
    {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use. Infer this from the users location.",
                },
            },
            "required": ["location", "format"],
        },
    },
    },
    {
    "type": "function",
    "function": {
        "name": "get_n_day_weather_forecast",
        "description": "Get an N-day weather forecast",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use. Infer this from the users location.",
                },
                "num_days": {
                    "type": "integer",
                    "description": "The number of days to forecast",
                },
            },
            "required": ["location", "format", "num_days"],
        },
    },
    }
]

# Initialize the client, pointing it to one of the available models
client = OpenAI(
  base_url="http://llm_server:3000/v1",
  api_key="_"
)

    from openai import OpenAI
# NOTE: tools defined above and removed for brevity
chat_completion = client.chat.completions.create(
  model="tgi",
  messages=[
      {
      "role": "system",
      "content": "Don't make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous.",
      },
      {
      "role": "user",
      "content": "What's the weather like the next 3 days in San Francisco, CA?",
      },
  ],
  tools=tools,
  tool_choice="auto",  # tool selected by model
  max_tokens=500,
)

called = chat_completion.choices[0].message.tool_calls
print(called)

the output is: {'id': 0, 'type': 'function', 'function': {'description': None, 'name': 'tools', 'parameters': {'format': 'celsius', 'location': 'San Francisco, CA', 'num_days': 3}}}

The id is always 0, the description is always None and the name is always tools, It is very difficult to determine which function was actually selected by the model.

Edit: After debugging the code a little, it seems that the problem is in the grammar that is passed to the model to infer the function call. This only allows generating the call parameters but does not include the name of the function that has been selected.

The output from TGI container: 2024-03-05T19:02:25.663524Z INFO chat_completions:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(500), return_full_text: None, stop: [], truncate: None, watermark: false, details: true, decoder_input_details: true, seed: None, top_n_tokens: None, grammar: Some(Json(Object {"$functions": Object {"get_current_weather": Object {"properties": Object {"format": Object {"description": String("The temperature unit to use. Infer this from the users location."), "enum": Array [String("celsius"), String("fahrenheit")], "type": String("string")}, "location": Object {"description": String("The city and state, e.g. San Francisco, CA"), "type": String("string")}}, "required": Array [String("location"), String("format")], "type": String("object")}, "get_n_day_weather_forecast": Object {"properties": Object {"format": Object {"description": String("The temperature unit to use. Infer this from the users location."), "enum": Array [String("celsius"), String("fahrenheit")], "type": String("string")}, "location": Object {"description": String("The city and state, e.g. San Francisco, CA"), "type": String("string")}, "num_days": Object {"description": String("The number of days to forecast"), "type": String("integer")}}, "required": Array [String("location"), String("format"), String("num_days")], "type": String("object")}}, "properties": Object {"function": Object {"anyOf": Array [Object {"$ref": String("#/$functions/get_current_weather")}, Object {"$ref": String("#/$functions/get_n_day_weather_forecast")}]}}})) } total_time="966.206292ms" validation_time="913.106µs" queue_time="27.862µs" inference_time="965.265444ms" time_per_token="37.125594ms" seed="Some(3959124113330451421)"}: text_generation_router::server: router/src/server.rs:305: Success

Expected behavior

a response accord to the open ai specification in order to get the selected tool in a straightforward manner, expected an ChatCompletionMessage object like:

ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_2PArU89L2uf4uIzRqnph4SrN', function=Function(arguments='{\n "location": "Glasgow, Scotland",\n "format": "celsius"\n}', name='get_current_weather'), type='function')])

in the open ai output schema the parameters are named as arguments, thats relevant in orden to support and do not break the use of OpenAI clients or integrations thats are waiting for that field in the request response.

puppetm4st3r commented 9 months ago

I'm trying to solve the problem, making sure to get as close as possible to open ai's response schema in tools/function calling api. It's my first time in Rust but I already managed to compile the solution with my changes, now let's try it and I'll tell you to see if you can guide me on the steps to follow to make a PR :)

puppetm4st3r commented 9 months ago

I solved (I think) but its my first time with Rust and I couldn't get the code to work on my local virtual env (instructions from readme.md didn't work), so I modified the code and ran it across the docker build and the container worked according to the Open AI specification, could you guide me (@drbh ) how to proceed? the regular pipeline to be able to execute locally and the tests to be able to do the PR as appropriate way.

now the ouput for:

from openai import OpenAI
tools = [
      {
          "type": "function",
          "function": {
              "name": "get_current_weather",
              "description": "Get the current weather",
              "parameters": {
                  "type": "object",
                  "properties": {
                      "location": {
                          "type": "string",
                          "description": "The city and state, e.g. San Francisco, CA",
                      },
                      "format": {
                          "type": "string",
                          "enum": ["celsius", "fahrenheit"],
                          "description": "The temperature unit to use. Infer this from the users location.",
                      },
                  },
                  "required": ["location", "format"],
              },
          },
      },
      {
          "type": "function",
          "function": {
              "name": "get_n_day_weather_forecast",
              "description": "Get an N-day weather forecast",
              "parameters": {
                  "type": "object",
                  "properties": {
                      "location": {
                          "type": "string",
                          "description": "The city and state, e.g. San Francisco, CA",
                      },
                      "format": {
                          "type": "string",
                          "enum": ["celsius", "fahrenheit"],
                          "description": "The temperature unit to use. Infer this from the users location.",
                      },
                      "num_days": {
                          "type": "integer",
                          "description": "The number of days to forecast",
                      },
                  },
                  "required": ["location", "format", "num_days"],
              },
          },
      }
  ]
# Initialize the client, pointing it to one of the available models
client = OpenAI(
    base_url="http://llm_server:3000/v1",
    api_key="_"
)

# NOTE: tools defined above and removed for brevity

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {
            "role": "system",
            "content": "Don't make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous.",
        },
        {
            "role": "user",
            "content": "What's the weather like the next 3 days in San Francisco, CA?",
        },
    ],
    tools=tools,
    tool_choice="auto",  # tool selected by model
    max_tokens=500,
)

called = chat_completion.choices[0].message.tool_calls
print(called)

code output: [ChatCompletionMessageToolCall(id='0', function=Function(arguments='{"format":"fahrenheit","location":"San Francisco, CA","num_days":3}', name='get_n_day_weather_forecast'), type='function')]

open ai specs from docs outputs: [ChatCompletionMessageToolCall(id='call_ujD1NwPxzeOSCbgw2NOabOin', function=Function(arguments='{\n "location": "Glasgow, Scotland",\n "format": "celsius",\n "num_days": 5\n}', name='get_n_day_weather_forecast'), type='function')]

llm raw output from TGI debug tracing:

{'id': 0, 'type': 'function', 'function': {'name': 'get_n_day_weather_forecast', 'arguments': {'format': 'celsius', 'location': 'San Francisco, CA', 'num_days': 3}}}

I did not touch the streaming methods and the objects ChatCompletionChunk and ChatCompletionDelta, my rust uderstanding is quiet pretty basic.

puppetm4st3r commented 9 months ago

Since forcing the LLM output does not allow the LLM to give conversational feedback, I additionally added the construction of a function that allows the LLM to report a run error when trying to select a tool either because required parameters are missing, or there is no tool that can Comply with the user's request, this allows controlling the precise execution of the tools or not executing them if it is not possible. This was implemented by modifying the default tools prompt, moving the json schema and tool selection instructions presented in the prompt to the last user role prompt (to better guide the LLM).

the new instruction prompt is: Instructions for Tool Selection and Execution:\n1) Tools definitions: You will be presented with a JSON schema representing a set of tools and their execution constraints, intended for responding to user requests.\n2) Direct Matching Required: Select a tool that matches the user's request based on explicitly provided information. Avoid making assumptions about the user's intentions. The selected tool must directly address the request as specified, without inferring additional user intentions.\n3) Handling Incomplete Requests: If the user's request lacks sufficient detail to make a clear tool selection:\n - Do not guess or infer missing parameters.\n - Notify the situation with an error message detailing what specific information is missing.\n4) Error Reporting: If it's determined that no available tools can appropriately respond to the user's request due to missing or mismatched information, report this with an error message explaining in detail the discrepancy and why tool execution isn't possible.\n\nJSON Schema:\n

and the final prompt after applying chat template (with chatml) look like these:

<|im_start|>system
Please resolve the user's request, if it is not possible to resolve the request then report an error.<|im_end|>
<|im_start|>user
User request: Paris temperature today

---------------------------
Instructions for Tool Selection and Execution:
1) Tools definitions: You will be presented with a JSON schema representing a set of tools and their execution constraints, intended for responding to user requests.
2) Direct Matching Required: Select a tool that matches the user's request based on explicitly provided information. Avoid making assumptions about the user's intentions. The selected tool must directly address the request as specified, without inferring additional user intentions.
3) Handling Incomplete Requests: If the user's request lacks sufficient detail to make a clear tool selection:
   - Do not guess or infer missing parameters.
   - Notify the situation with an error message detailing what specific information is missing.
4) Error Reporting: If it's determined that no available tools can appropriately respond to the user's request due to missing or mismatched information, report this with an error message explaining in detail the discrepancy and why tool execution isn't possible.
de paris hoy
JSON Schema:
{"$functions":{"get_current_weather_by_city":{"description":"Given a city gets the current weather from","properties":{"format":{"description":"The temperature unit to use. Infer this from city","enum":["celsius","fahrenheit"],"type":"string"},"location":{"description":"The city name from a valid country (Only city names are valid inputs).","type":"string"},"name":{"const":"get_current_weather_by_city","description":"The name of the function","type":"string"}},"required":["city","format"],"type":"object"},"get_n_day_weather_forecast_by_city":{"description":"Given a city gets an N-day weather forecast","properties":{"format":{"description":"The temperature unit to use. Infer this from the city","enum":["celsius","fahrenheit"],"type":"string"},"location":{"description":"The city name from a valid country (Only city names are valid inputs)","type":"string"},"name":{"const":"get_n_day_weather_forecast_by_city","description":"The name of the function","type":"string"},"num_days":{"description":"The number of days to forecast","type":"integer"}},"required":["city","format","num_days"],"type":"object"},"notify_error":{"description":"Useful to notify when a tool can not be called.","properties":{"error":{"description":"The error or issue to notify","type":"string"}},"required":["error","language"],"type":"object"}},"properties":{"function":{"anyOf":[{"$ref":"#/$functions/get_current_weather_by_city"},{"$ref":"#/$functions/get_n_day_weather_forecast_by_city"},{"$ref":"#/$functions/notify_error"}]}},"required":["function"]}
---------------------------<|im_end|>
<|im_start|>assistant

The llm response is:

{
  "function": {
    "format": "celsius",
    "location": "Paris",
    "name": "get_current_weather_by_city"
  }
}

If I ask for the temperature in the moon for example, the llm response is:

{
  "function": {
    "error": "The request cannot be resolved with the available tools. The request requests the temperature on the moon, but the available tools can only provide weather information for cities on Earth. Please try again with a city on the earth."
  }
}

drbh commented 9 months ago

hi @puppetm4st3r thank you for noting this issue, in order to open a PR you'll need to fork the repo and open a PR from your fork to this repo.

In order to run TGI locally you'll need to build everything and run the text-generation-launcher binary. Please see the local installation instructions on the readme: https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#local-install

puppetm4st3r commented 9 months ago

thanks, will try! is my first intent of contribution in github, regards!

maziyarpanahi commented 9 months ago

Thanks @puppetm4st3r for your great work! I've been using grammar feature in Llama.cpp for function calling since the beginning and I can't wait to integrate this into my TGI setup. Appreciate your contributions here!

puppetm4st3r commented 8 months ago

I'm fine tunning some details after send the PR, but I have realized that for more complex functions (in a production environment) the models are very sensitive to the prompt engineer of both: the descriptions of the tool's json schema and the tooling_prompt, I Have tried many models of different sizes, in 7B it is a disaster and practically does not work (a lot of allucinations in the selected tool parameters), in mixtral 8x7 flavors didnt work well, in 34b it works with some errors, and in a 34bx2 moe there are already an aceptable quality that could be brought to a productive environment. Also I do some experiments in order to mimics more the open AI behaviour for the LLM to respond in a natural languaje form when there is no need to call a function, but it was a mess a lot of confusion by the LLM and did not work in a 70b, maybe with a bigger model but I dont have access to more VRAM so mi limit is 70B

The conclusion is to 100% mimics the open ai function calling there is a lot of work to done and challenges to solve !

@maziyarpanahi I'm planning for my production solution use TGI with 2 models, 1 for NLP, and other for function calling but without guidance, some thing like gorilla-llm/gorilla-openfunctions-v2

maziyarpanahi commented 8 months ago

@maziyarpanahi I'm planning for my production solution use TGI with 2 models, 1 for NLP, and other for function calling but without guidance, some thing like gorilla-llm/gorilla-openfunctions-v2

Thanks @puppetm4st3r for the detailed reply. Are there some examples that failed badly which they shouldn't have? Is the failure due to the model's weakness or a bug in enforcing the grammar? (does the test pass in other serving platform like Llama.cpp with a JSON grammar?)

puppetm4st3r commented 8 months ago

I think is the way of enforcing, small models did not work well inclusive with directly use of guidance frameworks, so in my experience by now in my use cases is better a 7B finetuned model for function calling compared with a 7B forced grammar model.

but by the other side a fintuned 7B for function calling lacks of enough reasoning for complex task, so you maybe can use that for simple taks like send an email, or query a simple sql table.

My best results in terms of quality/cost was to force grammars with a good 34b model, anything lower didnt work well for me, also tried a 72B but with 2k context len because my setup has 48gb, i'm waiting for the new 4bit kv cache from exllama2 to be included in the inference servers, that would allow us to run larger models with larger contexts on consumer GPUs...

By now I'm build a gptq quant to try the gorilla-llm/gorilla-openfunctions-v2 at 4 bits for func calling and a 34b for inference, with 2 TGI instances.

The bad performance of 7B is reflected in a lot of allucination and guessing of the function parameters, also they respect the forced grammar in terms of structure, but the content is a mess if your prompt is not a poem of perfect request definition.

puppetm4st3r commented 8 months ago

@maziyarpanahi it could be helpful for you https://medium.com/@prudant/enabling-function-calling-with-gorilla-llm-gorilla-openfunctions-v2-using-the-openai-protocol-355492d0587d , my last simple but efficient implementation of local function calling

maziyarpanahi commented 8 months ago

@maziyarpanahi it could be helpful for you https://medium.com/@prudant/enabling-function-calling-with-gorilla-llm-gorilla-openfunctions-v2-using-the-openai-protocol-355492d0587d , my last simple but efficient implementation of local function calling

Thanks @puppetm4st3r for sharing that post. I will start using TGI with grammar this week and compare it with Llama.cpp for function calling. (I mainly use 70B in 16-bit)

jphme commented 8 months ago

Maybe relevant here as well - I just commented under the PR ( https://github.com/huggingface/text-generation-inference/pull/1587#issuecomment-1997314497 ):

To make smaller models useful, It would be very beneficial to add a proper documentation for the function definition and function call format (when serialized to strings / in the prompt). Model creators could use this format for finetuning - currently it´s a huge issue that there is no standardized format and everyone does their own (and needs additional wrappers so code doesn't work with OpenAI compatible libs out-of-the-box, as the standard inference stacks don't support it).

We had just yesterday a discussion with Teknium on that, as Nous released a new model with a custom function format (as we did at DiscoResearch in the past).

vibhorag101 commented 8 months ago

I am facing the same issue. The incorrect tools format as per OpenAI specs breaks the compatibility with the instructor package. Hope the PR fixes this.

joumenharzli commented 8 months ago

If anyone is having constantly "tools" as a function name, it's not related to the model you are using, there is a bug here https://github.com/huggingface/text-generation-inference/blob/7eb3d75df183ccdf4d0389095f6a1700cb5da52e/router/src/server.rs#L953

drbh commented 8 months ago

Update

hi @puppetm4st3r the tool response type has been updated in this PR https://github.com/huggingface/text-generation-inference/pull/1650

Further discussion

If anyone is having constantly "tools" as a function name, it's not related to the model you are using, there is a bug here

https://github.com/huggingface/text-generation-inference/blob/7eb3d75df183ccdf4d0389095f6a1700cb5da52e/router/src/server.rs#L953

regarding the tool name, this is due to how functions are constrained in TGI and a larger discussion has been open here: https://github.com/huggingface/text-generation-inference/issues/1657

huggingface / text-generation-inference

Incorrect output for tools (function calling) accord to openai specs #1624

System Info

Information

Tasks

Reproduction

Expected behavior

I did not touch the streaming methods and the objects ChatCompletionChunk and ChatCompletionDelta, my rust uderstanding is quiet pretty basic.

Update

Further discussion