home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io

Add LLM tools #115464

Closed: Shulyaka closed this pull request 2 days ago

Shulyaka commented 1 month ago

Proposed change

This PR adds an additional helper to support adding a function calling feature to LLM integrations such as OpenAI Conversation and Google Generative AI Conversation. The currently provided tools are based on intents.

Additional information in this architecture discussion: https://github.com/home-assistant/architecture/discussions/1068

synesthesiam commented 1 month ago

@Shulyaka I think there's more work to be done with this PR, but I want to make sure I understand the flow of things correctly.

To start, I see that each intent is automatically registered as a tool with its slot schema as arguments. In the future, we should consider exposing the information from intents.yaml to HA through the home-assistant-intents package so that the descriptions are more useful for the LLM.

I am a bit worried about trying to automatically convert intent slot schemas. Some slots may be set based on context, such as a voice satellite's current area/floor. So we may need to manually annotate slots somewhere as settable by an LLM.

Next, it seems that an LLM integration is expected to pass the tool specifications to its internal API. Do you think the OpenAI-style specification will work as well for function calling models? I get that this is easy to pass into an OpenAI-compatible API, but it seems like it would be better for tools to describe themselves with a strongly-typed structure instead of just a dict. I assume that most local AI options will just copy OpenAI anyways, though, so maybe it's not worth it.
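
For reference, an OpenAI-style function tool specification is roughly a dict like the one below (the intent name and fields are only illustrative):

# Roughly what gets passed as one entry of the tools list in an
# OpenAI-compatible chat completion request.
tool_spec = {
    "type": "function",
    "function": {
        "name": "HassTurnOn",
        "description": "Turns on a device or entity",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "Name of the entity to turn on",
                }
            },
            "required": ["name"],
        },
    },
}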

Crazy idea: registering a Python function itself as a tool and automatically generating a spec based on its signature. You don't get all the advanced features like min/max for integers, but that might just be something that can go in the function's docstring.

Lastly, the LLM integration needs to detect the tool call, and then call the tool with the arguments as a JSON string. I think it would be better to have separate functions for looking up a tool, validating the tool arguments, and calling the tool. The LLM integration should also be responsible for parsing the JSON, since in all cases the integration knows best how to respond to the different error conditions.

Another problem is needing to pass in all those extra parameters to async_call_tool (like context, text_input and assistant) that don't make sense for every tool, but are ultimately needed to run the intent handler. It feels like we need some other mechanism here to give some tools additional information that's only known when they're about to be called. But I'm not yet seeing how we could do that without requiring every LLM integration to collect a bunch of information just in case specific tools are called :thinking:

Shulyaka commented 1 month ago

Hi! Thanks for reviewing!

I am a bit worried about trying to automatically convert intent slot schemas. Some slots may be set based on context, such as a voice satellite's current area/floor. So we may need to manually annotate slots somewhere as settable by an LLM.

About the context: my expectation is that LLMs will handle the context themselves. We will let them know their device location via the system prompt, and they should set the appropriate default area or floor themselves. The system prompt should instruct them to do that.

About the annotations: this is actually a good point. Some parameters indeed require descriptions for better performance; I am currently making notes during my testing. For example, we should annotate the domain parameter so that it is used more often. But it is in fact possible with voluptuous schemas! vol.Optional and vol.Required support an additional description parameter, and voluptuous-openapi will use it to annotate the parameter. For example, the schema

import voluptuous as vol

vol.Schema({
    # The description keyword is picked up by voluptuous-openapi
    vol.Required("volume", description="Volume in percent"): vol.Coerce(int)
})

would be translated into

{
    "type": "object",
    "properties": {
        "volume": {
            "type": "integer",
            "description": "Volume in percent",
        },
    },
    "required": ["volume"],
}

So I am also going to add descriptions to certain intent schemas.
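
For illustration, a minimal sketch of that conversion, assuming the voluptuous-openapi package's convert helper:

import voluptuous as vol
from voluptuous_openapi import convert

schema = vol.Schema({
    vol.Required("volume", description="Volume in percent"): vol.Coerce(int)
})

# convert() walks the voluptuous schema and returns the JSON-schema-style
# dict shown above, carrying the description through.
print(convert(schema))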

Shulyaka commented 1 month ago

Next, it seems that an LLM integration is expected to pass the tool specifications to its internal API. Do you think the OpenAI-style specification will work as well for function calling models? I get that this is easy to pass into an OpenAI-compatible API, but it seems like it would be better for tools to describe themselves with a strongly-typed structure instead of just a dict. I assume that most local AI options will just copy OpenAI anyways, though, so maybe it's not worth it.

Well, two major players (OpenAI and Google) already use the OpenAPI schema (which is based on JSON Schema), so I expect it to become an industry standard, and I chose the same as the native format. The Nexus Raven example scares me, to be honest: are we supposed to do an eval() on an LLM output?!

Shulyaka commented 1 month ago

Crazy idea: registering a Python function itself as a tool and automatically generating a spec based on its signature. You don't get all the advanced features like min/max for integers, but that might just be something that can go in the function's docstring.

I thought about it as well. We can inspect a function name, docstring, parameter names, and their type annotations. The tricky part is the parameter descriptions: there is no single standard way to define those, so it would involve some parsing of the function's docstring.

It is still possible with the current design. If we want this in the future, we can inherit the llm.Tool class and implement the generation of the JSON schema from Python functions.
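
If we ever do that, generating the spec from a signature could look roughly like this (a sketch only; function_to_spec and set_volume are hypothetical names, not part of this PR):

import inspect
from typing import get_type_hints

# Map basic Python annotations to JSON schema types
_TYPE_MAP = {int: "integer", float: "number", str: "string", bool: "boolean"}

def function_to_spec(func) -> dict:
    """Build a minimal OpenAPI-style tool spec from a function signature."""
    hints = get_type_hints(func)
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": _TYPE_MAP.get(hints.get(name, str), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }

def set_volume(volume: int) -> None:
    """Set the playback volume in percent."""

# function_to_spec(set_volume)["parameters"]["properties"]
# -> {"volume": {"type": "integer"}}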

Shulyaka commented 1 month ago

Lastly, the LLM integration needs to detect the tool call, and then call the tool with the arguments as a JSON string. I think it would be better to have separate functions for looking up a tool, validating the tool arguments, and calling the tool. The LLM integration should also be responsible for parsing the JSON, since in all cases the integration knows best how to respond to the different error conditions.

The idea behind the single function is to capture all exceptions, including JSON parsing errors and a wrong number of arguments, and send the error descriptions back to the LLM as the response. All these errors are more tool-related than integration-related, so I thought it was appropriate. What kind of errors does the integration know best how to respond to?
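
Roughly, the idea is something like the following sketch (names and attributes here are illustrative, not the exact helper API):

import json

async def async_call_tool(tool, tool_args_json: str) -> dict:
    """Run a tool and turn any failure into a JSON-serializable error payload."""
    try:
        args = json.loads(tool_args_json)   # the model may emit malformed JSON
        args = tool.parameters(args)        # assumed: parameters is a voluptuous schema
        return await tool.async_call(args)  # errors raised by the tool implementation
    except Exception as err:  # deliberately broad: report every error back to the LLM
        return {"error": type(err).__name__, "error_text": str(err)}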

Shulyaka commented 1 month ago

Another problem is needing to pass in all those extra parameters to async_call_tool (like context, text_input and assistant) that don't make sense for every tool, but are ultimately needed to run the intent handler. It feels like we need some other mechanism here to give some tools additional information that's only known when they're about to be called. But I'm not yet seeing how we could do that without requiring every LLM integration to collect a bunch of information just in case specific tools are called :thinking:

Good point. I also don't like a lot of arguments, and I also don't know a better way. They are not too harmful, though. The hass and context are probably needed by most functions. As for the rest, if intents need them, then maybe some other functions in theory would too, so maybe it is OK. The text_input, language, and assistant are not really needed; we can remove them and replace them with dummy values. They are there only for the sake of completeness :)

Shulyaka commented 1 month ago

Crazy idea: registering a Python function itself as a tool and automatically generating a spec based on its signature. You don't get all the advanced features like min/max for integers, but that might just be something that can go in the function's docstring.

Here is an example of creating an assistant tool from a Python function: https://github.com/phidatahq/phidata/blob/01727b8e4c2cafb1d6c3c4b165d4815d4dc36d05/phi/tools/function.py#L27

home-assistant[bot] commented 3 weeks ago

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks :+1:

Learn more about our pull request process.

Shulyaka commented 2 weeks ago

Since there are several comments about the intents, let me address them in a separate comment. The intents are quite a challenge.

  1. We need to know their slot schema in advance in order to advertise it to the LLM.
  2. IntentHandler.slot_schema is not the actual schema they use. What they use is _slot_schema, which adds an additional {value: validator} layer. For example, the following slot_schema:
    {
        "area": str,
        "domain": str,
    }

     becomes the following _slot_schema when it comes to input validation:

    {
        "area": {"value": str},
        "domain": {"value": str},
    }
  3. We don't want this extra layer of parameters for our LLM tool because it is redundant and because some LLMs can't handle complex objects.
  4. For some intents, the slot_schema is None and _slot_schema throws an exception.
  5. Some intents (the DynamicServiceIntentHandler) add additional slots (extra_slots) to the mix when calculating the _slot_schema; these are important but not present in slot_schema.

I can see 3 approaches to this:

  1. Refactor the intents. It could be too late for that, but maybe something could be done here, for example adding a method to each intent class that returns an effective schema with all extra fields, but without extra layers or exceptions.
  2. Pre-process the slot_schema externally to the IntentTool constructor. This is what I did initially, but it gets into the specifics of particular intent handler implementations.
  3. Use the _slot_schema as the most reliable source and then remove the extra layer (in the constructor); see the sketch after this list. This is what I have done now.
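
A minimal illustration of approach 3, using the simplified dicts from the list above (flatten_slot_schema is a hypothetical name):

def flatten_slot_schema(slot_schema: dict) -> dict:
    """Strip the extra {"value": validator} layer added by _slot_schema."""
    return {slot: wrapper["value"] for slot, wrapper in slot_schema.items()}

# flatten_slot_schema({"area": {"value": str}, "domain": {"value": str}})
# -> {"area": str, "domain": str}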

P.S. After writing all of this I realized it might be best to try the first approach. Give me some time to do it.

Shulyaka commented 2 weeks ago

I've made the above change here and in a separate PR: https://github.com/home-assistant/core/pull/116789. I will rebase this PR once https://github.com/home-assistant/core/pull/116789 is merged.

Shulyaka commented 6 days ago

There are 4 questions left.

  1. Whether @synesthesiam approves #116789
  2. Whether we need a response type, like IntentResponse. IMO, there is not much we can put there, so having a type with a single field response or result looks redundant, but I am open to opinions. The errors can be reported by raising exceptions.
  3. Whether we need an additional method to verify the tool applicability. I feel like we do, but not in this PR.
  4. How do we load the intent component? It is not loaded automatically. Should we assume the conversation agent depends on it (if it wants to use the basic intents), dynamically load it, or load it automatically somehow?

allenporter commented 6 days ago

  2. Whether we need a response type, like IntentResponse. IMO, there is not much we can put there, so having a type with a single field response or result looks redundant, but I am open to opinions. The errors can be reported by raising exceptions.

The reason for this is to ensure the type is JsonObjectType | None, since you require the response to be JSON serializable.

  4. How do we load the intent component? It is not loaded automatically. Should we assume the conversation agent depends on it (if it wants to use the basic intents), dynamically load it, or load it automatically somehow?

My best guess would be adding it in after_dependencies.
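
For example, a hypothetical manifest.json fragment for an LLM conversation integration (the domain and name are placeholders):

{
  "domain": "my_llm_conversation",
  "name": "My LLM Conversation",
  "dependencies": ["conversation"],
  "after_dependencies": ["intent"]
}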

Shulyaka commented 6 days ago

The reason for this is to ensure the type is JsonObjectType | None, since you require the response to be JSON serializable.

We can also use a return type annotation for this.
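
For instance, a sketch of such an annotation (assuming JsonObjectType from homeassistant.util.json; the class and method names are illustrative):

from homeassistant.util.json import JsonObjectType

class MyTool:
    """Illustrative tool whose return annotation encodes the contract
    that results must be JSON serializable (or None)."""

    async def async_call(self, **tool_args) -> JsonObjectType | None:
        return {"result": "ok"}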

My best guess would be adding it in after_dependencies.

It would be nice to only load them on the first interaction, but unless it has to be in the helpers.llm lib itself, let's discuss it when we have a PR for the conversation integrations.