Open TillSimon opened 1 year ago
I'm wondering how this compares to OpenAI's function calling, as that's also made "to more reliably get structured data back from the model." I see that TypeChat aims to be model-agnostic and lets me pass in TS types. How does the quality of the answers compare? OpenAI fine-tuned its models for function calling. Are TypeChat results as reliable? Or could the two be combined?
+1 to subscribe to this thread 🧵
+1 to subscribe as well
It appears to be a comprehensive open-source alternative to OpenAI's function calling. I will attempt to integrate TypeChat with this repository (https://github.com/JohannLai/openai-function-calling-tools) in order to make the Llama 2 model work similarly to the OpenAI 0613 release.
This is a great question, and it depends on your use-cases.
First off, function calling is currently specific to OpenAI's GPT-3.5 and GPT-4 models, whereas TypeChat programs can theoretically be generated by any sufficiently powerful model (and as usual, I'll give the caveat that the best models here tend to be trained on a mix of both natural language prose and source code). As mentioned, you could use something like Llama 2 or anything else you want.
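To make the model-agnostic point concrete, here's a minimal sketch of wiring a non-OpenAI model into TypeChat by implementing its `TypeChatLanguageModel` interface. The local endpoint URL, response shape, and the schema below are all hypothetical; only the `complete`/`success`/`error` pieces come from the typechat package as I understand its exports, and signatures may differ between versions.

```typescript
import { createJsonTranslator, success, error, TypeChatLanguageModel } from "typechat";

// Hypothetical local Llama 2 server; adjust the URL and response shape to whatever
// completion endpoint you actually run.
const llamaModel: TypeChatLanguageModel = {
    async complete(prompt: string) {
        try {
            const response = await fetch("http://localhost:8080/completion", {
                method: "POST",
                headers: { "Content-Type": "application/json" },
                body: JSON.stringify({ prompt, temperature: 0 }),
            });
            const json = (await response.json()) as { content: string };
            return success(json.content);
        } catch (e) {
            return error(`model request failed: ${e}`);
        }
    },
};

// Illustrative schema and matching TypeScript type; a real app would keep the schema
// in its own .ts file.
interface SentimentResponse {
    sentiment: "negative" | "neutral" | "positive";
}
const schema = `export interface SentimentResponse {
    sentiment: "negative" | "neutral" | "positive";
}`;

const translator = createJsonTranslator<SentimentResponse>(llamaModel, schema, "SentimentResponse");
const result = await translator.translate("that was a fantastic show");
console.log(result.success ? result.data.sentiment : result.message);
```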
Second, as of this writing, it appears that `gpt-3.5-turbo-0613` and `gpt-4-0613` can only specify a single function call per response. That means that each subsequent step would require another request to the underlying model. With TypeChat, on the other hand, you can typically get the whole chain of steps composed together in a single response. Of course, the single-function-call-per-response restriction could change in the future.
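For illustration, a single response to a request like "find Bach songs and add the classical ones to my Favorites playlist" can come back as one program whose later steps consume earlier results. The `@steps`/`@func`/`@args`/`@ref` keys follow TypeChat's Program schema as I read it in `program.ts`; the API function names here are made up.

```typescript
// One model response, several dependent steps; "@ref" points at the result of an
// earlier step by index, so no extra round trips to the model are needed.
const program = {
    "@steps": [
        { "@func": "searchSongs", "@args": ["Bach"] },
        { "@func": "filterByGenre", "@args": [{ "@ref": 0 }, "classical"] },
        { "@func": "addToPlaylist", "@args": [{ "@ref": 1 }, "Favorites"] },
    ],
};
```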
Finally, there's one more key difference, which is that TypeChat's program generation mechanism is both driven by types and can be validated against those types. This ensures some level of correctness between steps, which is crucial if you want to provide guardrails for the sorts of responses your models will generate.
All this said, the OpenAI functions seem pretty great. We'd love to see if there are good integration points that make sense!
Finally, there's one more key difference, which is that TypeChat's program generation mechanism is both driven by types and can be validated against those types. This could help ensure some level of correctness between steps, which is crucial if you want to provide guardrails for the sorts of responses your models will generate. Currently TypeChat does not do this, but it could be made to do so.
Actually, we do validate type correctness between steps. Unless I'm misunderstanding what you're saying.
Here are the specific levels of structure and type checking currently done by the TypeChat library. First, the function `translate` in the file `typechat.ts` verifies that the model output has correct JSON syntax. If it has valid JSON, `translate` then calls `validator.validate` to verify that the JSON output matches the schema type specified in the `typeName` parameter to the function `createJsonTranslator`. In the case of program output, the validator first checks that the structure of the JSON matches the `Program` JSON type specified in `programSchemaText` in `program.ts`. Then the validator translates the JSON into a TypeScript program and uses the TypeScript compiler to type check this program. If all of these checks pass, `translate` returns a strongly-typed object for further processing. If a check fails, there is an opportunity for automatic repair using the diagnostic information output by the JSON parser or the TypeScript compiler. Currently, the code supports either zero or one shot (controlled by the `attemptRepair` flag passed to the function `createJsonTranslator`) of automatic repair on the type validation checks. A straightforward change would be to enable multiple repair shots on validation and also on the JSON syntax.
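As a rough usage sketch of that pipeline (version details may differ; in some versions `attemptRepair` is a settable property on the translator rather than a parameter, and the schema and type below are purely illustrative):

```typescript
import { createLanguageModel, createJsonTranslator } from "typechat";

// Illustrative schema and matching type; the samples keep the schema in its own .ts file.
interface Cart {
    items: { name: string; quantity: number }[];
}
const schema = `export interface Cart {
    items: { name: string; quantity: number }[];
}`;

const model = createLanguageModel(process.env); // reads the OpenAI/Azure settings
const translator = createJsonTranslator<Cart>(model, schema, "Cart");
translator.attemptRepair = true; // the zero-or-one-shot repair described above

const result = await translator.translate("two tall lattes, one decaf");
if (result.success) {
    // A strongly typed Cart: safe for the app to summarize and confirm with the user.
    console.log(JSON.stringify(result.data, undefined, 2));
} else {
    // Diagnostic text from the JSON parser or the TypeScript compiler.
    console.log(result.message);
}
```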
The reason that TypeChat does all of this checking is related to Till Simon's original question that started this thread about reliability and which methods are most reliable. To make a successful natural language interface, we need two things. First, we need a system that is reliable enough so that it rarely fails. Second, we need a way to detect when the system makes a mistake. By constraining the LLM output to a formal representation and subjecting that representation to multiple levels of checking, we can detect the mistakes. Using auto-repair, we can fix some of the mistakes, increasing the perceived success rate. But most critically, for JSON objects/programs that get through the gauntlet of checking and auto-repair, we have a strongly typed object that we know our application can reliably summarize for the user without further use of a language model. This gives the application the opportunity to show the user what changes it will make and have the user confirm that those changes align with the user's intent.
Because probabilistic systems like language models are never 100% reliable and because human language is inherently highly ambiguous, we need a system of checking and confirming to make sure that permanent changes made through a natural language interface will align with user intent.
A challenge with these systems is to iron out incorrect inference as described above while creating an experience that is delightful to use. The main ways to do this are to respond to the user as fast as possible and to rarely respond with "I didn't understand, please rephrase." In the TypeChat project, we're experimenting with some ways to do this, such as using local parsing for simple commands and tuning auto-repair so that it reduces clarification requests while maintaining good average request latency.
The functions facility in GPT 0613 is great to see, because fine-tuning for JSON output should reduce incorrect inference in JSON generation. For now, we can't directly take advantage of the `function_call` message because it has only one level of nesting, but we have observed that JSON generation is also faster and more accurate in this GPT version, which helps TypeChat.
No amount of fine-tuning, however, will get to 100% reliability, which is why we will always need some form of checking and ironing out of incorrect inference to support NL interfaces to applications that make permanent changes.
Thanks @ahejlsberg I've amended my answer!
Thanks @DanielRosenwasser @steveluc for your in-depth explanation!
Function calling works by looking at the parameters; it doesn't care about the output of each function, meaning you must execute the calls one by one, asking the LLM what to do next. TypeChat is very, very different.
Even though you can specify something like a `ReturnValueSchema` for each function, and the LLM would understand what that means, the token count is considerably lower if you stick to describing your functions as part of the prompt/message rather than in `functions`. IMO TypeChat's approach to describing the API is much better.
That doesn't mean that this project should not use function calling. I think it should use function calling when supported; that way you have the guarantee that you will get well-formed JSON back rather than parsing, or doing gymnastics to understand the response structure. Perhaps there could be some sort of configuration to send a single function called `Program` that expects the `steps` to be provided as its string argument.
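Something along these lines, roughly sketched with the `openai` v4 package (the prompt text, function description, and model output handling are just placeholders):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// One wrapper function whose single string argument carries the serialized program.
// The TypeScript API schema would still travel in the prompt; function calling is
// only used here to guarantee a well-formed JSON envelope.
const completion = await openai.chat.completions.create({
    model: "gpt-4-0613",
    messages: [
        { role: "system", content: "Translate the user request into a Program for the following API: ..." },
        { role: "user", content: "play some Bach and add it to my Favorites playlist" },
    ],
    functions: [
        {
            name: "Program",
            description: "A JSON program whose steps call the application API",
            parameters: {
                type: "object",
                properties: {
                    steps: { type: "string", description: "JSON-encoded array of program steps" },
                },
                required: ["steps"],
            },
        },
    ],
    function_call: { name: "Program" },
});

const args = completion.choices[0].message.function_call?.arguments; // JSON string containing { steps: ... }
```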
We do plan to add functions as an option for cases (like coffeeshop, calendar) in which a JSON object is output as the final intent summary. In those cases, we can posit a single function whose body conforms to the JSON type required by the schema for the application. This can theoretically improve accuracy, although we should all measure on a case-by-case basis because, while the fine-tuning works toward accuracy, the roughly 5X more tokens of the JSON schema spec works against it. And of course, a JSON schema is required. There are npm packages that will convert TypeScript types to JSON schema, or a person can author the schema.
And we agree that in the case of `Program` we could also do it by having a JSON schema for a single `Program` function with a single argument, `steps`. That's a lower priority because if you look at a case like music, you then end up with a mixture of JSON schema and TypeScript schema for the API (because a large API begins to consume many tokens if elaborated as JSON schema, and you don't get the benefit of the additional TypeScript type-flow pass).
Thanks for the clarification, sir.
Any update on this? I was planning to use OpenAI functions with zod-to-json-schema and a retry mechanism that passes Zod validation errors back to the LLM, then found TypeChat.
JSON Schema supports enums, which lets me make sure data conforms to my expected values dynamically, not just via static typing (for example, a list of IDs I want to constrain).
Does TypeChat support this?
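For context, the effect I'm after looks roughly like this: a string literal union would cover the static case, and generating the schema text at runtime is one way I could imagine covering the dynamic case (this is just my sketch, not an official TypeChat feature).

```typescript
// Static case: a string literal union in the schema plays the role of a JSON Schema enum,
// and the TypeScript type check on the model output would enforce it.
const staticSchema = `export interface OrderFilter {
    status: "pending" | "shipped" | "delivered";
}`;

// Dynamic case: IDs only known at runtime. One possibility is to build the schema text
// on the fly before creating the translator.
function makeOrderSchema(validIds: string[]): string {
    const idUnion = validIds.map(id => JSON.stringify(id)).join(" | ");
    return `export interface OrderLookup {
    // Must be one of the ids currently in the database.
    orderId: ${idUnion};
}`;
}

// makeOrderSchema(["A-17", "B-42"]) yields `orderId: "A-17" | "B-42"`.
```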
It seems the new GPT models released yesterday support multiple function calls (also json mode and other updates).
As I understand it, the GPT function call extensions allow a request to be translated into multiple independent function calls. The function calls are not ordered and no data flows between them. So, it's not as expressive as TypeChat's JSON Programs support.
The new JSON mode ensures responses are well-formed JSON, but there is no schema validation. We haven't really had issues with non-JSON responses, but certainly it's nice to get a stronger guarantee.
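For reference, a sketch of what the new JSON mode looks like from the API side (the model name, prompt, and type mentioned in the system message are illustrative):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// JSON mode guarantees syntactically valid JSON, but nothing checks it against a schema,
// so a TypeChat-style validation pass is still what catches responses of the wrong shape.
const completion = await openai.chat.completions.create({
    model: "gpt-4-1106-preview",
    response_format: { type: "json_object" },
    messages: [
        { role: "system", content: "Reply with a JSON object matching the Cart type: ..." },
        { role: "user", content: "two tall lattes, one decaf" },
    ],
});

const raw = completion.choices[0].message.content; // well-formed JSON, but unvalidated
```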