mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

MPT-30B Functions support #379

Closed musabgultekin closed 6 months ago

musabgultekin commented 1 year ago

Hi, I'm working on fine-tuning MPT-30B for function calling. I'm currently still preparing the fine-tuning dataset. AFAIK there is no open-source fine-tuned model with function support (let me know if you know of one).

I'm not sure about the format and wanted to get some advice. Basically, I'm proposing to add "function" and "tool" roles to the original ChatML format and to put the function schema into the system prompt. Here is the starting context: the functions are converted to a schema and placed in a system prompt (we could rename this role to something like "function_definitions"). The model should start generating either an "assistant" or a "function" message after this exact context:

<|im_start|>system
// Plugin for managing a TODO list, you can add, remove and view your TODOs.
namespace todo {

// Add a todo to the list
type addTodo = (_: {
// The todo to add to the list.
todo: string,
}) => any;

} // namespace todo<|im_end|>
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Add X to my todo list<|im_end|>

If it starts with "function", it'll generate the namespace.function call, like this:

<|im_start|>function
todo.addTodo("X")<|im_end|>

Then we can return this to the user, or, as a library, we can actually call the function. The function response's role can be named "tool", like this:

<|im_start|>tool
{"status": "OK"}
<|im_start|>assistant
Okay, I've added!<|im_end|>

Full raw format:

<|im_start|>system
// Plugin for managing a TODO list, you can add, remove and view your TODOs.
namespace todo {

// Add a todo to the list
type addTodo = (_: {
// The todo to add to the list.
todo: string,
}) => any;

} // namespace todo<|im_end|>
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Add X to my todo list<|im_end|>
<|im_start|>function
todo.addTodo("X")<|im_end|>
<|im_start|>tool
{"status": "OK"}
<|im_start|>assistant
Okay, I've added!<|im_end|>

What do you think?

musabgultekin commented 1 year ago

Update:

Instead of using a separate "tool" role, I think it's better to use "function" for the function response, and to mark function calls with different separators, so we can use function calls in between messages.

<|im_start|>system
...FUNCTION_DEFINITION PART...<|im_end|>
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Add X to my todo list<|im_end|>
<|im_start|>assistant
Sure! Let's add it.
<|fn_start|>todo.addTodo
{"text": "X"}<|fn_end|>
<|im_start|>function
{"status": "OK"}
<|im_start|>assistant
Okay, I've added!<|im_end|>
samhavens commented 1 year ago

We are looking into tool use; thank you for sharing your work. I am curious: have you considered, instead of

<|fn_start|>todo.addTodo
{"text": "X"}<|fn_end|>
<|im_start|>function
{"status": "OK"}<|im_end|>

My instinct is to have this be

<|im_start|>system function todo.addTodo
{"text": "X"}
response: {"status": "OK"}<|im_end|>

or something similar, since won't the system block while waiting for the response from the tool anyway? That way you can keep it in one message.

Also I am curious how you came up with the syntax for defining tools.

musabgultekin commented 1 year ago

My initial thinking was to have two EOS tokens, both im_end and fn_end, so that sampling stops at either one. Let me know if you think a single EOS is better for simplicity. Also, since I can define the roles as the owner of each message, it would be much more intuitive to have the assistant own the function call rather than the system. In your revised version there is only a "\n" after the function-call request as far as I can see, which makes it unsuitable as a generation stop, but maybe we could put the im_end after the function-call request in your revised version.
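
For reference, a minimal sketch of stopping generation on either separator with Hugging Face transformers StoppingCriteria; the class name and the decoded tail window are illustrative assumptions, and it assumes both separator strings survive decoding:

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnStrings(StoppingCriteria):
    # Stop generation as soon as any of the given strings appears in the tail.
    def __init__(self, tokenizer, stop_strings):
        self.tokenizer = tokenizer
        self.stop_strings = stop_strings

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Only decode the last few tokens; enough to contain <|im_end|> or <|fn_end|>.
        tail = self.tokenizer.decode(input_ids[0, -8:])
        return any(s in tail for s in self.stop_strings)

# Usage (model/tokenizer loading omitted):
# stopping = StoppingCriteriaList([StopOnStrings(tokenizer, ["<|im_end|>", "<|fn_end|>"])])
# model.generate(**inputs, stopping_criteria=stopping, max_new_tokens=256)
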

The syntax is essentially a simplified TypeScript declaration file (.d.ts). I got the inspiration from ChatGPT plugins. The fact that these models were pretrained on lots of TypeScript "type" declarations makes this powerful enough for them to understand what a type or function signature is; introducing some DSL that is completely different from the pretraining data would make them understand it less well. Also note that functions always take one single typed object parameter, so the model knows it should output an object, not strings/ints separated by some arbitrary separator, when calling the functions.

Here is an extended version, though we don't need the implementation part, so we can drop it:

// Plugin for managing a TODO list, you can add, remove and view your TODOs.
namespace todo {

    // Function to add a todo to the list
    function addTodo(_: { todo: string; }): any {
        // implementation here
    }

} // namespace todo
unaidedelf8777 commented 1 year ago

Have you considered adopting the same function-defining schema that OpenAI uses? Personally, I haven't encountered any issues with that schema, and maintaining consistency might help others, since they wouldn't need to redefine functions that were originally written for OpenAI's function calling when they move them to MPT-30B function calling. Just a thought.

float-trip commented 1 year ago

I'd caution against adding tokens if you can avoid it. I recently fine-tuned MPT-30B on a reddit-style dataset and initially made heavy use of special tokens as separators (<|post_title|>title here<|post_author|>username...). I found that even after training on 150M tokens it would still associate these with their correct meaning only very loosely. Retraining with plaintext separators ([Title] title here [Author] username...) solved the issue completely.

Granted, I was going about it a bit excessively (10 or so custom tokens), but the poor results were surprising and led me to believe that you'll be better off not adding <|fn_start|> and <|fn_end|>, especially if you don't have hundreds of millions of tokens to fine-tune on.

musabgultekin commented 1 year ago

Have you considered adopting the same function-defining schema that OpenAI uses?

@unaidedelf8777 If you mean the JSON Schema object definition, that is something you use at the API level, not necessarily in the model. The JSON Schema definition can be converted to the schema that the model would use, for example:
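
A rough, illustrative sketch of that conversion (the helper name and exact output layout are assumptions, not part of the proposal; the real format may differ):

def json_def_to_ts(namespace: str, description: str, fn: dict) -> str:
    # Map JSON Schema primitive types to TypeScript-ish types.
    ts_types = {"string": "string", "integer": "number", "number": "number", "boolean": "boolean"}
    lines = [f"// {description}", f"namespace {namespace} {{", ""]
    lines.append(f"// {fn.get('description', '')}")
    lines.append(f"type {fn['name']} = (_: {{")
    for name, spec in fn["parameters"]["properties"].items():
        lines.append(f"// {spec.get('description', '')}")
        lines.append(f"{name}: {ts_types.get(spec.get('type'), 'any')},")
    lines.append("}) => any;")
    lines.append("")
    lines.append(f"}} // namespace {namespace}")
    return "\n".join(lines)

# Reproduces the todo.addTodo example from the first message:
# print(json_def_to_ts("todo", "Plugin for managing a TODO list, you can add, remove and view your TODOs.",
#     {"name": "addTodo", "description": "Add a todo to the list",
#      "parameters": {"properties": {"todo": {"type": "string", "description": "The todo to add to the list."}}}}))
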

@float-trip That is very useful information! What about special tokens that we add to the vocabulary? For example, you would have added "<|post_title|>" as a separate token that never existed in the pre-training step but does exist in the fine-tuning step, so the model has no association with "title", but it would know that it is something distinct in the embedding space. The same goes for <|fn_start|> and the others. It could require a custom tokenizer other than the default MPT-30B one, though.

float-trip commented 1 year ago

If I'm understanding right, that's what I did - I modified the tokenizer like this:

from transformers import GPTNeoXTokenizerFast

tokenizer = GPTNeoXTokenizerFast.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    additional_special_tokens=[
        "<|post_title|>",
        "<|post_url|>",
        "<|post_author|>",
        # ...
    ],
)

tokenizer.save_pretrained("tokenizer")

The model was somewhat able to learn the correct meanings for these (especially for the more frequently occurring tokens like <|comment_author|>, <|comment_body|>, etc.). It's possible that adding only a couple of new tokens would be fine and the problem only appears when going overboard, but I'd at least do a test run without any new tokenization to see if it performs better.

musabgultekin commented 1 year ago

Thanks @float-trip, got it. Then we can remove fn_start and instead use im_start with the assistant role (like the OpenAI API, which returns the assistant role on function calls). We can keep the existing tokens we've been using and just attach extra fields to the roles; the roles are basically an arbitrary map, and the ones in the ChatML docs are just examples.

<|im_start|>system
...FUNCTION_DEFINITION PART...<|im_end|>
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Add X to my todo list<|im_end|>
<|im_start|>assistant
Sure! Let's add.
<|im_start|>assistant to=todo.addTodo
{"text": "X"}<|im_end|>
<|im_start|>function name=todo.addTodo
{"status": "OK"}
<|im_start|>assistant
Okay, I've added!<|im_end|>
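
A minimal sketch of rendering a conversation into this format, assuming a simple list-of-dicts message representation (the helper, the field names, and the choice to close every message with <|im_end|> are illustrative, not a spec):

def render(messages: list[dict]) -> str:
    parts = []
    for m in messages:
        header = m["role"]
        if "to" in m:        # assistant message that calls a function
            header += f" to={m['to']}"
        if "name" in m:      # function result message
            header += f" name={m['name']}"
        parts.append(f"<|im_start|>{header}\n{m['content']}<|im_end|>")
    return "\n".join(parts)

# Example:
# render([
#     {"role": "system", "content": "You are a helpful assistant"},
#     {"role": "user", "content": "Add X to my todo list"},
#     {"role": "assistant", "to": "todo.addTodo", "content": '{"text": "X"}'},
#     {"role": "function", "name": "todo.addTodo", "content": '{"status": "OK"}'},
# ])
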
unaidedelf8777 commented 1 year ago

@musabgultekin, yeah, I don't really know anything about the tokenizers. I just finished building out a massive system with the OpenAI schemas (which it constructs dynamically). I just didn't want to rebuild that logic because it's quite complex, and I figured most others migrating wouldn't want to do similar work either.

AlbertMarashi commented 1 year ago

@musabgultekin @samhavens @float-trip @unaidedelf8777

You should all check out this Twitter thread. It forces the LLM to output valid "code" that follows a function/grammar spec by zeroing out the probabilities of tokens that should not appear next. It might be worth looking into; the thread lays out the benefits:

https://twitter.com/GrantSlatton/status/1657559506069463040

musabgultekin commented 1 year ago

If the goal is to produce parsable outputs for a given instruction, it solves the issue, and it's a good solution since it doesn't require further fine-tuning. I really like it.

This would work for function-only models. For example, if you have a loop that always decides what to do, it works: a robot control loop that can only do "go_forward", "go_backwards", "wait", etc. is fine. But if the model needs to decide not to call any function and instead ask follow-up questions, it won't work; fine-tuning is the way to go in that case.

Thanks for the info, @AlbertMarashi! Checking it out.

float-trip commented 1 year ago

@AlbertMarashi For a more general/flexible version of that idea, check out LogitsWarper and LogitsProcessor in HuggingFace's transformers library.
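
A minimal, illustrative sketch of that idea with transformers: a LogitsProcessor that masks every token outside a fixed allow-list, so sampling can only produce one of a small set of commands (the class name and usage are assumptions, not a recipe from the thread above):

import torch
from transformers import LogitsProcessor, LogitsProcessorList

class AllowListProcessor(LogitsProcessor):
    def __init__(self, allowed_token_ids: list[int]):
        self.allowed_token_ids = allowed_token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Send every non-whitelisted token's logit to -inf before sampling.
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_token_ids] = 0.0
        return scores + mask

# Usage (ids would come from tokenizing "go_forward", "go_backwards", "wait", ...):
# processors = LogitsProcessorList([AllowListProcessor(allowed_ids)])
# model.generate(**inputs, logits_processor=processors, max_new_tokens=8)
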

musabgultekin commented 1 year ago

I've finally prepared a good-looking dataset ready for fine-tuning. It has ~5,300 examples across ~500 different schemas, all with different prompts; 50% contain function calls and 50% have function schemas but no calls, to teach the model when not to call the functions.

But 8x A100 40GB gave OOM. I've looked at two clouds for 8x A100 80GB, but there's no availability. I'm going to have to defer training until I find some big GPUs.

float-trip commented 1 year ago

I have a bash script here which provisions an 8x A100 80GB instance from Lambda Labs: https://gist.github.com/float-trip/679019a23f246b17d2dff9e2cf55c387

It generally still takes a few hours, but if you leave it running you'll get it eventually.

musabgultekin commented 1 year ago

Oh amazing! @float-trip Will use it thanks 🙏

unaidedelf8777 commented 1 year ago

I'm working on a dataset of function definitions, prompts to call the functions, example responses from the functions, and a message that the model gives based on the response from the function.

All I did was scrape a ton of OpenAPI schemas from the APIs.guru repository, turn them into function definitions, then devise a GPT prompt to make up prompts that call those functions, example responses from the functions, and model responses based on those function responses.

It is chewing away right now. Will ping y'all when I throw it on Hugging Face… more to come.

unaidedelf8777 commented 1 year ago

I've finally prepared a good-looking dataset ready for fine-tuning. It has ~5,300 examples across ~500 different schemas, all with different prompts; 50% contain function calls and 50% have function schemas but no calls, to teach the model when not to call the functions.

But 8x A100 40GB gave OOM. I've looked at two clouds for 8x A100 80GB, but there's no availability. I'm going to have to defer training until I find some big GPUs.

You should try RunPod.io; their pricing is slightly higher than Lambda's, but they usually have availability for basically everything. It says here that for an A100 SXM they only want $1.44 USD per card-hour. Not too bad in my opinion.

One other thing: if you haven't used GCP yet, they'll give you about $300 of credits for a free trial. You will probably have to get in touch with their team if you want more than one A100, though, but in my experience they're pretty quick to reply.

unaidedelf8777 commented 1 year ago

I'm working on a dataset of function definitions, prompts to call the functions, example responses from the functions, and a message that the model gives based on the response from the function.

All I did was scrape a ton of OpenAPI schemas from the APIs.guru repository, turn them into function definitions, then devise a GPT prompt to make up prompts that call those functions, example responses from the functions, and model responses based on those function responses.

It is chewing away right now. Will ping y'all when I throw it on Hugging Face… more to come.

It will have around 27,000 examples of functions and completions. There's no extra data to teach the model when to call the functions, really, only how to.

musabgultekin commented 1 year ago

@unaidedelf8777 That's great! I'm wondering if you only used one function per example and that's how you reached 27k, since the site has 2.5k APIs. Also, did you try actually making the requests? It would be bad if most of them require auth.

musabgultekin commented 1 year ago

If I'm understanding right, that's what I did - I modified the tokenizer like this:

from transformers import GPTNeoXTokenizerFast

tokenizer = GPTNeoXTokenizerFast.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    additional_special_tokens=[
        "<|post_title|>",
        "<|post_url|>",
        "<|post_author|>",
        # ...
    ],
)

tokenizer.save_pretrained("tokenizer")

The model was somewhat able to learn the correct meanings for these (especially for the more frequently occurring tokens like <|comment_author|>, <|comment_body|>, etc.). It's possible that adding only a couple of new tokens would be fine and the problem only appears when going overboard, but I'd at least do a test run without any new tokenization to see if it performs better.

@float-trip I think you also need to resize the embeddings.

Check out: https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py#L65C42-L65C42

unaidedelf8777 commented 1 year ago

@unaidedelf8777 That's great! I'm wondering if you only used one function per example and that's how you reached 27k, since the site has 2.5k APIs. Also, did you try actually making the requests? It would be bad if most of them require auth.

No, the function responses are synthetic, but they were verified to make sure they have the correct schema. I have the dataset repo set up right now; it has the prompt I used in there. The repo is empty except for a readme and the prompt, because the API is slow and I didn't feel like making it asynchronous. And to answer your question: yes, currently it is only one prompt and one function call, but I plan on iterating and improving the dataset soon enough.

If this dataset picks up any traction, do you think I should make a Patreon, since OpenAI is expensive for this kind of thing?

Also, here's the HF link.

I'll ping y'all when it's finished.

unaidedelf8777 commented 1 year ago

@musabgultekin @AlbertMarashi @float-trip @samhavens

Just uploaded the part of the dataset which is finished; it's only around 1k examples though. More to come.

https://huggingface.co/datasets/unaidedelf87777/openapi-function-invocations-25k/

The preview is also messed up, so it shows the prompt I used instead of the CSV. If someone knows how to fix that, please let me know.

float-trip commented 1 year ago

@float-trip I think you also need to resize the embeddings.

Check out: https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py#L65C42-L65C42

Thanks, this is true in cases where model.config.vocab_size < len(tokenizer). GPT-NeoX-20B and the MPT models leave some extra room in the vocab size for performance reasons.
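
For completeness, a rough sketch of that check-and-resize pattern (the base model name is a placeholder; "tokenizer" is the directory saved above):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the modified tokenizer saved earlier and the base model.
tokenizer = AutoTokenizer.from_pretrained("tokenizer")
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-30b", trust_remote_code=True)

# Only resize if the vocabulary actually outgrew the embedding matrix;
# GPT-NeoX-20B and MPT pad vocab_size, so this is often a no-op for them.
if len(tokenizer) > model.config.vocab_size:
    model.resize_token_embeddings(len(tokenizer))
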

musabgultekin commented 1 year ago

Hi @unaidedelf8777, thank you for the dataset! If possible, can you save your dataset in JSON Lines format (a separate JSON object on each line)? Apparently the CSV has problems because of your data/saving method and isn't valid, so currently it doesn't look usable, unfortunately :(
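
For reference, writing JSON Lines is just one json.dumps per record (field names here are placeholders, not the dataset's actual columns):

import json

records = [{"functions": "...", "prompt": "...", "completion": "..."}]  # placeholder rows
with open("train.jsonl", "w") as f:
    for r in records:
        # One JSON object per line; multi-line string fields stay intact.
        f.write(json.dumps(r) + "\n")
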

unaidedelf8777 commented 1 year ago

Hi @unaidedelf8777, thank you for the dataset! If possible, can you save your dataset in JSON Lines format (a separate JSON object on each line)? Apparently the CSV has problems because of your data/saving method and isn't valid, so currently it doesn't look usable, unfortunately :(

Yeah, sorry about that. It's because some entries span a couple of lines. I have a cleaned version, I just haven't thrown it up there yet. If you look in the data folder of the repo, there are train and validation JSONL files, pre-formatted with the special tokens I described in the dataset card. I also updated to a new repo, so here's the link.

unaidedelf8777 commented 1 year ago

Hi @unaidedelf8777, thank you for the dataset! If possible, can you save your dataset in JSON Lines format (a separate JSON object on each line)? Apparently the CSV has problems because of your data/saving method and isn't valid, so currently it doesn't look usable, unfortunately :(

Yeah, sorry about that. It's because some entries span a couple of lines. I have a cleaned version, I just haven't thrown it up there yet. If you look in the data folder of the repo, there are train and validation JSONL files, pre-formatted with the special tokens I described in the dataset card.

I'm working on a loading script for the dataset right now which will serve the JSONL files. I just don't know how to write it, and GPT doesn't seem to know either, so I'll have to actually look into it ☹️

musabgultekin commented 1 year ago

I've written a new training system that calculates loss only on assistant responses and function calls from the assistant. I used LLaMA 7B as the base with the Hugging Face Trainer, and trained on 37k examples (34k ShareGPT conversations (Wizard-Vicuna uncensored) + 3k GPT-4-generated prompts and function calls). Sadly I wasn't able to find 8x A100 80GB consistently for the training experiments. It was stressful, because I already had the MPT training dataset and code and simply couldn't find the compute to train it; that's why I decided to start with a smaller model. (MPT-7B also failed on the nodes I tried.)

Here is the repo and the inference code that works right now: https://github.com/musabgultekin/functionary/blob/main/inference.ipynb

My preliminary manual tests show that it works on instructions: it knows when to call a function and which one. I'll prepare the codebase; I still need to write an evaluation suite and custom tooling around it, though.

There are issues, of course, like multiple rounds not working properly because the dataset only has one round of function calls per conversation. It also hallucinates when commenting on function outputs for some reason, e.g. adding things that don't exist in the function outputs. I hope we can reduce this by training on 13B, more ShareGPT conversations, and potentially @unaidedelf8777's dataset.

Per @float-trip's suggestion, I haven't introduced new tokens (I'll need the eval suite first, then I'll do the ablation study).

I'll share more details, design choices, and code soon. I feel like the current state is just a proof of concept. Please let me know your suggestions.
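
A rough sketch of the loss-masking part (span boundaries are illustrative; -100 is the label index Hugging Face's causal-LM loss ignores):

import torch

IGNORE_INDEX = -100  # ignored by the default CrossEntropyLoss / causal-LM loss

def build_labels(input_ids: torch.Tensor, assistant_spans: list[tuple[int, int]]) -> torch.Tensor:
    # input_ids: 1-D tensor of token ids for one conversation.
    # assistant_spans: (start, end) token ranges of assistant turns and function calls.
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels
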

Here is how it decides to call functions:

[screenshot]

Here is how it uses the function output:

[screenshot]
unaidedelf8777 commented 1 year ago

@musabgultekin,

I'd recommend using runpod.io. I can consistently get 8x A6000 48GB cards with 1 TB of RAM and 64 CPU cores for about $6 an hour (USD); it's a much better deal than providers like Lambda Labs or Google Cloud.

musabgultekin commented 1 year ago

It's been nearly a month since I started, but I managed to make function calling work properly with LLaMA. Unfortunately not MPT, because of the OOMs I got constantly.

Here is the full repo for inference and details of dataset: https://github.com/musabgultekin/functionary

I'm gonna add training code and more info soon.

@unaidedelf8777 I saw your mpt-7b-CodeCaller-v1. How is it going? At first I trained on single-turn conversations, but that didn't work out well, so I had to use multiple turns in one conversation to make it actually work properly.

unaidedelf8777 commented 1 year ago

It's been nearly a month since I started, but I managed to make function calling work properly with LLaMA. Unfortunately not MPT, because of the OOMs I got constantly.

Here is the full repo for inference and details of dataset: https://github.com/musabgultekin/functionary

I'm gonna add training code and more info soon.

@unaidedelf8777 I saw your mpt-7b-CodeCaller-v1. How is it going? At first I trained on single-turn conversations, but that didn't work out well, so I had to use multiple turns in one conversation to make it actually work properly.

@musabgultekin, that repo was just a failed LoRA fine-tune that I whipped up yesterday. I am still ironing out my dataset and trying to get rid of the useless entries. Right now my plan is to add a few new attention heads on the upper layers, since I have noticed that the model really tries to call the functions but doesn't know the format, so it just produces mumbled garbage that it makes up. All I need to do is find on what layer (or chunk of layers) it is generating the garbage, and replace the garbage it makes up with a function call, directed by the attention head(s).

Also, did you get it working on LLaMA 1 or 2? Just curious. Love the functionary screaming llama on your repo, too!

One last thing: I saw on the repo that you mentioned you were saving up to train the model. Have you seen the Petals project? They support the LLaMA 70B models, and I imagine the others as well, and it's also completely free. The only potential caveat I see is that it's distributed swarm training and thus might be slower, but I don't know, I have only performed inference on the network. Definitely good to look into!

musabgultekin commented 1 year ago

@unaidedelf8777 I think it's probably that the model doesn't know the schema of the fn_def section you put in the dataset. That's why I used TypeScript definitions, which exist in the pretraining data of these LLMs. Microsoft TypeChat uses the same TypeScript-definitions idea: https://github.com/microsoft/TypeChat/blob/d2f2de9ca37ef9adeb108d5fc60703b72fec0a22/site/src/blog/introducing-typechat.md#just-add-types So I don't think adding new attention heads would be sufficient.

If you are convinced, I just committed a function for you that helps with converting APIs.guru specs to TypeScript schemas: https://github.com/musabgultekin/functionary/commit/6cde13ca40be1ca4e873955c6d15e8969a578c50 Of course you can modify the schema as you need.

It's based on LLaMA 1, not 2. I'm currently checking out LLaMA 2 training, doing some experiments, etc.

Thanks for the Petals project info! I thought it was only for the forward pass. That's great to know, will check it out.

unaidedelf8777 commented 1 year ago

@unaidedelf8777 I think it's probably that the model doesn't know the schema of the fn_def section you put in the dataset. That's why I used TypeScript definitions, which exist in the pretraining data of these LLMs. Microsoft TypeChat uses the same TypeScript-definitions idea: https://github.com/microsoft/TypeChat/blob/d2f2de9ca37ef9adeb108d5fc60703b72fec0a22/site/src/blog/introducing-typechat.md#just-add-types So I don't think adding new attention heads would be sufficient.

If you are convinced, I just committed a function for you that helps with converting APIs.guru specs to TypeScript schemas: musabgultekin/functionary@6cde13c Of course you can modify the schema as you need.

It's based on LLaMA 1, not 2. I'm currently checking out LLaMA 2 training, doing some experiments, etc.

Thanks for the Petals project info! I thought it was only for the forward pass. That's great to know, will check it out.

Looking at the training mixture of LLaMA a little closer, I do see what you're saying, and the TypeScript definitions definitely make a lot more sense when looking at them with no understanding of JSON. The only reason I was pushing for JSON definitions is that they're simpler in my opinion, but keep in mind that's the opinion of somebody who knows nothing about TypeScript.

Update: I just looked more into how TypeScript definitions work, and I am definitely on board with the idea, as it does seem simple enough to convert the JSON schema I was using into TypeScript definitions.

AlbertMarashi commented 1 year ago

TypeScript definitions would be a lot more powerful, I think.

unaidedelf8777 commented 1 year ago

@musabgultekin, just gave you your first PR on the repo.

https://github.com/musabgultekin/functionary/pull/1

I would repeat what I said in the request, but my hand hurts from typing lol

dakinggg commented 6 months ago

Thank you all for the interesting discussion! I'm going to go ahead and close this issue as there is no question/feature request here.