Closed musabgultekin closed 6 months ago
Update:
Instead of using a separate tool role, I think it's better to go with function as the role for the function response, and to separate function calls with their own delimiters. That way we can place function calls in between messages.
<|im_start|>system
...FUNCTION_DEFINITION PART...<|im_end|>
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Add X to my todo list<|im_end|>
<|im_start|>assistant
Sure! Lets add it.
<|fn_start|>todo.addTodo
{"text": "X"}<|fn_end|>
<|im_start|>function
{"status": "OK"}
<|im_start|>assistant
Okay, I've added!<|im_end|>
We are looking into tool use, thank you for sharing your work. I am curious, have you considered instead of
<|fn_start|>todo.addTodo
{"text": "X"}<|fn_end|>
<|im_start|>function
{"status": "OK"}<|im_end|>
My instinct is to have this be
<|im_start|>system function todo.addTodo
{"text": "X"}
response: {"status": "OK"}<|im_end|>
or something similar. Won't the system block while waiting for the response from the tool anyway? If so, you can keep the call and the response in one message.
Also I am curious how you came up with the syntax for defining tools.
My initial thinking was to have two EOS tokens, im_end and fn_end, so that sampling stops at the end of a function call too. Let me know if you think a single EOS is better for simplicity. Also, since a role identifies the owner of a message, it is much more intuitive for the assistant to own the function call rather than the system. In your revised version, as far as I can see, there is only a "\n" after the function call request, which is not suitable as a generation stop. But maybe we can put im_end after the function call request in your revised version.
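To make the two-EOS idea concrete, here is a minimal sketch of cutting a generation at whichever stop string appears first. The helper name `truncate_at_stop` is hypothetical, and the stop strings are simply the ones proposed in this thread; a real setup would more likely stop on token IDs inside the sampler.

```python
# Stop strings as proposed in this thread (assumption: treated as plain text here).
STOP_SEQUENCES = ("<|im_end|>", "<|fn_end|>")


def truncate_at_stop(text, stops=STOP_SEQUENCES):
    """Return (text_before_stop, stop_that_fired); the stop is None if none occurs."""
    best_index, best_stop = len(text), None
    for stop in stops:
        index = text.find(stop)
        # Keep the earliest stop string found in the text.
        if index != -1 and index < best_index:
            best_index, best_stop = index, stop
    return text[:best_index], best_stop


generated = (
    "Sure! Lets add it.\n"
    '<|fn_start|>todo.addTodo\n{"text": "X"}<|fn_end|>\n'
    "<|im_start|>function"
)
body, stop = truncate_at_stop(generated)
```

Knowing which stop fired tells the caller whether to run a function (fn_end) or hand the turn back to the user (im_end).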
The syntax is essentially a simplified TypeScript declaration file (.d.ts). I got the inspiration from ChatGPT plugins. Because these models are trained on lots of TypeScript "type" text, they understand well enough what a type or a function signature is. A DSL that is completely different from the pretraining data would not be understood as well. Also note that every function takes a single typed object parameter, so the model knows it should output an object and not strings/ints separated by some arbitrary separator when calling a function.
Here is an extended version. We don't need the implementation part, so we can get rid of it:
// Plugin for managing a TODO list, you can add, remove and view your TODOs.
namespace todo {
  // Function to add a todo to the list
  function addTodo(_: { todo: string; }): any {
    // implementation here
  }
} // namespace todo
Have you considered adopting the same function-defining schema that OpenAI uses? Personally, I haven't encountered any issues with this schema and believe that maintaining consistency might help others, since they won't need to redefine a function that was initially used with OpenAI's function calling when they use their functions with the MPT-30b function calling. Just a thought.
I'd caution against adding tokens if you can avoid it. I recently finetuned MPT-30b on a reddit-style dataset, and initially made heavy use of special tokens as separators (<|post_title|>title here<|post_author|>username...). I found that even after training on 150M tokens it would still associate these with their correct meaning only very loosely. Retraining with plaintext separators ([Title] title here [Author] username...) solved the issue completely.
Granted, I was going about it a bit excessively (10 or so custom tokens), but the poor results were surprising and led me to believe that you'll be better off not adding <|fn_start|> and <|fn_end|>, especially if you don't have hundreds of millions of tokens to finetune on.
Have you considered adopting the same function-defining schema that OpenAI uses?
@unaidedelf8777 If you mean the JSON Schema Object definition, then that is something that you use on the API but not necessarily in the model. That JSON Schema definition can be converted to the schema that the model would use.
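As a rough illustration of that conversion, the sketch below turns an OpenAI-style function definition (name, description, JSON Schema parameters) into a TypeScript-like declaration of the kind shown earlier in the thread. The helper names `schema_to_ts` and `function_to_ts` are hypothetical, only a small subset of JSON Schema is handled, and the exact output layout is an assumption rather than the format any model was actually trained on.

```python
# Maps JSON Schema primitive types to TypeScript type names.
_TS_TYPES = {
    "string": "string",
    "integer": "number",
    "number": "number",
    "boolean": "boolean",
}


def schema_to_ts(schema):
    """Render a small subset of JSON Schema as a TypeScript type expression."""
    kind = schema.get("type")
    if kind == "array":
        return schema_to_ts(schema.get("items", {})) + "[]"
    if kind == "object":
        required = set(schema.get("required", []))
        fields = []
        for name, prop in schema.get("properties", {}).items():
            optional = "" if name in required else "?"
            fields.append(f"{name}{optional}: {schema_to_ts(prop)};")
        return "{ " + " ".join(fields) + " }" if fields else "{}"
    # Unknown or missing types fall back to `any`.
    return _TS_TYPES.get(kind, "any")


def function_to_ts(namespace, fn):
    """Render one function definition as a namespaced declaration (no body)."""
    params = schema_to_ts(fn.get("parameters", {"type": "object"}))
    return "\n".join([
        f"namespace {namespace} {{",
        f"// {fn.get('description', '')}".rstrip(),
        f"function {fn['name']}(_: {params}): any;",
        f"}} // namespace {namespace}",
    ])


add_todo = {
    "name": "addTodo",
    "description": "Function to add a todo to the list",
    "parameters": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
}
declaration = function_to_ts("todo", add_todo)
```

With something like this, the API can keep accepting JSON Schema while the prompt only ever contains the TypeScript-style text.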
@float-trip That is very useful information! What about special tokens that we add to the vocabulary? For example, you would have added "<|post_title|>" as a separate token that never existed in the pre-training step but does exist in the fine-tuning step. So the model doesn't have any associations with "title", but it would know that the token is something different in the embedding space. The same goes for <|fn_start|> and the others. It could require a custom tokenizer rather than the default MPT-30B one, though.
If I'm understanding right, that's what I did - I modified the tokenizer like this:
from transformers import GPTNeoXTokenizerFast

tokenizer = GPTNeoXTokenizerFast.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    additional_special_tokens=[
        "<|post_title|>",
        "<|post_url|>",
        "<|post_author|>",
        # ...
    ],
)
tokenizer.save_pretrained("tokenizer")
The model was somewhat able to learn the correct meanings for these (especially for the more frequently occurring tokens like <|comment_author|>, <|comment_body|>, etc.). It's possible that only adding a couple new tokens would be fine, and the problem only appears when going overboard. But I'd at least do a test run without any new tokenization to see if it performs better.
Thanks @float-trip. Got it. Then we can remove fn_start and instead use im_start with the assistant role (like the OpenAI API, which returns the assistant role on function calls).
We can reuse the tokens we already have and attach fields to those roles instead. Basically an arbitrary map of fields for every role, which the ChatML docs show only as examples.
<|im_start|>system
...FUNCTION_DEFINITION PART...<|im_end|>
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Add X to my todo list<|im_end|>
<|im_start|>assistant
Sure! Lets add.
<|im_start|>assistant to=todo.addTodo
{"text": "X"}<|im_end|>
<|im_start|>function name=todo.addTodo
{"status": "OK"}
<|im_start|>assistant
Okay, I've added!<|im_end|>
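A minimal sketch of how this header-with-fields layout could be rendered and parsed. The helper names `render_message` and `parse_header` are hypothetical, and treating the fields as space-separated key=value pairs is an assumption read off the example above, not a documented ChatML rule.

```python
import json


def render_message(role, content, fields=None):
    """Render one message; `fields` become key=value pairs on the role header."""
    header = role
    for key, value in (fields or {}).items():
        header += f" {key}={value}"
    return f"<|im_start|>{header}\n{content}<|im_end|>\n"


def parse_header(header):
    """Split 'assistant to=todo.addTodo' into ('assistant', {'to': 'todo.addTodo'})."""
    role, *pairs = header.split()
    return role, dict(pair.split("=", 1) for pair in pairs)


call = render_message("assistant", json.dumps({"text": "X"}), {"to": "todo.addTodo"})
role, fields = parse_header("function name=todo.addTodo")
```

A header without fields parses to the bare role and an empty map, so plain assistant and user messages go through the same code path.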
@musabgultekin, yeah, I don’t really know anything about tokenizers. I just finished building out a massive system around the OpenAI schemas (which it constructs dynamically). I didn’t want to rebuild that logic because it’s quite complex, and I figured most others wouldn’t want to do something similar when migrating.
@musabgultekin @samhavens @float-trip @unaidedelf8777
You should all check out this twitter thread. It forces the LLM to output valid "code" following function/grammar specs by zeroing out probabilities for tokens which should not exist. It might be worth looking into.
The benefits are:
If the goal is to get parsable outputs for a given instruction, it solves the issue, and it's a good solution since it doesn't require further fine-tuning. I really like it.
This would work for function-only models, though. For example, if you have a loop in which the model always has to decide what to do next, this works: a robot's control loop where it can only do "go_forward", "go_backwards", "wait", etc. is a good fit. But if the model needs to be able to decide not to call any function and ask follow-up questions instead, then it won't work. Fine-tuning is the way to go in that case.
Thanks for the info @AlbertMarashi ! Checking it out
@AlbertMarashi For a more general/flexible version of that idea, check out LogitsWarper and LogitsProcessor in HuggingFace's transformers library.
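The core of the "zero out probabilities" idea fits in a few lines. Below is a library-free sketch on plain Python lists; a real logits processor would apply the same masking to score tensors at every decoding step. The tiny vocabulary and the helper names `mask_logits` and `softmax` are made up for illustration, and the sketch assumes at least one token is allowed.

```python
import math


def mask_logits(logits, allowed_ids):
    """Set every logit outside `allowed_ids` to -inf so it gets probability 0."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; assumes at least one finite logit."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 4-token vocabulary: three robot actions plus free text.
vocab = ["go_forward", "go_backwards", "wait", "hello"]
logits = [1.0, 0.5, 0.2, 3.0]  # the model prefers "hello"

probs = softmax(mask_logits(logits, allowed_ids=[0, 1, 2]))
best = vocab[probs.index(max(probs))]
```

Because "hello" is masked out, the highest-probability token becomes one of the allowed actions, which is exactly why this approach cannot let the model choose to answer in free text.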
I've finally prepared a good-looking dataset that is ready for fine-tuning. It has ~5300 examples and ~500 different schemas, all with different prompts. 50% contain function calls; the other 50% have function schemas but no calls, to teach the model when not to call the functions.
But 8xA100 40GB gave OOM. I've looked at two clouds for 8xA100 80GB, but there was no availability. I'm going to have to defer training until I find some big GPUs.
I have a bash script in here which provisions an 8xA100 80gb from LambdaLabs: https://gist.github.com/float-trip/679019a23f246b17d2dff9e2cf55c387
It generally still takes a few hours, but if you leave it running you'll get it eventually.
Oh amazing! @float-trip Will use it thanks 🙏
I’m working on a dataset of function definitions, prompts to call the functions, example responses from the functions, and a message that the model gives based on the response from the function.
All I did was scrape a ton of OpenAPI schemas from the apisguru repository and turn them into function definitions, then devise a GPT prompt to make up prompts that call those functions, example responses from the functions, and the model responses based on those function responses.
It is chewing away right now. Will ping y’all when I throw it on Hugging Face… more to come.
You should try RunPod.io; their pricing is slightly higher than Lambda's, but they usually have availability for basically everything. It says here that for an A100 SXM they only want $1.44 USD per card-hour. Not too bad in my opinion.
One other thing: if you haven’t used GCP yet, they’ll give you about 300 dollars of credits for a free trial. You will probably have to get in touch with their team if you want more than one A100, though. But in my experience they’re pretty good about fast replies.
It will have around 27000 examples of functions and completions. No extra training to tell the model when to call the functions really, only how to.
@unaidedelf8777 That's great! I'm wondering if you only used one function per example and that's how you reached 27k, because the site has 2.5k APIs. Also, did you try making the requests? It would be bad if most of them require auth.
@float-trip I think you also needed to resize the embeddings.
Check out: https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py#L65C42-L65C42
No, the function responses are synthetic, but they were verified to make sure they have the correct schema. I have the dataset repo set up right now, and it has the prompt I used in there. The repo is empty except for a readme and the prompt because the API is slow and I didn't feel like making it asynchronous. And to answer your question: yes, currently it is only one prompt, one function call; however, I plan on iterating and improving the dataset soon enough.
If this dataset picks up any traction, do you think I should make a Patreon, since OpenAI is expensive for this kind of thing?
also here's the hf link
I'll ping y'all when it's finished.
@musabgultekin @AlbertMarashi @float-trip @samhavens
Just uploaded the part of the dataset which is finished; it's only around 1k examples though. More to come.
https://huggingface.co/datasets/unaidedelf87777/openapi-function-invocations-25k/
The preview is also messed up, so it shows the prompt I used instead of the CSV. If someone knows how to fix that, please let me know.
Thanks, resizing the embeddings is needed in cases where model.config.vocab_size < len(tokenizer). GPT-NeoX-20B and the MPT models leave some extra room in the vocab size for performance reasons, so a few added tokens fit without a resize.
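That check can be written as a small guard. In the sketch below, `resize_if_needed` is a hypothetical helper; it relies only on `model.config.vocab_size`, `len(tokenizer)`, and `model.resize_token_embeddings(...)`, which Hugging Face models expose. The stand-in classes and the sizes are illustrative so the snippet runs without downloading a model.

```python
def resize_if_needed(model, tokenizer):
    """Grow the embedding matrix only when the tokenizer outgrew the model's vocab."""
    if model.config.vocab_size < len(tokenizer):
        model.resize_token_embeddings(len(tokenizer))
        return True
    return False


# Tiny stand-ins for a model and its config (not real transformers classes).
class _Config:
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size


class _Model:
    def __init__(self, vocab_size):
        self.config = _Config(vocab_size)

    def resize_token_embeddings(self, new_size):
        self.config.vocab_size = new_size


roomy = _Model(vocab_size=50432)  # illustrative: padded vocab with spare rows
tight = _Model(vocab_size=50277)  # illustrative: no spare rows
tokenizer_with_new_tokens = ["tok"] * 50280  # only len() matters for this sketch

resized_roomy = resize_if_needed(roomy, tokenizer_with_new_tokens)
resized_tight = resize_if_needed(tight, tokenizer_with_new_tokens)
```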
Hi @unaidedelf8777, thank you for the dataset! If possible, can you save your dataset in JSON Lines format (a separate JSON object on each line)? Apparently the CSV has problems with your data/saving method and is not valid. Unfortunately, it currently doesn't look usable :(
Yeah, sorry about that. It’s because some entries span a couple of lines. I have a cleaned version; I just have yet to throw it up there. If you look in the Data folder of the repo, there are train and validation jsonl files, pre-formatted with the special tokens which I described in the dataset card. I also moved to a new repo, so here's the link
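For reference, a small sketch of why JSON Lines sidesteps the multi-line problem: `json.dumps` escapes newlines inside strings, so each record stays on one physical line. The record fields and the helper names `write_jsonl` and `read_jsonl` are made up for illustration and are not the dataset's actual columns.

```python
import io
import json


def write_jsonl(records, handle):
    """Write one JSON object per line; embedded newlines are escaped by json.dumps."""
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(handle):
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


# Hypothetical record whose prompt spans two lines.
records = [
    {"prompt": "line one\nline two", "call": {"name": "todo.addTodo"}},
    {"prompt": "single line", "call": None},
]

buffer = io.StringIO()  # stands in for a file opened with open(path, "w")
write_jsonl(records, buffer)
buffer.seek(0)
round_tripped = read_jsonl(buffer)
```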
I’m working on a loading script for the dataset right now which will serve the jsonl files. I just don’t know how to write it, and gpt seems not to know either so I’ll have to actually look into it ☹️
I've written a new training system that calculates the loss only on assistant responses and the assistant's function calls. I used LLaMA 7B as the base and the Hugging Face Trainer, and trained it on 37k examples (34k ShareGPT conversations (wizard vicuna uncensored) + 3k GPT-4-generated prompts and function calls). Sadly, I wasn't able to consistently find 8x A100 80GB for the training experiments. It was stressful, because I already had the MPT training dataset and code and simply couldn't find the compute to train it. That's why I decided to start with a smaller model. (MPT 7B also failed on the nodes that I tried.)
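A common way to restrict the loss like this is to copy the input IDs into the labels and overwrite everything that is not an assistant turn with -100, the default ignore index of PyTorch's cross-entropy loss. The sketch below shows only that masking step on already-tokenized segments; the token IDs are made up, `build_labels` is a hypothetical helper, and this is not necessarily how the repo implements it.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments):
    """segments: list of (role, token_ids) pairs. Returns (input_ids, labels)."""
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            # Assistant text and assistant function calls contribute to the loss.
            labels.extend(token_ids)
        else:
            # System, user, and function-output tokens are context only.
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Made-up token IDs for one conversation with a function call.
segments = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7]),   # the function call
    ("function", [8]),       # the function output
    ("assistant", [9, 10]),  # the final answer
]
input_ids, labels = build_labels(segments)
```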
Here is the repo and inference code that works rn: https://github.com/musabgultekin/functionary/blob/main/inference.ipynb
My preliminary manual tests show that it works on instructions. It knows when to call a function and which function to call. I will prepare the codebase; I still need to write an evaluation suite and custom tooling around it, though.
There are issues, of course: multiple rounds don't work properly because the dataset only has one round of function calls per conversation. It also hallucinates when commenting on function outputs for some reason, e.g. adding things that don't exist in the function outputs. I hope we can reduce this by training on 13B, adding more ShareGPT conversations, and potentially using @unaidedelf8777's dataset.
Per @float-trip's suggestion, I have not introduced new tokens (I will need the eval suite first, then I will do the ablation study).
I'll share more details, design choices, and code soon. I feel like the current status is just a proof of concept. Please let me know your suggestions.
Here is how it decides to call functions:
Here is how it uses the function output:
@musabgultekin,
I'd recommend using runpod.io. I can consistently get 8 A6000 48GB cards with 1TB of RAM and 64 CPU cores for about $6 an hour (USD). It's a much better deal than providers like Lambda Labs or Google Cloud.
It's been nearly a month since I started, but I managed to get function calling working properly with LLaMA. Unfortunately not with MPT, because of the OOMs that I constantly got.
Here is the full repo for inference and details of the dataset: https://github.com/musabgultekin/functionary
I'm going to add the training code and more info soon.
@unaidedelf8777 I saw your mpt-7b-CodeCaller-v1. How is it going? I ask because I first trained on single-turn conversations and it didn't work out well, so I had to use multiple turns in one conversation to make it actually work properly.
@musabgultekin, that repo was just a failed LoRA fine-tune that I whipped up yesterday. I am still ironing out my dataset and trying to get rid of the useless entries. My current plan is to add a few new attention heads on the upper layers, since I have noticed that the model really tries to call the functions but doesn't know the format, so it just gives made-up, mumbled garbage. All I need to do is find the layer (or chunk of layers) where it generates the garbage, and replace that garbage with a function call, directed by the new attention head(s).
Also, did you get it working on Llama 1 or 2? Just curious. Love the functionary screaming llama on your repo, too!
One last thing: I saw on the repo that you mentioned you were saving up to train the model. Have you seen the Petals project? They support the Llama 70B models, and I imagine the others as well, and it's completely free. The only potential caveat I see is that it is distributed swarm training, so it might be slower, but I don't know; I have only performed inference on the network. Definitely worth looking into!
@unaidedelf8777 I think the model probably doesn't know the schema of the fn_def section that you put in the dataset. That's why I used TypeScript definitions, which exist in the pretraining data of these LLMs. Microsoft TypeChat also uses the same TypeScript definitions idea: https://github.com/microsoft/TypeChat/blob/d2f2de9ca37ef9adeb108d5fc60703b72fec0a22/site/src/blog/introducing-typechat.md#just-add-types So I don't think adding new attention heads would be sufficient.
If you are convinced, I just committed a function for you that helps convert apisguru specs to TypeScript schemas: https://github.com/musabgultekin/functionary/commit/6cde13ca40be1ca4e873955c6d15e8969a578c50 Of course, you can modify the schema as you need.
It's based on Llama 1, not 2. I'm currently checking out Llama 2 training and doing some experiments.
Thanks for the Petals project info! I thought it was only for the forward pass. That's great to know; I will check it out.
Looking at the training mixture of Llama a little closer, I do see what you're saying, and the TypeScript definitions definitely make a lot more sense when you look at them with no understanding of JSON. The only reason I am really pushing for JSON definitions is that they are simpler in my opinion. Keep in mind that's the opinion of somebody who knows nothing about TypeScript, but yeah.
Update: I just looked more into how TypeScript definitions work, and I am definitely on board with the idea, as it does seem simple enough to convert the JSON schema I was using into TypeScript definitions.
TypeScript definitions would be a lot more powerful, I think.
@musabgultekin , just gave you your first pr on the repo.
https://github.com/musabgultekin/functionary/pull/1
I would repeat what I said in the request, but my hand hurts from typing lol
Thank you all for the interesting discussion! I'm going to go ahead and close this issue as there is no question/feature request here.
Hi, I'm working on fine-tuning MPT-30B for function calling. I'm currently still preparing the fine-tuning dataset. AFAIK there is no open-source fine-tuned model with function support (let me know if you know of one).
I'm not sure about the format and wanted to get some advice. Basically, I'm proposing to add "function" and "tool" roles to the original ChatML format and to put the function schema in the system prompt. Here is the starting context. Functions are converted to a schema and put in a system prompt (we can change this role to something like "function_definitions"). The model should start generating either "assistant" or "function" right after this exact context:
If it starts with "function", it'll generate the namespace.function definition. Like this:
Then we'll give this to the user. Or, as a library, we can actually call the function too. The role of the function response can be named "tool", like this:
Full raw format:
What do you think?