ChrisDeadman opened this issue 7 months ago
Thanks @ChrisDeadman.
This is very interesting I will look into it for a future release.
Hey @ChrisDeadman, I tried this but didn't have much luck. How can we deal with the prefix and suffix? Can we write an hbs template for it?
Yes, that should be possible, and it would make it easier for me to try out; if I find something usable I could then open a PR with a ChatML template. But as far as I can see in the code under src/extension/fim-templates.ts, hbs templates are not yet supported for FIM?
On second thought, the templates need to support some kind of flag which tells your response parser that the last part of the template (e.g. the three backticks) should be prepended to the actual model response before parsing it. Otherwise the text we suggested the model should start its response with is missing.
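A minimal Python sketch of that idea (function name and the "```\n" tail are hypothetical, just for illustration):

```python
# Hypothetical sketch: re-attach the tail of the prompt template (the part the
# model was primed to start its answer with) before parsing the response.
def parse_completion(response: str, generation_tail: str = "```\n") -> str:
    """Prepend the template tail, then extract the code from the fenced block."""
    full = generation_tail + response
    start = full.find("```")
    if start == -1:
        return full.strip()
    # Skip past the opening fence line, then cut at the closing fence (if any).
    start = full.find("\n", start) + 1
    end = full.find("```", start)
    return full[start:end if end != -1 else len(full)].rstrip()
```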
Under Python, using huggingface transformers templates, you can get the "generation prompt" like this:
generation_prompt = self.tokenizer.apply_chat_template([], add_generation_prompt=True)
Because the [] represents an empty list of messages and the template checks for the add_generation_prompt variable:
{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}
{% for message in messages %}{{'<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n'}}{% endfor %}
{% if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{% endif %}
So only the generation prompt is returned. It works because the input is expected to be of the same message type used for the chat endpoint, and, if the list is empty, just the generation prompt (the part that the model should start its generation with) is returned.
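This behavior can be mimicked in pure Python without transformers (an illustrative sketch of the ChatML template above, not the library's implementation):

```python
# Pure-Python mimic of the ChatML chat template above, to illustrate why an
# empty message list yields just the generation prompt.
def apply_chatml(messages, add_generation_prompt=False):
    out = ""
    for m in messages:
        out += "<|im_start|>" + m["role"] + "\n" + m["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"
    return out

# With an empty list, only the generation prompt comes back.
generation_prompt = apply_chatml([], add_generation_prompt=True)
```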
Maybe something similar could be done with the hbs templates.
Hey, sorry but I'm still unsure about it. If you could adapt it to an hbs template it might be clearer? I did try to adapt the fim completions to use templates but didn't know what format it should be.
Ollama automatically wraps whatever you pass to the /generate endpoint with a template (unless you turn it off with raw: true).
The default mistral template is pretty boring - https://ollama.com/library/mistral:latest/blobs/e6836092461f - but a lot of them follow that same format - https://ollama.com/library/dolphincoder:latest/blobs/62fbfd9ed093.
The ollama generate endpoint does allow overriding both the default model template and system message. Not sure if other systems (like vllm) do though.
> Hey, sorry but I'm still unsure about it. If you could adapt it to an hbs template it might be clearer? I did try to adapt the fim completions to use templates but didn't know what format it should be.
If I understand the hbs syntax correctly this should work for your existing fim stuff:
{{#if prefix}}
<PRE> {{prefix}}
{{/if}}
{{#if suffix}}
<SUF> {{suffix}}
{{/if}}
{{#if add_generation_prompt}}
<MID>
{{/if}}
Just supply the 3 variables as args, or pass only add_generation_prompt if you just want to get the start of the model response.
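As a sanity check, here is an illustrative Python equivalent of that hbs FIM template (codellama-style <PRE>/<SUF>/<MID> tokens; whitespace handling is approximate):

```python
# Illustrative equivalent of the hbs FIM template above: each section is
# emitted only when its variable is set.
def render_fim(prefix=None, suffix=None, add_generation_prompt=False):
    out = ""
    if prefix:
        out += "<PRE> " + prefix + "\n"
    if suffix:
        out += "<SUF> " + suffix + "\n"
    if add_generation_prompt:
        out += "<MID>\n"
    return out
```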
For chatml it could be something like this (not tested):
{{#if system}}
<|im_start|>system\n{{system}}<|im_end|>\n
{{/if}}
{{!-- note: plain Handlebars #if has no ||; this assumes an "or" helper is registered --}}
{{#if (or prefix suffix)}}
<|im_start|>user\nPlease generate the code between the following prefix and suffix.\n
{{#if prefix}}
Prefix:\n```\n{{prefix}}\n```\n
{{/if}}
{{#if suffix}}
Suffix:\n```\n{{suffix}}\n```\n
{{/if}}
<|im_end|>\n
{{/if}}
{{#if add_generation_prompt}}
<|im_start|>assistant\n```\n
{{/if}}
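For reference, an illustrative Python rendering of that ChatML FIM template (again just a sketch, not the extension's code):

```python
# Sketch of the ChatML FIM template above rendered in Python.
def render_chatml_fim(prefix="", suffix="", system=None, add_generation_prompt=True):
    out = ""
    if system:
        out += "<|im_start|>system\n" + system + "<|im_end|>\n"
    if prefix or suffix:
        out += "<|im_start|>user\nPlease generate the code between the following prefix and suffix.\n"
        if prefix:
            out += "Prefix:\n```\n" + prefix + "\n```\n"
        if suffix:
            out += "Suffix:\n```\n" + suffix + "\n```\n"
        out += "<|im_end|>\n"
    if add_generation_prompt:
        # The trailing "```\n" primes the model to answer inside a code fence.
        out += "<|im_start|>assistant\n```\n"
    return out
```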
I believe I understand. I'll add another template to my PR (https://github.com/rjmacarthy/twinny/pull/174) on the next update.
@ChrisDeadman are you using ollama as the backend? (Or if not, what are you using?)
> @ChrisDeadman are you using ollama as the backend? (Or if not, what are you using?)
I wrote a custom server - I added an Ollama-compatible API so I can run this extension over it. Internally, it uses huggingface templates to tokenize the chat completion messages. It does not, however, apply any templates to the prompt passed to the /generate endpoint.
Ah. Ollama does normally wrap the prompt passed to /generate with a model-specific template, unless raw: true is part of the request (which twinny doesn't currently set).
I think probably all autocomplete requests should use raw: true though - at least starcoder2 requires it, and my PR currently assumes all models should use it.
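For clarity, a sketch of what such a /api/generate request body would look like (model name and prompt are just examples):

```python
# Sketch of an Ollama /api/generate request body for FIM.
# With raw=True, Ollama skips the model's built-in template, so the
# FIM tokens reach the model verbatim.
def build_generate_request(model: str, fim_prompt: str) -> dict:
    return {
        "model": model,
        "prompt": fim_prompt,
        "raw": True,      # bypass the model's default template
        "stream": True,
    }
```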
Long term, it'd be cool to be able to edit both the chat and FIM templates as HBS in VS Code the same way command templates currently can be. For now I'll just add an extra template to the code.
Hey,
FYI, this should now work with any ChatML endpoint as the provider, as I added the ability to edit and choose a custom FIM template. The first test would be using the OpenAI API through LiteLLM with GPT-3.5 or GPT-4. I am still unsure if Ollama supports ChatML? Also, should I add raw: true to the options, or make raw an option in settings?
Here is a template I have been using with GPT-4 with pretty good success:
<|im_start|>system
You are an auto-completion coding assistant who uses a prefix and suffix to "fill_in_middle".<|im_end|>
<|im_start|>user
<prefix>{{{prefix}}}<fill_in_middle>{{{suffix}}}<end>
Only reply with pure code, no backticks, do not repeat code in the prefix or suffix, match brackets carefully.<|im_end|>
<|im_start|>assistant
Sure, here is the pure code auto-completion:
imo this looks like a great approach 👍🏼
Regarding raw, I second what @hafriedlander said after RTFMing the docs.
Sorry if I just missed it, but I don't really understand how to make this work. I'm currently running a fairly large model through Ollama (https://ollama.com/wojtek/beyonder), and it'd be great if I could use it for FIM as well.
Some additional info:
Editor: VSCodium
OS: Arch Linux
GPU: AMD Radeon RX 6800 XT (16 GB)
CPU: AMD Ryzen 7 3700X
RAM: 48 GB
Model's HuggingFace page: https://huggingface.co/mlabonne/Beyonder-4x7B-v3-GGUF
Thanks :sweat_smile: :pray: :purple_heart:
So I tested this briefly with Llama-3 by selecting "custom template" in the FIM Provider settings and modifying the templates like so:
fim-system.hbs
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful, respectful and honest coding assistant.
Always reply using markdown.<|eot_id|>
fim.hbs
{{{systemMessage}}}<|start_header_id|>user<|end_header_id|>
Please respond with the code that is missing here:
```{{language}}
{{{prefix}}}<MISSING CODE>
{{{suffix}}}
```<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```{{language}}
What I found is:
the language variable is resolved to an empty string in fim.hbs
But other than that it seems to work :smiley:
Here is a screenshot:
It would be nice to be able to repeat the last line of the prefix as the model response (e.g. thread. in my example), to make the model not repeat it (by providing a {{getLastLine prefix}} helper, for example).
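The logic such a helper would need is tiny; a Python sketch of the idea (the hbs helper itself would live in the extension's TypeScript, and the name is hypothetical):

```python
# Hypothetical getLastLine helper logic: return the last non-empty line of
# the prefix, so it can be echoed at the start of the expected model response.
def get_last_line(prefix: str) -> str:
    lines = [line for line in prefix.splitlines() if line.strip()]
    return lines[-1] if lines else ""
```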
Thanks @ChrisDeadman I think this can be arranged. Is there anything else datawise you'd like passed to the template?
Many thanks,
Thanks @rjmacarthy ! I cannot think of anything else that is missing at the moment, should be enough to support chatml and llama-3 templates imo.
I've added the language to the template now, but I think there are still some inconsistencies with how it works which I need to iron out. I still had mixed results with llama3:8b.
Did you also add an option to get the last line of prefix in the template? When adding this to the end of the template, the results should be much better.
No I didn't actually, I can add it.
That would be awesome, I will do some tests when ready :smiley:
First of all: Your extension is awesome, thanks for all your effort in making it better constantly! 👍🏼
FIM doesn't work for Mistral-7B-Instruct-v0.2-code-ft. I know that the ChatML format is mostly suited for turn-based conversations. However, except for the suggestions, you've already refactored your code to use the turn-based Ollama endpoint...
I get the reason why you have to use the generate endpoint of Ollama for FIM, which sucks a bit because you have to support all the different turn-templates in that case 🫤 This is still a big issue for all client applications that want to control models to answer in a specific way.
If you would try out ChatML support tho, you could try appending the start of the expected response from the model after the template, e.g.:
I have tried this out manually and it works. Basically, everything you write after <|im_start|>assistant will make the model think it started its answer like that (this works for basically all models, not just ChatML-based ones). For reference, this is the correct huggingface tokenizer template for ChatML: