twinnydotdev / twinny

The most no-nonsense, locally or API-hosted AI code completion plugin for Visual Studio Code - like GitHub Copilot but completely free and 100% private.
https://twinny.dev
MIT License
2.93k stars · 153 forks

Support FIM for models using ChatML format #142

Open · ChrisDeadman opened this issue 7 months ago

ChrisDeadman commented 7 months ago

First of all: Your extension is awesome, thanks for all your effort in making it better constantly! 👍🏼

FIM doesn't work for Mistral-7B-Instruct-v0.2-code-ft. I know the ChatML format is mostly suited for turn-based conversations. However, except for the inline suggestions, you've already refactored your code to use the turn-based Ollama chat endpoint...

I get why you have to use Ollama's generate endpoint for FIM, which is a bit unfortunate because you then have to support all the different turn templates yourself 🫤 This is still a big issue for any client application that wants to steer a model to answer in a specific way.

If you do want to try out ChatML support though, you could append the start of the expected model response after the template, e.g.:

<|im_start|>system
You are an awesome coder, auto-complete the following code:<|im_end|>
<|im_start|>user
here goes the code<|im_end|>
<|im_start|>assistant
Sure, here is the auto-completion:   <-- the model will think it answered like that
``` <-- followed by three backticks and a newline to force the model to generate code.

I have tried this out manually and it works. Basically everything you write after <|im_start|>assistant will make the model think it started its answer like that (this works for basically all models, not just ChatML-based ones).
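
In code, the trick boils down to something like this (a rough TypeScript sketch; the function name is made up, not anything that exists in twinny):

// Seed the assistant turn so the model believes it already answered with
// "Sure, ..." followed by an opening code fence.
function buildChatMlFimPrompt(code: string): string {
  const system = "You are an awesome coder, auto-complete the following code:";
  return (
    `<|im_start|>system\n${system}<|im_end|>\n` +
    `<|im_start|>user\n${code}<|im_end|>\n` +
    "<|im_start|>assistant\nSure, here is the auto-completion:\n```\n"
  );
}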

For reference, this is the correct huggingface tokenizer template for ChatML:

{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}
{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}
{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
rjmacarthy commented 7 months ago

Thanks @ChrisDeadman.

This is very interesting I will look into it for a future release.

rjmacarthy commented 6 months ago

Hey @ChrisDeadman, I tried this but didn't have much luck. How can we deal with the prefix and suffix? Can we write an hbs template for it?

ChrisDeadman commented 6 months ago

Yes, that should be possible, and it would make it easier for me to experiment - if I find something usable I could then make a PR with a ChatML template. But as far as I can see in the code under src/extension/fim-templates.ts, hbs templates are not yet supported for FIM?

ChrisDeadman commented 6 months ago

On second thought, the templates need to support some kind of flag which tells your response parser that the last part of the template (e.g. the 3 backticks) should be prepended to the actual model response before parsing it. Otherwise the part we told the model to start its response with is missing.
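
Roughly something like this (TypeScript sketch, names made up):

// The seeded tail of the template (e.g. "```\n") is never echoed back by the
// model, so glue it onto the raw response before the usual parsing/stripping.
function parseFimResponse(rawResponse: string, seededTail: string): string {
  const full = seededTail + rawResponse;
  // strip the code fence the template forced the model into
  return full.replace(/^```[^\n]*\n/, "").replace(/\n?```[\s\S]*$/, "");
}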

ChrisDeadman commented 6 months ago

In Python, using Hugging Face transformers chat templates, you can get the "generation prompt" like this:

generation_prompt = self.tokenizer.apply_chat_template([], add_generation_prompt=True)

The [] represents an empty list of messages, and the template checks the add_generation_prompt variable:

{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}
{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}
{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}

So only the generation prompt is returned. It works because apply_chat_template expects the same message format that is used for the chat endpoint and, if the list is empty, returns just the generation prompt (the part the model should start its generation with).

Maybe something similar could be done with the hbs templates.

rjmacarthy commented 6 months ago

Hey, sorry, but I'm still unsure about it. If you could adapt it to an hbs template it might be clearer? I did try to adapt the FIM completions to use templates but didn't know what format they should be.

hafriedlander commented 6 months ago

Ollama automatically wraps whatever you pass to the /generate endpoint with a template (unless you turn it off with raw: true).

The default mistral template is pretty boring - https://ollama.com/library/mistral:latest/blobs/e6836092461f - but a lot of them follow that same format - https://ollama.com/library/dolphincoder:latest/blobs/62fbfd9ed093.

The Ollama generate endpoint does allow overriding both the default model template and the system message. Not sure whether other systems (like vLLM) do, though.

ChrisDeadman commented 6 months ago

> Hey, sorry, but I'm still unsure about it. If you could adapt it to an hbs template it might be clearer? I did try to adapt the FIM completions to use templates but didn't know what format they should be.

If I understand the hbs syntax correctly this should work for your existing fim stuff:

{{#if prefix}}
  <PRE> {{prefix}} 
{{/if}}

{{#if suffix}}
  <SUF> {{suffix}} 
{{/if}}

{{#if add_generation_prompt}}
  <MID>
{{/if}}

Just supply the 3 variables as args or pass only add_generation_prompt if you just want to get the start of the model response.
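
Rendering it would look roughly like this (sketch using the handlebars npm package; fimTemplateSource stands for the template above):

import Handlebars from "handlebars";

// fimTemplateSource stands for the .hbs contents above (read from disk, etc.)
declare const fimTemplateSource: string;

const template = Handlebars.compile(fimTemplateSource);

// Full FIM prompt: supply all three variables...
const fullPrompt = template({
  prefix: "def add(a, b):\n    ",
  suffix: "\n    return result",
  add_generation_prompt: true,
});

// ...or pass only add_generation_prompt to get just the start of the
// model response ("<MID>" here).
const generationPrompt = template({ add_generation_prompt: true });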

ChrisDeadman commented 6 months ago

For ChatML it could be something like this (not tested):

{{#if system}}
  <|im_start|>system\n{{system}}<|im_end|>\n
{{/if}}

{{#if (or prefix suffix)}}
  <|im_start|>user\nPlease generate the code between the following prefix and suffix.\n
  {{#if prefix}}
    Prefix:\n```\n{{prefix}}\n```\n
  {{/if}}

  {{#if suffix}}
    Suffix:\n```\n{{suffix}}\n```\n
  {{/if}}
  <|im_end|>\n
{{/if}}

{{#if add_generation_prompt}}
  <|im_start|>assistant\n```\n
{{/if}}
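
One caveat: plain Handlebars {{#if}} only takes a single value, so the (or prefix suffix) above needs a small helper registered first, and a literal \n in the template text is emitted as-is rather than as a newline. A rough sketch of the helper and rendering (again untested):

import Handlebars from "handlebars";

// Handlebars has no built-in "or", so register one before compiling.
Handlebars.registerHelper("or", (a: unknown, b: unknown) => Boolean(a) || Boolean(b));

// chatMlTemplateSource stands for the ChatML template above.
declare const chatMlTemplateSource: string;

const chatMlFim = Handlebars.compile(chatMlTemplateSource);
const prompt = chatMlFim({
  system: "You are an awesome coder, auto-complete the following code.",
  prefix: "def add(a, b):\n    ",
  suffix: "\n    return result",
  add_generation_prompt: true,
});
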
hafriedlander commented 6 months ago

I believe I understand. I'll add another template to my PR (https://github.com/rjmacarthy/twinny/pull/174) in the next update.

@ChrisDeadman are you using ollama as the backend? (Or if not, what are you using?)

ChrisDeadman commented 6 months ago

> @ChrisDeadman are you using ollama as the backend? (Or if not, what are you using?)

I wrote a custom server - I added an Ollama-compatible API so I can run this extension against it. Internally, it uses Hugging Face chat templates to tokenize the chat completion messages. However, it does not apply any template to the prompt passed to the /generate endpoint.

hafriedlander commented 6 months ago

Ah. Ollama does normally wrap the prompt passed to /generate with a model-specific template, unless raw: true is part of the request (which twinny doesn't currently set).

I think probably all autocomplete requests should use raw: true though - at least starcoder2 requires it, and my PR currently assumes all models should use it.
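
i.e. roughly this request shape (untested sketch; the model name and FIM tokens are just an example):

// With raw: true Ollama sends the prompt verbatim instead of wrapping it in
// the model's own template, so the FIM template reaches the model unchanged.
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "starcoder2:3b",
    prompt: "<fim_prefix>def add(a, b):\n    <fim_suffix>\n    return result<fim_middle>",
    raw: true,
    stream: false,
  }),
});
const data = await res.json();
console.log(data.response); // the completion text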

Long term, it'd be cool to be able to edit both the chat and FIM templates as HBS in VS Code the same way command templates currently can be. For now I'll just add an extra template to the code.

rjmacarthy commented 6 months ago

Hey,

FYI, this should now work with any ChatML endpoint as the provider, since I added the ability to edit and choose a custom FIM template. The first test would be using the OpenAI API through LiteLLM with GPT-3.5 or GPT-4. I am still unsure whether Ollama supports ChatML? Also, should I add raw: true to the options, or make raw an option in the settings?

Here is a template I have been using with GPT-4 with pretty good success:

<|im_start|>system
You are an auto-completion coding assistant who uses a prefix and suffix to "fill_in_middle".<|im_end|>
<|im_start|>user
<prefix>{{{prefix}}}<fill_in_middle>{{{suffix}}}<end>
Only reply with pure code, no backticks, do not repeat code in the prefix or suffix, match brackets carefully.<|im_end|>
<|im_start|>assistant
Sure, here is the pure code auto-completion:
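
Against an OpenAI-compatible endpoint (e.g. a LiteLLM proxy) the same template maps onto chat messages roughly like this (untested sketch; prefix, suffix, apiKey and the proxy URL are placeholders, and whether the trailing assistant message acts as a prefill depends on the backend):

declare const prefix: string, suffix: string, apiKey: string;

const res = await fetch("http://localhost:4000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
  body: JSON.stringify({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content:
          'You are an auto-completion coding assistant who uses a prefix and suffix to "fill_in_middle".',
      },
      {
        role: "user",
        content:
          `<prefix>${prefix}<fill_in_middle>${suffix}<end>\n` +
          "Only reply with pure code, no backticks, do not repeat code in the prefix or suffix, match brackets carefully.",
      },
      // Seeded assistant turn, mirroring the template above.
      { role: "assistant", content: "Sure, here is the pure code auto-completion:" },
    ],
  }),
});
const data = await res.json();
const completion: string = data.choices[0].message.content;
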
ChrisDeadman commented 6 months ago

imo this looks like a great approach 👍🏼 Regarding raw, I second what @hafriedlander said, after RTFMing the docs.

CartoonFan commented 5 months ago

Sorry if I just missed it, but I don't really understand how to make this work. I'm currently running a fairly large model through Ollama (https://ollama.com/wojtek/beyonder), and it'd be great if I could use it for FIM as well.

Some additional info:

Editor: VSCodium
OS: Arch Linux
GPU: AMD Radeon RX 6800 XT (16 GB)
CPU: AMD Ryzen 7 3700X
RAM: 48 GB

Model's HuggingFace page: https://huggingface.co/mlabonne/Beyonder-4x7B-v3-GGUF

Thanks :sweat_smile: :pray: :purple_heart:

ChrisDeadman commented 5 months ago

So I tested this briefly with Llama-3 by selecting "custom template" in the FIM Provider settings and modifying the templates like so:

fim-system.hbs

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful, respectful and honest coding assistant.
Always reply using markdown.<|eot_id|>

fim.hbs

{{{systemMessage}}}<|start_header_id|>user<|end_header_id|>

Please respond with the code that is missing here:

```{{language}}
{{{prefix}}}<MISSING CODE>
{{{suffix}}}
```<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```{{language}}

What I found is:

But other than that it seems to work :smiley:

Here is a screenshot:

[screenshot]

It would be nice to be able to seed the model response with the last line of the prefix (e.g. thread. in my example), so that the model does not repeat it (by providing a {{getLastLine prefix}} helper, for example).
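
e.g. a sketch of such a helper (getLastLine is just a proposed name):

import Handlebars from "handlebars";

// Return the last, possibly partial, line of the prefix so the template can
// append it after the assistant header; the model then continues from it
// instead of repeating it.
Handlebars.registerHelper("getLastLine", (text: string) => {
  const lines = (text ?? "").split("\n");
  return lines[lines.length - 1];
});

// Usage at the end of fim.hbs: ```{{language}}\n{{{getLastLine prefix}}}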

rjmacarthy commented 5 months ago

Thanks @ChrisDeadman, I think this can be arranged. Is there any other data you'd like passed to the template?

Many thanks,

ChrisDeadman commented 5 months ago

Thanks @rjmacarthy! I cannot think of anything else that is missing at the moment; that should be enough to support ChatML and Llama-3 templates imo.

rjmacarthy commented 5 months ago

I've added the language to the template now, but I think there are still some inconsistencies in how it works which I need to iron out. I still had mixed results with llama3:8b.

ChrisDeadman commented 5 months ago

Did you also add an option to get the last line of the prefix in the template? With that added to the end of the template, the results should be much better.

rjmacarthy commented 5 months ago

No I didn't actually, I can add it.

ChrisDeadman commented 5 months ago

That would be awesome, I will do some tests when ready :smiley: