huggingface / swift-transformers

Swift Package to implement a transformers-like API in Swift
Apache License 2.0

Support Chat Template #77

Closed · maiqingqiang closed this issue 2 months ago

maiqingqiang commented 6 months ago

I want to port https://github.com/huggingface/transformers/blob/838b87abe231fd70be5132088d0dee72a7bb8d62/src/transformers/tokenization_utils_base.py#L1693

https://huggingface.co/docs/transformers/chat_templating#introduction

But I'm not sure whether this implementation is correct, and I hope you can give me some advice. If the approach is right, I will continue developing it.

@pcuenca

func testGemmaTokenizerChatTemplate() async throws {
    let tokenizer = try await AutoTokenizer.from(pretrained: "mlx-community/quantized-gemma-2b-it")
    print(tokenizer.applyChatTemplate(messages: ["user": "hi"], addGenerationPrompt: false))
}
pcuenca commented 6 months ago

Hi @maiqingqiang!

That'd be a great feature to have! πŸ”₯ However, tokenizer templates are different across models, and have finicky details in their use of bos tokens, the availability of system prompts, the use of variables, etc. We'd have to be very careful in each tokenizer's implementation to ensure it matches the reference. Furthermore, nothing prevents fine-tuned models from using different templates than the base model they were trained on, making things even more complicated.

I believe the best approach would be to read the chat template from the tokenizer configuration, like this one: https://huggingface.co/google/gemma-2b-it/blob/main/tokenizer_config.json#L59
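
For illustration, here's a minimal sketch of that approach using plain Foundation. The helper name and the direct-download URL are assumptions for this example, not swift-transformers' actual Hub API:

import Foundation

// Download a repo's tokenizer_config.json from the Hub and pull out the raw
// chat_template string. Configs can store templates in other shapes too, so a
// real implementation would need more defensive parsing.
func fetchChatTemplate(repo: String) async throws -> String? {
    let url = URL(string: "https://huggingface.co/\(repo)/resolve/main/tokenizer_config.json")!
    let (data, _) = try await URLSession.shared.data(from: url)
    let config = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    return config?["chat_template"] as? String
}

For gemma-2b-it, this would return the raw Jinja template string linked above.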

The template uses a subset of the Jinja2 syntax. Jinja2 is originally a Python engine, but it has been ported to other languages. I like @xenova's recent port to JavaScript, which focuses on the Jinja features required to parse these chat templates.

Is this approach something you'd be interested in pursuing? If so, my recommendation would be to follow @xenova's code and try to adapt it to idiomatic Swift. I'd do this in a separate SPM target because it could also be used independently. If you do decide to give it a go, we can provide some guidance and help you during the process!

maiqingqiang commented 6 months ago

Hi @pcuenca !

Thank you for your help!

My initial thought was also to read the tokenizer_config.json. However, I abandoned that idea because there is no Jinja2 implementation for Swift. Next, I plan to attempt porting Jinja2 to Swift, although that may be time-consuming. If the port succeeds, I will continue with the development.

pcuenca commented 6 months ago

Cool, let us know how it goes! Happy to help next week if you need it.

maiqingqiang commented 6 months ago

Currently, only Lexer.tokenize has been completed.

https://github.com/maiqingqiang/Jinja
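
To illustrate the lexing stage, a call might look roughly like this; the exact Lexer.tokenize signature and token representation are still evolving, so treat this as a hypothetical sketch:

import Jinja

// Hypothetical sketch: the lexer splits raw template text into a stream of
// tokens (statement delimiters, identifiers, literals, plain text, ...) for a
// parser/interpreter to consume. The exact API shape may differ.
let tokens = try Lexer.tokenize("{% for message in messages %}{{ message['content'] }}{% endfor %}")
print(tokens)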

maiqingqiang commented 6 months ago

Currently, Jinja Swift can render a small set of Jinja templates, and support for more Jinja syntax is being added!

https://github.com/maiqingqiang/Jinja/blob/main/Tests/JinjaTests/InterpreterTests.swift
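
For example, basic rendering works like this (the same Template API is used in the test class later in this thread):

import Jinja

// Render a simple template: the interpreter substitutes values from the
// context dictionary into the template text.
let template = try Template("Hello {{ name }}!")
let rendered = try template.render(["name": "World"])
print(rendered) // Hello World!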

pcuenca commented 6 months ago

Very nice progress @maiqingqiang! πŸ™Œ Let us know when you want a review :)

maiqingqiang commented 6 months ago

Many thanks to @pcuenca.

Jinja Swift can now render chat templates correctly. Some Jinja syntax is still unimplemented, but I think what's there is already sufficient for chat templates. The remaining syntax will be completed gradually.

It is ready for review now. πŸ™Œ

https://github.com/maiqingqiang/Jinja/blob/main/Tests/JinjaTests/ChatTemplateTests.swift
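
For example, a simplified chat-style template (a stand-in, much simpler than a real model's template) renders like this:

import Jinja

// Loop over chat messages and render each role/content pair. Real chat
// templates follow the same pattern, with extra control flow around
// bos/eos tokens and role validation.
let chatTemplate = try Template(
    "{% for message in messages %}{{ message['role'] }}: {{ message['content'] }}\n{% endfor %}"
)
let output = try chatTemplate.render([
    "messages": [
        ["role": "user", "content": "Hello, how are you?"],
        ["role": "assistant", "content": "I'm doing great."],
    ]
])
print(output)
// user: Hello, how are you?
// assistant: I'm doing great.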

Proryanator commented 5 months ago

I'm very interested in this work and would like to contribute! Thanks for opening this issue and getting the work done. Even if you're only able to add a generic-style prompt template with a few hard-coded implementations, we can open additional issues afterwards to add more prompt-template variants 👍

Much like what llama.cpp or LangChain provide.

maiqingqiang commented 5 months ago

Looking forward to having you on board! 🤝

maiqingqiang commented 4 months ago

Apologies for the delay in updates; I've been quite busy recently. I likely won't be able to post any until June.

maiqingqiang commented 2 months ago

> My apologies @maiqingqiang, I thought this was still in progress but it's in really great shape! I think we can merge it very quickly.
>
> I tested it with the following test class:
>
> import XCTest
> import Hub
> import Tokenizers
> import Jinja
>
> class TemplateTests: XCTestCase {
>     let mistralTemplate = "{{bos_token}}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + '[/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"
>
>     let messages = [
>         ["role": "user", "content": "Hello, how are you?"],
>         ["role": "assistant", "content": "I'm doing great. How can I help you today?"],
>     ]
>
>     func testMistralTemplateRender() throws {
>         let template = try Template(mistralTemplate)
>         let context: [String: Any] = [
>             "bos_token": "<s>",
>             "eos_token": "</s>",
>             "messages": messages,
>         ]
>         let rendered = try template.render(context)
>         let expected = "<s>[INST] Hello, how are you?[/INST] I'm doing great. How can I help you today?</s>"
>         XCTAssertEqual(rendered, expected)
>     }
>
>     func testMistralTemplateApply() async throws {
>         // Copy of Mistral tokenizer, for simplicity (does not require gating)
>         let tokenizer = try await AutoTokenizer.from(pretrained: "pcuenq/mistral-0.3-tokenizer")
>         let encoded = try tokenizer.applyChatTemplate(messages: messages)
>         let expected = [1, 3, 23325, 29493, 1678, 1228, 1136, 29572, 4, 1083, 29510, 29487, 3316, 2366, 29491, 2370, 1309, 1083, 2084, 1136, 3922, 29572, 2]
>         XCTAssertEqual(encoded, expected)
>     }
> }
>
> The first test passes, but the second one does not, because tokenization unconditionally adds an additional bos token. I don't think this is a problem introduced by this PR; instead, I think we may need to add some logic to indicate to the tokenizer whether special tokens should be added or skipped. I can help you investigate and track down the way it works in the Python codebase.
>
> I would recommend you add a test class similar to the one I posted. Or, for easier collaboration (I can't push to your branch), I can create a new PR and temporarily give you write permissions to the repo so we can both work on the same branch. Would that work for you, or do you have other suggestions for collaborating?

Of course, you can create a new PR directly. It's up to you. Looking forward to working with you πŸ’ͺ. However, I might be a bit busy during this period, so I may need some time before addressing your suggestions.

maiqingqiang commented 2 months ago

Perhaps the expected value is incorrect?


I ran the same prompt through the Python version of transformers, and the encoded result is consistent with mine.

from transformers import AutoTokenizer

prompt = "<s>[INST] Hello, how are you?[/INST] I'm doing great. How can I help you today?</s>"

print("pcuenq/mistral-0.3-tokenizer:", AutoTokenizer.from_pretrained("pcuenq/mistral-0.3-tokenizer")(prompt))
print("mistralai/Mistral-7B-v0.3:", AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")(prompt))

Output:

pcuenq/mistral-0.3-tokenizer: {'input_ids': [1, 1, 3, 23325, 29493, 1678, 1228, 1136, 29572, 4, 1083, 29510, 29487, 3316, 2366, 29491, 2370, 1309, 1083, 2084, 1136, 3922, 29572, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
mistralai/Mistral-7B-v0.3: {'input_ids': [1, 1, 3, 23325, 29493, 1678, 1228, 1136, 29572, 4, 1083, 29510, 29487, 3316, 2366, 29491, 2370, 1309, 1083, 2084, 1136, 3922, 29572, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

@pcuenca

pcuenca commented 2 months ago

> Perhaps the expected value is incorrect?

Hi @maiqingqiang! The default behaviour of the tokenizer is to add a bos token, so it indeed prepends a 1 (which is the bos token id in this tokenizer) when you tokenize a string manually.

However, apply_chat_template never adds special tokens to give full control to the template. This is the Python code I tested, and the actual tokenized input that is expected by the model:

>>> from transformers import AutoTokenizer
>>> t = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
>>> messages = [ {"role": "user", "content": "Hello, how are you?"}, {"role": "assistant", "content": "I'm doing great. How can I help you today?"} ]
>>> t.apply_chat_template(messages)
[1, 3, 23325, 29493, 1678, 1228, 1136, 29572, 4, 1083, 29510, 29487, 3316, 2366, 29491, 2370, 1309, 1083, 2084, 1136, 3922, 29572, 2]

To achieve this in Swift, we need to add support for an additional parameter to override the default behaviour when necessary. I have a local branch with the changes that I'll push shortly :)
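
As a toy sketch of the behaviour described here (names like addSpecialTokens are assumptions, not the final swift-transformers API):

// Toy model, not the library's real code: plain encoding prepends bos by
// default, while the chat-template path disables that because the rendered
// template text already contains <s> where the model expects it.
let bosId = 1

func encode(_ contentIds: [Int], addSpecialTokens: Bool = true) -> [Int] {
    addSpecialTokens ? [bosId] + contentIds : contentIds
}

// Ids produced by tokenizing the rendered template, which already starts
// with the template's own <s> token:
let renderedIds = [1, 3, 23325]
print(encode(renderedIds))                           // [1, 1, 3, 23325] (duplicated bos)
print(encode(renderedIds, addSpecialTokens: false))  // [1, 3, 23325] (matches apply_chat_template)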

maiqingqiang commented 2 months ago

I see 😄

maiqingqiang commented 2 months ago

Moved to #104.