Closed maiqingqiang closed 2 months ago
Hi @maiqingqiang!
That'd be a great feature to have! 🔥 However, tokenizer templates differ across models, with finicky details in their use of bos tokens, the availability of system prompts, the use of variables, and so on. We'd have to be very careful in each tokenizer's implementation to ensure it matches the reference. Furthermore, nothing prevents fine-tuned models from using templates different from the base model they were trained on, which complicates things even more.
I believe the best approach would be to read the chat template from the tokenizer configuration, like this one: https://huggingface.co/google/gemma-2b-it/blob/main/tokenizer_config.json#L59
The template uses a subset of the Jinja2 syntax. Jinja2 is originally a Python engine, but it has been ported to other languages. I like @xenova's recent port to JavaScript, which focuses on the Jinja features required to parse these chat templates.
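For illustration, the `chat_template` entry can be read like any other field in the config; here's a minimal Python sketch (the config contents below are abbreviated placeholders, not the real gemma file):

```python
import json

# Abbreviated stand-in for a tokenizer_config.json (the real file, linked
# above, contains the full template plus many other fields).
raw = json.dumps({
    "bos_token": "<s>",
    "chat_template": "{{ bos_token }}{% for message in messages %}{{ message['content'] }}{% endfor %}",
})

config = json.loads(raw)
template_source = config["chat_template"]  # this string is what the Jinja engine renders
print(template_source)
```

The template string itself is then handed to whatever Jinja engine is available in the host language.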
Is this approach something you'd be interested in pursuing? If so, my recommendation would be to follow @xenova's code and try to adapt it to idiomatic Swift. I'd do this in a separate SPM target because it could also be used independently. If you do decide to give it a go, we can provide some guidance and help you during the process!
Hi @pcuenca !
Thank you for your help!
My initial thought was also to read tokenizer_config.json. However, I abandoned that idea because there is no Jinja2 engine for Swift. Next, I plan to attempt porting Jinja2 to Swift, although that might be time-consuming. If I succeed, I will continue with the development.
Cool, let us know how it goes! Happy to help next week if you need it.
Currently, only `Lexer.tokenize` has been implemented.
Currently, Jinja Swift can render a small set of Jinja templates; support for the rest of the Jinja syntax is being added!
https://github.com/maiqingqiang/Jinja/blob/main/Tests/JinjaTests/InterpreterTests.swift
Very nice progress @maiqingqiang! Let us know when you want a review :)
Many thanks to @pcuenca.
Jinja Swift can now render chat templates correctly. Some Jinja syntax hasn't been implemented yet, but I think it's already sufficient for chat templates; the remaining syntax will be completed gradually.
It is ready for review now.
https://github.com/maiqingqiang/Jinja/blob/main/Tests/JinjaTests/ChatTemplateTests.swift
I'm very interested in this work and would like to contribute! Thanks for opening this issue and getting the work done. Even if you're only able to add a generic-style prompt template with a few hard-coded implementations, we can open additional issues after that to add more prompt template variants, much like llama.cpp or LangChain have.
Looking forward to your joining! 🤗
Apologies for the delay in updates; I've been quite occupied recently. I likely won't be able to post any until June.
My apologies @maiqingqiang, I thought this was still in progress but it's in really great shape! I think we can merge it very quickly.
I tested it with the following test class:
```swift
import XCTest
import Hub
import Tokenizers
import Jinja

class TemplateTests: XCTestCase {
    let mistralTemplate = "{{bos_token}}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + '[/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"

    let messages = [
        ["role": "user", "content": "Hello, how are you?"],
        ["role": "assistant", "content": "I'm doing great. How can I help you today?"],
    ]

    func testMistralTemplateRender() throws {
        let template = try Template(mistralTemplate)
        let context: [String: Any] = [
            "bos_token": "<s>",
            "eos_token": "</s>",
            "messages": messages,
        ]
        let rendered = try template.render(context)
        let expected = "<s>[INST] Hello, how are you?[/INST] I'm doing great. How can I help you today?</s>"
        XCTAssertEqual(rendered, expected)
    }

    func testMistralTemplateApply() async throws {
        // Copy of Mistral tokenizer, for simplicity (does not require gating)
        let tokenizer = try await AutoTokenizer.from(pretrained: "pcuenq/mistral-0.3-tokenizer")
        let encoded = try tokenizer.applyChatTemplate(messages: messages)
        let expected = [1, 3, 23325, 29493, 1678, 1228, 1136, 29572, 4, 1083, 29510, 29487, 3316, 2366, 29491, 2370, 1309, 1083, 2084, 1136, 3922, 29572, 2]
        XCTAssertEqual(encoded, expected)
    }
}
```
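As a sanity check on the `expected` string in the first test, the Mistral template above can be mimicked in plain Python (illustration only; the actual test goes through the Jinja engine):

```python
# Plain-Python rendering of the Mistral chat template, to sanity-check
# the expected string in testMistralTemplateRender.
bos, eos = "<s>", "</s>"
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

out = bos
for i, m in enumerate(messages):
    if (m["role"] == "user") != (i % 2 == 0):
        raise ValueError("Conversation roles must alternate user/assistant/...")
    if m["role"] == "user":
        out += "[INST] " + m["content"] + "[/INST]"
    elif m["role"] == "assistant":
        out += " " + m["content"] + eos
    else:
        raise ValueError("Only user and assistant roles are supported!")

print(out)
# → <s>[INST] Hello, how are you?[/INST] I'm doing great. How can I help you today?</s>
```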
The first test passes, but the second one does not, because tokenization unconditionally adds an additional bos token. I don't think this is a problem this PR introduced; instead, I think we may need to add some logic to indicate to the tokenizer whether special tokens need to be added or skipped. I can help you investigate and track down the way it works in the Python codebase.

I would recommend you add a test class similar to the one I posted. Or, for easier collaboration (I can't push to your branch), I can create a new PR and temporarily give you write permissions to the repo so we can both work on the same branch. Would that work for you, or do you have other suggestions to collaborate?
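To make the double-bos issue concrete, here's a tiny plain-Python sketch; the `encode` helper is hypothetical, standing in for the tokenizer's default behaviour, and `1` is Mistral's bos id:

```python
BOS_ID = 1  # Mistral's <s> token id

# Hypothetical stand-in for a tokenizer's encode step: by default it
# prepends bos, like the Swift tokenizer currently does unconditionally.
def encode(ids, add_special_tokens=True):
    return ([BOS_ID] + ids) if add_special_tokens else list(ids)

# The chat template already emitted bos as text, so the ids start with 1.
templated_ids = [1, 3, 23325]

print(encode(templated_ids))                            # → [1, 1, 3, 23325] (bos duplicated)
print(encode(templated_ids, add_special_tokens=False))  # → [1, 3, 23325] (what the template expects)
```

The proposed fix is essentially to expose that `add_special_tokens`-style switch so the chat-template path can turn the extra bos off.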
Of course, you can create a new PR directly. It's up to you. Looking forward to working with you πͺ. However, I might be a bit busy during this period, so I may need some time before addressing your suggestions.
Perhaps the `expected` is incorrect? I directly used the Python version of transformers, and the encoded result is consistent:
```python
from transformers import AutoTokenizer

prompt = "<s>[INST] Hello, how are you?[/INST] I'm doing great. How can I help you today?</s>"
print("pcuenq/mistral-0.3-tokenizer:", AutoTokenizer.from_pretrained("pcuenq/mistral-0.3-tokenizer")(prompt))
print("mistralai/Mistral-7B-v0.3:", AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")(prompt))
```

Output:

```
pcuenq/mistral-0.3-tokenizer: {'input_ids': [1, 1, 3, 23325, 29493, 1678, 1228, 1136, 29572, 4, 1083, 29510, 29487, 3316, 2366, 29491, 2370, 1309, 1083, 2084, 1136, 3922, 29572, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
mistralai/Mistral-7B-v0.3: {'input_ids': [1, 1, 3, 23325, 29493, 1678, 1228, 1136, 29572, 4, 1083, 29510, 29487, 3316, 2366, 29491, 2370, 1309, 1083, 2084, 1136, 3922, 29572, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
@pcuenca Perhaps the `expected` is incorrect?
Hi @maiqingqiang! The default behaviour of the tokenizer is to add a bos token, so it indeed prepends a `1` (which is the bos token id in this tokenizer) when you tokenize a string manually.

However, `apply_chat_template` never adds special tokens, to give full control to the template. This is the Python code I tested, and the actual tokenized input that is expected by the model:
```python
>>> from transformers import AutoTokenizer
>>> t = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
>>> messages = [
...     {"role": "user", "content": "Hello, how are you?"},
...     {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
... ]
>>> t.apply_chat_template(messages)
[1, 3, 23325, 29493, 1678, 1228, 1136, 29572, 4, 1083, 29510, 29487, 3316, 2366, 29491, 2370, 1309, 1083, 2084, 1136, 3922, 29572, 2]
```
To achieve this in Swift, we need to add support for an additional parameter to override the default behaviour when necessary. I have a local branch with the changes that I'll push shortly :)
I see!
Moved to #104.
I want to port https://github.com/huggingface/transformers/blob/838b87abe231fd70be5132088d0dee72a7bb8d62/src/transformers/tokenization_utils_base.py#L1693
https://huggingface.co/docs/transformers/chat_templating#introduction
But I'm not sure whether this implementation is correct, and I'd appreciate your advice. If the approach is right, I'll continue developing it.
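Conceptually, `apply_chat_template` renders the template to a string and then tokenizes it without adding special tokens again (the template already emitted bos/eos as text). A rough plain-Python sketch of that pipeline, with all names and the toy encoder purely illustrative:

```python
# Hypothetical sketch of the apply_chat_template pipeline; not the real
# transformers method, which handles many more options.

def render_mistral(messages, bos="<s>", eos="</s>"):
    # Stand-in for the Jinja rendering step (Mistral-style template).
    out = bos
    for m in messages:
        if m["role"] == "user":
            out += "[INST] " + m["content"] + "[/INST]"
        else:
            out += " " + m["content"] + eos
    return out

def apply_chat_template(messages, encode):
    text = render_mistral(messages)
    # Special tokens are NOT added here: the template has full control.
    return encode(text, add_special_tokens=False)

# Toy encoder: maps each whitespace-separated piece to an integer id,
# prepending 1 (playing the role of bos) unless told otherwise.
_vocab = {}
def toy_encode(text, add_special_tokens=True):
    ids = [_vocab.setdefault(tok, len(_vocab) + 1) for tok in text.split()]
    return ([1] + ids) if add_special_tokens else ids

messages = [{"role": "user", "content": "Hello"}]
print(apply_chat_template(messages, toy_encode))  # → [1, 2]
```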
@pcuenca