Closed: Jonathan-Dobson closed this issue 2 months ago
I'm wondering why not use the Hugging Face transformers `AutoTokenizer.from_pretrained()`?
```python
from mlx_lm.utils import load, generate
from transformers import AutoTokenizer

MODEL = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
PROMPT = "what is the sun made of?"
MESSAGES = [{"role": "user", "content": PROMPT}]

model, tokenizer = load(MODEL)
transformers_tokenizer = AutoTokenizer.from_pretrained(MODEL)
prompt = transformers_tokenizer.apply_chat_template(
    MESSAGES, tokenize=False, add_generation_prompt=True
)
```
I tried using it and the generated response is usable.
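For completeness, this is roughly how I then ran generation with the templated prompt; a minimal sketch assuming the `generate(model, tokenizer, prompt=..., verbose=...)` signature shown in the mlx-lm README:

```python
from mlx_lm import load, generate
from transformers import AutoTokenizer

MODEL = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"

model, tokenizer = load(MODEL)
hf_tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Build the prompt with the HF tokenizer's chat template, then generate with mlx-lm.
prompt = hf_tokenizer.apply_chat_template(
    [{"role": "user", "content": "what is the sun made of?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Assumes the generate(model, tokenizer, prompt=..., verbose=...) signature
# from the mlx-lm README; adjust if your version differs.
response = generate(model, tokenizer, prompt=prompt, verbose=True)
print(response)
```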
The `TokenizerWrapper` class essentially wraps the HF tokenizer but uses a custom decoder method for much faster streaming detokenization.
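In practice that should mean the tokenizer returned by `load()` can be used much like the HF tokenizer directly. A minimal sketch, assuming the wrapper forwards unknown attributes such as `apply_chat_template` to the wrapped HF tokenizer:

```python
from mlx_lm import load

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# Assumption: the wrapper forwards attribute lookups to the underlying HF
# tokenizer, so the chat template can be applied without a separate
# AutoTokenizer.from_pretrained() call.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "what is the sun made of?"}],
    tokenize=False,
    add_generation_prompt=True,
)
```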
Given I follow the first example in the mlx-lm PyPI docs (roughly the snippet sketched after this list), when using the `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` model:

- Then the response is unusable.
- And when adding in `apply_chat_template`, the response is usable.
- And when checking using the `mlx_lm.generate` command, the response is also usable.
- And when checking with a non-MLX instance, e.g. `llama3.1:8b-instruct-q8_0` running on ollama, the response is also usable.
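For reference, a sketch of that first PyPI example as I understand it (the exact snippet in the docs may differ); it sends the raw prompt to the instruct model without applying a chat template, which is what produced the unusable output:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# Raw prompt, no chat template applied. For an instruct-tuned model this
# tends to produce the unusable output described above.
response = generate(
    model, tokenizer, prompt="what is the sun made of?", verbose=True
)
```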
Also, using `tokenizer.apply_chat_template()` causes a type linter error in VS Code.
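One possible way to quiet that linter error, assuming the wrapper behaves like an HF tokenizer at runtime (this is a workaround sketch, not part of the mlx-lm API):

```python
from typing import cast

from transformers import PreTrainedTokenizer
from mlx_lm import load

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# Tell the type checker to treat the wrapper as an HF tokenizer. This only
# silences the static error; it relies on the (assumed) runtime forwarding
# of apply_chat_template to the wrapped HF tokenizer.
hf_like = cast(PreTrainedTokenizer, tokenizer)
prompt = hf_like.apply_chat_template(
    [{"role": "user", "content": "what is the sun made of?"}],
    tokenize=False,
    add_generation_prompt=True,
)
```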