ml-explore / mlx-examples

Examples in the MLX framework

DBRX Instruct prompt templating issue #637

Closed · eek closed this 7 months ago

eek commented 7 months ago

Hi there!

Wanted to say congrats to @awni for the work on the DBRX support.

I've also converted and uploaded the dbrx-instruct version on HF: https://huggingface.co/mlx-community/dbrx-instruct-4bit

It works OK with no prompt templating, but the Instruct model works much better with templating, and there I've run into a small issue:

If I just do the following:

<|im_start|>user
What's the difference between PCA vs UMAP vs t-SNE?<|im_end|>

and do not add the assistant part, it errors with:

Traceback (most recent call last):
  File "/Users/eek/work/dbrx/template.py", line 15, in <module>
    response = generate(model, tokenizer, prompt=prompt, verbose=True)
  File "/Users/eek/.pyenv/versions/3.10.12/lib/python3.10/site-packages/mlx_lm-0.4.0-py3.10.egg/mlx_lm/utils.py", line 273, in generate
    prompt_tps = prompt_tokens.size / prompt_time
UnboundLocalError: local variable 'prompt_time' referenced before assignment

If I do:

<|im_start|>user
What's the difference between PCA vs UMAP vs t-SNE?<|im_end|>
<|im_start|>assistant

it works, but then I get an instant <|im_end|> and generation ends.

The best result I've had so far was:

<|im_start|>system
You are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.
YOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.
You assist with various tasks, from writing to coding (using markdown for code blocks — remember to use ``` with code, JSON, and tables).
(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)
This is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. You should be responding appropriately and usually that means not mentioning this.
YOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.<|im_end|>
<|im_start|>user
What's the difference between PCA vs UMAP vs t-SNE?<|im_end|>
<|im_start|>assistant
The difference

Here I've also added a couple of words after the assistant start, and this works well.

This is my bash command:

python -m mlx_lm.generate --model dbrx-instruct-4bit --prompt "$(cat my_prompt)"  --trust-remote-code --max-tokens 1000

where the prompt above is saved in the my_prompt file.

Here's the equivalent Python script:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/dbrx-instruct-4bit")

chat = [
    {"role": "user", "content": "What's the difference between PCA vs UMAP vs t-SNE?"},
]

prompt = tokenizer.apply_chat_template(chat, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, verbose=True)

The mlx-lm I have locally is at the latest commit, b80adbc.

eek commented 7 months ago

It seems that if I disable verbose=True it no longer errors, but then I have the issue of nothing being generated.

The last working script I have is:

from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/dbrx-instruct-4bit",
    tokenizer_config={"trust_remote_code": True}
)

chat = [
    {"role": "user", "content": "What's the difference between PCA vs UMAP vs t-SNE?"},
    {"role": "assistant", "content": "The "},
]

prompt = tokenizer.apply_chat_template(chat, tokenize=False)

# We need to remove the last <|im_end|> token so that the AI continues generation
prompt = prompt[::-1].replace("<|im_end|>"[::-1], "", 1)[::-1]

response = generate(model, tokenizer, prompt=prompt, verbose=True, temp=0.6, max_tokens=1500)
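
An equivalent and perhaps more readable way to drop only the final <|im_end|> is str.rfind; the helper below is just an illustrative sketch of the same string operation (strip_last_im_end is not part of mlx_lm):

# Sketch: remove only the last occurrence of the end token so the model
# keeps generating from the partial assistant turn; this is equivalent to
# the reverse/replace/reverse trick above.
def strip_last_im_end(text, token="<|im_end|>"):
    idx = text.rfind(token)
    return text if idx == -1 else text[:idx] + text[idx + len(token):]

prompt = strip_last_im_end(prompt)
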
mzbac commented 7 months ago

Maybe not related, but you can pass --use-default-chat-template to mlx_lm.generate to enable the tokenizer's default chat template when the model relies on it, e.g.:

python -m mlx_lm.generate --model dbrx-instruct-4bit --prompt "$(cat my_prompt)"  --trust-remote-code --use-default-chat-template --max-tokens 1000
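
Roughly speaking, the flag falls back to the tokenizer's built-in template when no custom chat template is configured. The snippet below is only a paraphrase of that behavior, not the exact mlx_lm source, and default_chat_template is the transformers attribute as it existed around the time of this thread:

from transformers import AutoTokenizer

# Paraphrase of the --use-default-chat-template behavior (not the exact
# mlx_lm source): when the flag is set and the tokenizer has no custom
# chat template, fall back to its built-in default template.
use_default_chat_template = True  # stands in for the CLI flag

tokenizer = AutoTokenizer.from_pretrained(
    "mlx-community/dbrx-instruct-4bit", trust_remote_code=True
)
if use_default_chat_template and tokenizer.chat_template is None:
    # Newer transformers versions may no longer expose default_chat_template.
    tokenizer.chat_template = tokenizer.default_chat_template
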
lin72h commented 7 months ago

I learned about --use-default-chat-template the hard way 😅

eek commented 7 months ago

It seems that if I use --use-default-chat-template, it does indeed work.

So the only remaining issue is via the Python script; the following code errors:

from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/dbrx-instruct-4bit",
    tokenizer_config={"trust_remote_code": True}
)

chat = [
    {"role": "user", "content": "What's the difference between PCA vs UMAP vs t-SNE?"},
]

prompt = tokenizer.apply_chat_template(chat, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
==========
Traceback (most recent call last):
  File "/Users/eek/work/dbrx/test2.py", line 14, in <module>
    response = generate(model, tokenizer, prompt=prompt, verbose=True)
  File "/Users/eek/.pyenv/versions/3.10.12/lib/python3.10/site-packages/mlx_lm-0.4.0-py3.10.egg/mlx_lm/utils.py", line 273, in generate
    prompt_tps = prompt_tokens.size / prompt_time
UnboundLocalError: local variable 'prompt_time' referenced before assignment

awni commented 7 months ago

Seems like you aren't getting any output using that prompt 🤔 (which is triggering an edge case and causing the crash).
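
In other words, nothing is generated before the end token, so the timing variable in generate() is never assigned. A minimal, self-contained illustration of that Python pitfall (not the actual mlx_lm source):

# A variable assigned only inside a loop stays undefined when the loop
# produces no iterations, so using it afterwards raises UnboundLocalError.
def stats(tokens):
    for n, token in enumerate(tokens):
        if n == 0:
            prompt_time = 0.5  # pretend this is the measured prompt time
    return len(tokens) / prompt_time

try:
    stats([])  # empty output, mirroring the no-generation case above
except UnboundLocalError as err:
    print(err)  # "local variable 'prompt_time' referenced before assignment" on Python 3.10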

It works if you do the following when you apply the template:

prompt = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
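
Putting that together with the loading code from earlier in the thread, a minimal end-to-end sketch would look roughly like this (same model and question as above; max_tokens is just an example value):

from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/dbrx-instruct-4bit",
    tokenizer_config={"trust_remote_code": True}
)

chat = [
    {"role": "user", "content": "What's the difference between PCA vs UMAP vs t-SNE?"},
]

# add_generation_prompt=True appends the <|im_start|>assistant header,
# so the model starts answering instead of emitting <|im_end|> right away.
prompt = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=1000)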