ahmetkca opened this issue 3 months ago
It has a chat template, so you can use `tokenizer.apply_chat_template` directly instead of doing the role mapping yourself:
```python
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```
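For context, a complete call might look roughly like the sketch below, assuming mlx_lm's `load` and `generate` helpers; the repo id and messages are placeholders, so substitute your own converted model:

```python
from mlx_lm import load, generate

# Placeholder repo id; point this at your own converted Phi-3 model
# (local path or Hugging Face repo).
model, tokenizer = load("your-username/phi3-128k-instruct-mlx")

messages = [
    {"role": "user", "content": "Summarize what unified memory is in one sentence."},
]

# The chat template inserts the model's special tokens (role markers,
# end-of-turn tokens) for you, so no manual role mapping is needed.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
```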
Thanks a lot! I wouldn't have known about the tokenizer's `apply_chat_template` method if I hadn't asked here.
Where or how can I learn more about these types of features? Another question I have is: How can I stream the model's response instead of waiting for the entire response to complete?
> Another question I have is: How can I stream the model's response instead of waiting for the entire response to complete?
If you want to print the streaming output to the console, you can pass `verbose=True` to `generate`. If you are trying to do something different, let us know your use case and maybe we can make it work.
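For example, a sketch using the same `load`/`generate` calls as above (the repo id is again a placeholder):

```python
from mlx_lm import load, generate

model, tokenizer = load("your-username/phi3-128k-instruct-mlx")  # placeholder repo id

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple silicon."}],
    tokenize=False,
    add_generation_prompt=True,
)

# verbose=True prints tokens to the console as they are generated,
# in addition to returning the full text at the end.
response = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
```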
I have tried using `verbose=True`. However, I was asking about a streaming method more akin to how OpenAI handles streaming. Currently, I need to wait for the response to finish before I can use it. By the way, would using the `tokenizer.apply_chat_template` method make the model stop where it should?
You can take a look at the mlx_lm server's implementation here: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/server.py. It is only a few hundred lines of code and quite self-contained.
For more information, you can also refer to the SERVER.md file here: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/SERVER.md.
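If your installed version of mlx_lm already exposes `stream_generate`, a minimal streaming loop can look like the sketch below (the repo id is a placeholder; depending on the version, the generator yields plain text chunks or response objects with a `.text` field, hence the `isinstance` check). Otherwise, the `generate_step` loop in server.py is the pattern to copy.

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("your-username/phi3-128k-instruct-mlx")  # placeholder repo id

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain KV caching briefly."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Consume tokens as they are produced instead of waiting for the full response.
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=256):
    text = chunk if isinstance(chunk, str) else chunk.text  # API differs across versions
    print(text, end="", flush=True)
print()
```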
I am relatively new to running inference on my own. Previously, I used ollama, but recently I decided to try out mlx since I have an M3 with sufficient unified memory and I was curious about how it compares to llama.cpp in terms of speed.
I have been trying to run phi3-128k-instruct. I converted the model to an MLX-compatible format myself and uploaded it to my Hugging Face repository.
Unlike Meta's Llama 3 models, which are well documented (e.g., https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3), Microsoft doesn't provide as thorough an explanation of how to format chat prompts and use special tokens with their models.
Here is the code snippet I am using for inference:
This issue may not be directly related to mlx, but I need assistance with properly formatting prompts and using special tokens. I have tried running phi3 on HuggingChat, and there is a notable difference in the outputs. The responses from HuggingChat are significantly better compared to when I run the model locally with mlx. I would appreciate any guidance or recommendations on what I might be doing wrong.
Here is the response I am getting: