ahmetkca opened this issue 3 months ago
It has a chat template, so you can use `tokenizer.apply_chat_template` directly instead of doing the role mapping yourself:
```python
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```
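For context, a complete call might look roughly like the sketch below, assuming mlx_lm's `load` and `generate` helpers; the repo id and messages are placeholders, so substitute your own converted model:

```python
from mlx_lm import load, generate

# Placeholder repo id; point this at your own converted Phi-3 model
# (local path or Hugging Face repo).
model, tokenizer = load("your-username/phi3-128k-instruct-mlx")

messages = [
    {"role": "user", "content": "Summarize what unified memory is in one sentence."},
]

# The chat template inserts the model's special tokens (role markers,
# end-of-turn tokens) for you, so no manual role mapping is needed.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
```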
Thanks a lot! I wouldn't have known about the tokenizer's `apply_chat_template` method if I hadn't asked here.
Where or how can I learn more about these types of features? Another question I have is: How can I stream the model's response instead of waiting for the entire response to complete?
> Another question I have is: How can I stream the model's response instead of waiting for the entire response to complete?
If you want to print the streaming output to the console, you can pass `verbose=True` to `generate`. If you are trying to do something different, let us know your use case and maybe we can make it work.
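For example, a sketch using the same `load`/`generate` calls as above (the repo id is again a placeholder):

```python
from mlx_lm import load, generate

model, tokenizer = load("your-username/phi3-128k-instruct-mlx")  # placeholder repo id

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple silicon."}],
    tokenize=False,
    add_generation_prompt=True,
)

# verbose=True prints tokens to the console as they are generated,
# in addition to returning the full text at the end.
response = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
```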
I have tried using `verbose=True`. However, I was asking about a streaming method more akin to how OpenAI handles streaming. Currently, I need to wait for the response to finish before I can use it. By the way, would using the `tokenizer.apply_chat_template` method make the model stop where it should?
You can take a look at the mlx_lm server's implementation here: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/server.py. It is only a few hundred lines of code and quite self-contained.
For more information, you can also refer to the SERVER.md file here: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/SERVER.md.
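If your installed version of mlx_lm already exposes `stream_generate`, a minimal streaming loop can look like the sketch below (the repo id is a placeholder; depending on the version, the generator yields plain text chunks or response objects with a `.text` field, hence the `isinstance` check). Otherwise, the `generate_step` loop in server.py is the pattern to copy.

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("your-username/phi3-128k-instruct-mlx")  # placeholder repo id

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain KV caching briefly."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Consume tokens as they are produced instead of waiting for the full response.
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=256):
    text = chunk if isinstance(chunk, str) else chunk.text  # API differs across versions
    print(text, end="", flush=True)
print()
```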
I am relatively new to running inference on my own. Previously, I used ollama, but recently I decided to try out mlx since I have an M3 with sufficient unified memory and I was curious about how it compares to llama.cpp in terms of speed.
I have been trying to run phi3-128k-instruct. I converted the model to an MLX-compatible format myself and uploaded it to my Hugging Face repository.
Unlike Meta's Llama 3 models, which are well documented (e.g., https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3), Microsoft doesn't provide as thorough an explanation of how to format chat prompts and use special tokens with their models.
Here is the code snippet I am using for inference:
This issue may not be directly related to mlx, but I need assistance with properly formatting prompts and using special tokens. I have tried running phi3 on HuggingChat, and there is a notable difference in the outputs. The responses from HuggingChat are significantly better compared to when I run the model locally with mlx. I would appreciate any guidance or recommendations on what I might be doing wrong.
Here is the response I am getting: