Closed: Jonathan-Dobson closed this issue 2 months ago
I'm wondering why not use the Hugging Face transformers `AutoTokenizer.from_pretrained()`?
```python
from mlx_lm.utils import load, generate
from transformers import AutoTokenizer

MODEL = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"
PROMPT = "what is the sun made of?"
MESSAGES = [{"role": "user", "content": PROMPT}]

model, tokenizer = load(MODEL)
transformers_tokenizer = AutoTokenizer.from_pretrained(MODEL)
prompt = transformers_tokenizer.apply_chat_template(
    MESSAGES, tokenize=False, add_generation_prompt=True
)
```
I tried using it and the generated response is usable.
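For completeness, this is roughly how I then ran generation with the templated prompt; a minimal sketch assuming the `generate(model, tokenizer, prompt=..., verbose=...)` signature shown in the mlx-lm README:

```python
from mlx_lm import load, generate
from transformers import AutoTokenizer

MODEL = "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit"

model, tokenizer = load(MODEL)
hf_tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Build the prompt with the HF tokenizer's chat template, then generate with mlx-lm.
prompt = hf_tokenizer.apply_chat_template(
    [{"role": "user", "content": "what is the sun made of?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Assumes the generate(model, tokenizer, prompt=..., verbose=...) signature
# from the mlx-lm README; adjust if your version differs.
response = generate(model, tokenizer, prompt=prompt, verbose=True)
print(response)
```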
The `TokenizerWrapper` class essentially wraps the HF tokenizer but uses a custom decoder method for much faster streaming detokenization.
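In practice that should mean the tokenizer returned by `load()` can be used much like the HF tokenizer directly. A minimal sketch, assuming the wrapper forwards unknown attributes such as `apply_chat_template` to the wrapped HF tokenizer:

```python
from mlx_lm import load

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# Assumption: the wrapper forwards attribute lookups to the underlying HF
# tokenizer, so the chat template can be applied without a separate
# AutoTokenizer.from_pretrained() call.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "what is the sun made of?"}],
    tokenize=False,
    add_generation_prompt=True,
)
```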
Given I follow the first example in the mlx-lm PyPI docs (roughly the snippet sketched after this list), when using the `mlx-community/Meta-Llama-3.1-8B-Instruct-4bit` model:

- Then the response is unusable.
- And when adding in `apply_chat_template`, the response is usable.
- And when checking using the `mlx_lm.generate` command, the response is also usable.
- And when checking with a non-MLX instance, e.g. `llama3.1:8b-instruct-q8_0` running on ollama, the response is also usable.
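For reference, a sketch of that first PyPI example as I understand it (the exact snippet in the docs may differ); it sends the raw prompt to the instruct model without applying a chat template, which is what produced the unusable output:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# Raw prompt, no chat template applied. For an instruct-tuned model this
# tends to produce the unusable output described above.
response = generate(
    model, tokenizer, prompt="what is the sun made of?", verbose=True
)
```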
Also, using `tokenizer.apply_chat_template()` causes a type linter error in VS Code.
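One possible way to quiet that linter error, assuming the wrapper behaves like an HF tokenizer at runtime (this is a workaround sketch, not part of the mlx-lm API):

```python
from typing import cast

from transformers import PreTrainedTokenizer
from mlx_lm import load

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

# Tell the type checker to treat the wrapper as an HF tokenizer. This only
# silences the static error; it relies on the (assumed) runtime forwarding
# of apply_chat_template to the wrapped HF tokenizer.
hf_like = cast(PreTrainedTokenizer, tokenizer)
prompt = hf_like.apply_chat_template(
    [{"role": "user", "content": "what is the sun made of?"}],
    tokenize=False,
    add_generation_prompt=True,
)
```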