rasbt / LLMs-from-scratch

Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167

Llama 3.2 standalone #418

Closed · d-kleine closed this 1 month ago

d-kleine commented 1 month ago

I have an issue with the Llama 3.2 notebook, which uses the instruction-finetuned version of Llama 3.2. For the 1B model, the current code as provided on GitHub produces this output for me:

Output text: <|start_header_id|>user<|end_header_id|>

What do llamas eat?<|eot_id|><|start_header_id|><|start_header_id|><|start_header_id|> ...

but it should be

Output text: Llamas are herbivores, which means they primarily eat plants and plant-based foods. Their diet typically consists of:

  1. Grasses: Llamas love to graze on various types of grasses, including tall grasses and short grasses.
  2. Hay: Llamas also eat hay, which is a dry, compressed form of grass or other plants.
  3. Alfalfa: Alfalfa is a legume that is commonly fed to llamas, as it is high in protein and fiber.
  4. Other plants: Llamas will also eat other plants, such as clover, wild grasses, and shrubs.

It's worth noting that llamas are adapted to high altitudes and

I have also run the 3B model, which surprisingly worked as it should (I actually think it would be better to remove the 3B model from the code to keep it short and simple).

I have tried to fix this issue but could not find the root of the problem. Up to that state of the code, everything worked fine. I suspect the path changes might be the problem, but I could not find the culprit(s).

Do you have any idea?

rasbt commented 1 month ago

Good catch. I think the rope_theta change from 50_000 to 500_000 caused it: https://github.com/rasbt/LLMs-from-scratch/commit/534a7043642777205817edd7350ad0ea653fc03c. I don't know why this would happen, though. The original model config says it is 500_000: https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/config.json

The unit tests also implement RoPE with 500_000 and show that the implementations are equivalent, so I am not sure why the model works fine with 50_000.

d-kleine commented 1 month ago

Oh, wow! I think the theta base must be changed due to the different context length. In the notebook, we are decreasing it to `LLAMA32_CONFIG["context_length"] = 8192` for running the model locally, but the original model uses

```python
LLAMA32_CONFIG = {
    "context_length": 131_072,   # Context length
    ...
    "rope_base": 500_000,        # The base in RoPE's "theta"
    ...
}
```

I think the theta base must be scaled for the smaller context size: https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models
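
Something along these lines could work (just a sketch; `rescale_theta` is a name I made up here, and I am assuming the notebook's `LLAMA32_CONFIG` keys):

```python
# Sketch: scale RoPE's theta base proportionally to the reduced context length.
# (rescale_theta is a hypothetical helper name; the actual fix may look different.)
def rescale_theta(theta_old, context_length_old, context_length_new):
    scaling_factor = context_length_new / context_length_old
    return theta_old * scaling_factor

old_context_length = 131_072                 # original Llama 3.2 context length
LLAMA32_CONFIG["context_length"] = 8192      # reduced for running locally

LLAMA32_CONFIG["rope_base"] = rescale_theta(
    LLAMA32_CONFIG["rope_base"],             # 500_000 in the original config
    old_context_length,
    LLAMA32_CONFIG["context_length"],
)
print(LLAMA32_CONFIG["rope_base"])           # 500_000 * (8192 / 131_072) = 31_250.0
```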

rasbt commented 1 month ago

Oh good call, I totally forgot about the manual resizing. Rescaling it proportionally seems to do the trick: https://app.reviewnb.com/rasbt/LLMs-from-scratch/pull/419/
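
For anyone reading along, here is roughly why the base matters: theta sets RoPE's inverse frequencies, so it should shrink together with the context window. This is only an illustrative sketch (head_dim is an arbitrary example value, not taken from the 1B config):

```python
import torch

head_dim = 64                       # illustrative head dimension, not the 1B model's
positions = torch.arange(8192)      # positions for the reduced context window

# Standard RoPE: inv_freq_i = 1 / theta^(2i / head_dim)
def rope_angles(theta):
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions.float(), inv_freq)   # (num_positions, head_dim / 2)

# A smaller theta (e.g. the rescaled 31_250 vs. the original 500_000) makes the
# low-frequency components rotate faster, compressing the positional "wavelengths"
# to match the shorter 8192-token context.
angles_short = rope_angles(31_250.0)
angles_long = rope_angles(500_000.0)
```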

Thanks for the article; I have to read it carefully sometime to see if there is anything else that needs to be done.

d-kleine commented 1 month ago

BTW, it is working in the converting-llama2-to-llama3.ipynb notebook, which uses the base text-generation model. But standalone-llama32.ipynb uses the instruction-finetuned model.

So the config to link would be the one for the instruction-finetuned Llama 3.2 1B model: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/blob/main/config.json


I just exchanged meta-llama/Llama-3.2-1B-Instruct with meta-llama/Llama-3.2-1B in standalone-llama32.ipynb, and the latter works.

> Oh good call, I totally forgot about the manual resizing.

But why would you need to rescale theta for the Llama-3.2-1B-Instruct model if it works for Llama-3.2-1B?

rasbt commented 1 month ago

I think the fact that it works for the regular Llama-3.2-1B model is luck; the model is robust against some numerical changes, which is why things like precision changes and quantization mostly work okay.

d-kleine commented 1 month ago

Okay, good to know. Learned a lot of new stuff today - thanks for fixing the issue!