Good catch. I think the rope_theta change from 50_000 to 500_000 caused it: https://github.com/rasbt/LLMs-from-scratch/commit/534a7043642777205817edd7350ad0ea653fc03c. I don't know why this would happen, though. The original model config says it is 500_000: https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/config.json
The unit tests also implement the RoPE with 500_000 and show that the implementations are equivalent, so I am not sure why the model works fine with 50_000.
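For context, here is a minimal sketch of how the theta base enters the RoPE frequencies in the standard formulation (variable names are illustrative, not the exact code from the repo):

```python
import torch

def rope_inv_freq(head_dim, theta_base=500_000):
    # RoPE rotates pairs of dimensions; pair i rotates with frequency
    # 1 / theta_base^(2i / head_dim), so a larger theta_base produces
    # slower-rotating (longer-wavelength) position signals.
    exponents = torch.arange(0, head_dim, 2).float() / head_dim
    return 1.0 / (theta_base ** exponents)

# The angle applied at position p for frequency pair i is p * inv_freq[i].
inv_freq = rope_inv_freq(head_dim=64, theta_base=500_000)
positions = torch.arange(8192)
angles = positions[:, None] * inv_freq[None, :]  # shape: (8192, head_dim // 2)
```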
Oh, wow! I think the theta base must be changed due to the different context length. In the notebook, we are decreasing it to LLAMA32_CONFIG["context_length"] = 8192 for running the model locally, but the original model is set for:
    LLAMA32_CONFIG = {
        "context_length": 131_072,  # Context length
        ...
        "rope_base": 500_000,       # The base in RoPE's "theta"
I think the theta base must be scaled for the smaller context size: https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models
Oh good call, I totally forgot about the manual resizing. Rescaling it proportionally seems to do the trick: https://app.reviewnb.com/rasbt/LLMs-from-scratch/pull/419/
Thanks for the article; I'll have to read it carefully sometime to see if there is anything else that needs to be done.
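Roughly, the proportional rescaling amounts to this (a minimal sketch; the exact notebook code may differ):

```python
# Original (published) Llama 3.2 settings
orig_context_length = 131_072
orig_rope_base = 500_000.0

# Reduced context length for running the model locally
new_context_length = 8_192

# Scale theta proportionally to the context-length reduction so the RoPE
# wavelengths keep the same relationship to the context window.
new_rope_base = orig_rope_base * new_context_length / orig_context_length
print(new_rope_base)  # 31250.0
```

With a 131_072 / 8_192 = 16x reduction, the base comes down to 31_250, which may also explain why the earlier 50_000 value happened to be close enough to work.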
BTW, it is working in the converting-llama2-to-llama3.ipynb notebook, which uses the text-generation model. But standalone-llama32.ipynb uses the instruction-finetuned model.
So the link for the config would be the one for the instruction-finetuned Llama 3.2 1B model: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/blob/main/config.json
I have just exchanged meta-llama/Llama-3.2-1B-Instruct with meta-llama/Llama-3.2-1B in standalone-llama32.ipynb, and the latter works.
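For illustration, the swap comes down to the Hugging Face repo id used when downloading the weights (a sketch, not the exact notebook cell; the checkpoint filename is an assumption):

```python
from huggingface_hub import hf_hub_download

# Instruction-finetuned model used in standalone-llama32.ipynb:
repo_id = "meta-llama/Llama-3.2-1B-Instruct"
# Base text-generation model used in converting-llama2-to-llama3.ipynb:
# repo_id = "meta-llama/Llama-3.2-1B"

weights_file = hf_hub_download(
    repo_id=repo_id,
    filename="model.safetensors",  # assumed single-file checkpoint for the 1B models
    local_dir=repo_id.split("/")[-1],
)
```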
> Oh good call, I totally forgot about the manual resizing.

But why would you need to rescale theta for the Llama-3.2-1B-Instruct model if it works for Llama-3.2-1B?
I think the fact that it works for the regular Llama-3.2-1B model is luck; the model is robust against some numerical changes, which is why things like precision changes and quantization mostly work okay.
Okay, good to know. Learned a lot of new stuff today - thanks for fixing the issue!
I have an issue with the Llama 3.2 notebook. This uses the instruction-finetuned version of Llama 3.2. For the 1B model, the current code as provided on GitHub produces this output for me:
but it should be:
I have also run the 3B model, which surprisingly worked as it should (I actually think it would be better to remove the 3B model from the code to keep it short and simple).
I have tried to fix this issue but could not find the root of the problem. Up to this state, everything worked fine. I suspect that the path changes might be the problem, but I could not find the culprit(s).
Do you have any idea?