tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License
2.09k stars 140 forks

Remove Falcon style ROPE #35

Closed magician-blue closed 11 months ago

magician-blue commented 11 months ago

All HF llama models use Falcon-style RoPE, and we can convert them to the original llama-style RoPE with a permutation. This pull request solves the bug when converting HF GQA models to gguf format; I learned the idea from it and fixed the similar bug in llama2.c's export.py. Now I have successfully converted TinyLlama-1.1B-Chat to llama-style RoPE, so we can remove the Falcon RoPE part. I have uploaded the new export.py and llama2.mojo.

Details: run

    python export.py tl-chat.bin --hf PY007/TinyLlama-1.1B-Chat-v0.2 --version 0

to convert the model.
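
For readers unfamiliar with the conversion, the sketch below illustrates (in plain numpy) the kind of head-wise row permutation that maps an HF/Falcon-style split-half rotary layout back to the original interleaved llama layout. This is an illustration only, not the actual code in export.py; the function name unpermute_rope and the variable names are assumptions.

    import numpy as np

    # Illustrative sketch (assumed names, not the exact code in export.py):
    # map an HF/Falcon-style split-half rotary layout back to the original
    # llama interleaved layout by permuting the output rows of a
    # (dim_out, dim_in) projection weight, head by head.
    def unpermute_rope(w: np.ndarray, n_heads: int) -> np.ndarray:
        dim_out, dim_in = w.shape
        head_dim = dim_out // n_heads
        return (
            w.reshape(n_heads, 2, head_dim // 2, dim_in)  # (head, half, pair, in)
             .swapaxes(1, 2)                              # (head, pair, half, in)
             .reshape(dim_out, dim_in)
        )

    # wq is permuted with the query-head count; under GQA, wk has fewer rows
    # and must be permuted with the KV-head count instead -- mixing these up
    # is the kind of conversion bug discussed above.
    # wq = unpermute_rope(wq_hf, n_heads)
    # wk = unpermute_rope(wk_hf, n_kv_heads)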

magician-blue commented 11 months ago

I have updated the model on huggingface.

tairov commented 11 months ago

Hi @magician-blue, so do you mean the tl-chat model on HF is not compatible with this repo anymore?

magician-blue commented 11 months ago

> Hi @magician-blue, so do you mean the tl-chat model on HF is not compatible with this repo anymore?

@tairov We can still run it with our repo.

Change from

mojo llama2.mojo tl-chat.bin \
    -r falcon \
    -z tok_tl-chat.bin \
    -n 256 -t 0 -s 100 -i "<|im_start|>user\nGive me a python function to generate Fibonacci sequence<|im_end|>\n<|im_start|>assistant\n"

to

mojo llama2.mojo tl-chat.bin \
    -r llama \
    -z tok_tl-chat.bin \
    -n 256 -t 0 -s 100 -i "<|im_start|>user\nGive me a python function to generate Fibonacci sequence<|im_end|>\n<|im_start|>assistant\n"

magician-blue commented 11 months ago

If we can convert all HF llama models (they use Falcon-style RoPE) to llama-style RoPE, then we only need to implement one type of RoPE in our repo. This is what llama2.c and llama.cpp are doing.
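
To make the difference concrete, here is a minimal pure-Python sketch (an illustration only, not the repo's Mojo kernels) of how the two RoPE conventions pair up dimensions within one head vector x; the list-based x and theta=10000.0 are assumptions. Permuting the wq/wk rows, as the export script does, maps one pairing onto the other, which is why only the llama-style rotation needs to stay in llama2.mojo.

    import math

    def rope_llama(x, pos, theta=10000.0):
        # Original llama / llama2.c convention: rotate consecutive pairs
        # (x[2i], x[2i+1]) within a head.
        d = len(x)
        out = x[:]
        for i in range(0, d, 2):
            freq = 1.0 / theta ** (i / d)
            c, s = math.cos(pos * freq), math.sin(pos * freq)
            out[i]     = x[i] * c - x[i + 1] * s
            out[i + 1] = x[i] * s + x[i + 1] * c
        return out

    def rope_falcon(x, pos, theta=10000.0):
        # HF / Falcon (neox-style) convention: rotate split halves
        # (x[i], x[i + d/2]) within a head.
        d = len(x)
        half = d // 2
        out = x[:]
        for i in range(half):
            freq = 1.0 / theta ** (2 * i / d)
            c, s = math.cos(pos * freq), math.sin(pos * freq)
            out[i]        = x[i] * c - x[i + half] * s
            out[i + half] = x[i] * s + x[i + half] * c
        return out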

tairov commented 11 months ago

Looks cool. Could you share some details on where this convert.py file came from? I see it has some dependencies. Could we remove it from the PR and keep only a link to the converted model in the README file, so that the overall process is simpler?

magician-blue commented 11 months ago

The original convert script comes from llama2.c, and I modified parts of it to support GQA. I have already made a pull request to llama2.c, but it has not been merged yet. We can wait for a while.

magician-blue commented 11 months ago

The next thing I will do is convert openllama3b (12G RAM), llama2-chat-7b (28G RAM), and vicuna-7b to test my converter and our llama2.mojo. Besides that, I'll look into the tokenizer parts of llama.cpp and llama2.c to find a way to remove the hardcoded part of our tokenizer.

tairov commented 11 months ago

In this case I guess convert.py is not needed in the repo. And it's cool that you have plans to research support for other types of models.

tairov commented 11 months ago

The model can be converted using the script from llama2.c, and for llama2.mojo we have a URL in the README file.

tairov commented 11 months ago

thank you!