tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License

How does it relate to LLAMA? #22

Closed alexcardo closed 12 months ago

alexcardo commented 12 months ago

Ok, I installed Mojo, cloned your repo, and ran the test. It works, congrats! But how does all of this relate to LLAMA? Nothing happened when I tried to run LLAMA 2 itself:

alex@NLDW4-5-20-11:~/ai/llama2.mojo$ mojo llama2.mojo ~/ai/llama.cpp/models/ggml-model-q4_1.bin -s 100 -n 256 -t 0.5 -i "Llama is an animal"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 4238459520
Killed
alex@NLDW4-5-20-11:~/ai/llama2.mojo$ mojo llama2.mojo ~/ai/llama.cpp/models/ggml-model-q4_1.bin -s 100 -n 256 -t 4 -i "Llama is an animal"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 4238459520
Killed

I don't know what -t 0.5 means (I suppose threads); I also tried -t 4, again without results.

The clue here is how to run LLAMA 2 using this new language called Mojo. And if you made a Mojo wrapper for the LLAMA/LLAMA2 models, please provide instructions on how to run the model using this wrapper.

Thank you.

tairov commented 12 months ago

Hi @alexcardo, thanks for your question.

So, basically, I would recommend learning more about tiny LLMs first: https://github.com/karpathy/llama2.c. See the sketch below for how a checkpoint in that style is meant to be run with this repo.
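A minimal sketch of the intended usage, assuming the stories15M.bin checkpoint from Karpathy's tinyllamas on Hugging Face and the tokenizer.bin that ships with llama2.c; the flags follow llama2.c's CLI convention (-s seed, -n steps, -t sampling temperature, -i prompt), so treat the exact paths and flags as assumptions rather than documented behavior:

# download a llama2.c-style checkpoint and its tokenizer (hypothetical paths)
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin
# run inference; note that -t is the sampling temperature, not a thread count
mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Llama is an animal"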

LLAMA is just an architecture based on transformers. llama.cpp is an implementation of LLAMA inference whose goal is to run the models on consumer hardware via quantization, exporting the original LLAMA weights to ggml/gguf, which is a separate format for storing weights. So those files are not compatible with llama2.mojo at the moment, and the quantized models are not compatible with llama2.c either.
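The format difference shows up right in the file header. A rough, hypothetical check, assuming a llama2.c-style checkpoint (which begins with seven little-endian int32 config fields: dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, seq_len) versus a ggml/gguf file (which begins with a magic identifier):

# dump the first 28 bytes as 4-byte integers:
# a llama2.c-style checkpoint shows plausible model dimensions here,
# while a ggml/gguf file shows the bytes of its magic header instead
od -A n -t d4 -N 28 stories15M.bin
od -A n -t d4 -N 28 ~/ai/llama.cpp/models/ggml-model-q4_1.bin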

So essentially, our goal right now is not to implement full-fledged inference of the original or even quantized models. I would say that for now the purpose and interest are purely academic and educational.

See the quote from Karpathy:

Compared to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.