Closed alexcardo closed 12 months ago
Hi @alexcardo, thanks for your question.
So basically, I would recommend learning more about tiny LLMs via llama2.c: https://github.com/karpathy/llama2.c
LLAMA is just an architecture based on transformers.
llama.cpp is an implementation of Llama-architecture inference whose goal is to run the models on consumer hardware via quantization, exporting the original Llama weights to ggml/gguf, which is a separate format for storing weights. So it's not compatible with llama2.mojo at the moment, and the quantized models are not compatible with llama2.c either.
So essentially, our current goal is not to implement full-fledged inference of the original (or even quantized) models. I would say that for now the purpose and interest is purely academic learning.
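To make the compatibility point concrete, here is a sketch of the kind of workflow llama2.mojo targets: a llama2.c-format checkpoint rather than a ggml/gguf one. The checkpoint URL is the tinyllamas location from karpathy's llama2.c README, and the flags mirror the ones used later in this thread; treat both as assumptions rather than verified instructions.

```shell
# Fetch a small llama2.c-format checkpoint (tinyllamas on Hugging Face;
# URL assumed from the llama2.c README)
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin

# Run llama2.mojo on it, using the same flags shown in this thread
mojo llama2.mojo stories15M.bin -s 100 -n 256 -t 0.5 -i "Llama is an animal"
```

The key difference from the failing command below is the checkpoint format: `stories15M.bin` is a raw float32 export in the llama2.c layout, not a quantized ggml/gguf file.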
See the quote from Karpathy:
> Compared to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.
Ok, I installed Mojo, cloned your repo, and ran the test. It works, congrats! But how does all of this relate to Llama? Nothing happened when I tried to run Llama 2 itself:
```
alex@NLDW4-5-20-11:~/ai/llama2.mojo$ mojo llama2.mojo ~/ai/llama.cpp/models/ggml-model-q4_1.bin -s 100 -n 256 -t 0.5 -i "Llama is an animal"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 4238459520
Killed
alex@NLDW4-5-20-11:~/ai/llama2.mojo$ mojo llama2.mojo ~/ai/llama.cpp/models/ggml-model-q4_1.bin -s 100 -n 256 -t 4 -i "Llama is an animal"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 4238459520
Killed
```
I don't know what -t 0.5 means (I suppose threads); I also tried -t 4, again without results.
The clue here is how to run Llama 2 using this new language called Mojo. And if you made a Mojo wrapper for the Llama/Llama 2 models, please provide instructions on how to run the model using this wrapper.
Thank you.