tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License

Question about models #28

Closed Ezyweb-uk closed 11 months ago

Ezyweb-uk commented 11 months ago

I found this interesting project via the 'AI Anywhere' channel on YouTube. I've installed Modular and Mojo, and successfully ran your test on an underpowered mini PC with only a 1.5GHz 4-core Intel Celeron CPU, running Ubuntu 20.04.6, and it achieved 32.5 tok/s.

I'm an LLM newbie, so my questions may appear stupid! Can this project be run with other models?

I tried the following: mojo llama2.mojo /home/ezyweb/Public/chatpdf1/models/llama-2-7b-chat.Q4_K_M.gguf -s 100 -n 256 -t 0.5 -i "What is Llama 2"

And got the result:

num hardware threads: 4
SIMD vector width: 8
checkpoint size: 4081004224 [ 3891 MB ]
Killed

Is that likely an under-resourced hardware issue, or is the project not compatible with .gguf models?
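
As a side note, one rough way to check whether a model file is GGUF is to look at its first four bytes, which for GGUF files are the ASCII magic "GGUF". This is just a Python sketch based on llama.cpp's file format, not anything from llama2.mojo, and the path is the one from my command above:

```python
# Rough format check: GGUF files written by llama.cpp start with the ASCII
# magic bytes b"GGUF". llama2.c-style .bin checkpoints have no such magic;
# they begin directly with raw int32 config fields.
path = "/home/ezyweb/Public/chatpdf1/models/llama-2-7b-chat.Q4_K_M.gguf"

with open(path, "rb") as f:
    magic = f.read(4)

if magic == b"GGUF":
    print("GGUF file (llama.cpp format)")
else:
    print("not GGUF; possibly a llama2.c-style checkpoint, first bytes:", magic)
```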

Ezyweb-uk commented 11 months ago

From your answer here, I think the answer is that it doesn't work with GGUF models.

tairov commented 11 months ago

Hi @Ezyweb-uk, thanks for your question. You're correct; at the moment llama2.mojo supports TinyLlama models based on GQA. In issue #27 we're discussing some discrepancies and changes in the tokenizer so that it can run TinyLlama-1.1B.

I think GGUF models would have to be converted to the llama2.c format somehow. I haven't had time to explore this topic yet. I saw that llama.cpp has a converter that transforms llama2.c models into GGUF; maybe it can be used for the reverse conversion.
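
For anyone who wants to attempt that conversion: the target would be the llama2.c binary layout, i.e. a header of seven int32 config values followed by the fp32 weight tensors, as in karpathy's llama2.c export script. Here is a minimal sketch of writing such a header; the function name and the TinyLlama-1.1B-like config values are just for illustration, and the weight serialization itself is omitted:

```python
import struct

def write_llama2c_header(f, dim, hidden_dim, n_layers, n_heads,
                         n_kv_heads, vocab_size, seq_len):
    # llama2.c checkpoint layout: 7 little-endian int32 config fields,
    # immediately followed by the fp32 weight tensors in a fixed order.
    f.write(struct.pack("<7i", dim, hidden_dim, n_layers, n_heads,
                        n_kv_heads, vocab_size, seq_len))

with open("converted_model.bin", "wb") as f:
    # Illustrative TinyLlama-1.1B-shaped config; a real converter would read
    # these values (and the weights) out of the source GGUF file.
    write_llama2c_header(f, dim=2048, hidden_dim=5632, n_layers=22,
                         n_heads=32, n_kv_heads=4, vocab_size=32000,
                         seq_len=2048)
    # ... then write each weight tensor as float32, in llama2.c's order
```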