Closed Ezyweb-uk closed 11 months ago
From your answer here I think the answer is that it doesn't work with gguf models.
Hi @Ezyweb-uk , thanks for your question.
You're correct; at the moment llama2.mojo supports TinyLlama models based on GQA.
In issue #27 we're discussing some discrepancies and changes in the tokenizer so that it can run TinyLlama-1.1B.
I think GGUF models would need to be converted to the llama2.c format somehow; I haven't had time to explore this topic yet. I saw that llama.cpp has a converter that transforms a llama2.c model into GGUF; maybe it could be used for the reverse conversion.
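As a side note, you can tell the two formats apart before trying to load a file: GGUF files start with the 4-byte magic `b"GGUF"`, while legacy llama2.c `.bin` checkpoints begin directly with the integer config header and have no magic string. A minimal sketch (the function name `is_gguf` is just for illustration):

```python
def is_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF magic bytes.

    GGUF files begin with b"GGUF"; legacy llama2.c checkpoints do not,
    so this is a quick sanity check before handing a file to llama2.mojo.
    """
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

This won't convert anything, but it avoids feeding an unsupported GGUF file to a loader that expects the llama2.c layout.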
I found this interesting project via the 'AI Anywhere' channel on YouTube. I've installed Modular and Mojo, and successfully ran your test on an underpowered mini computer with only a 1.5GHz 4-core Intel Celeron CPU running Ubuntu 20.04.6, and this achieved 32.5 tok/s.
I'm an LLM newbie so my questions may appear stupid!! Can this project be run with other models?
I tried the following:
mojo llama2.mojo /home/ezyweb/Public/chatpdf1/models/llama-2-7b-chat.Q4_K_M.gguf -s 100 -n 256 -t 0.5 -i "What is Llama 2"
And got the result:
num hardware threads: 4
SIMD vector width: 8
checkpoint size: 4081004224 [ 3891 MB ]
Killed
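For context, the reported checkpoint size converts to roughly 3.8 GB, which presumably has to fit in RAM alongside the runtime state on this small machine:

```python
# Convert the checkpoint size reported in the log to MB and GB.
checkpoint_bytes = 4081004224

mb = checkpoint_bytes // (1024 ** 2)   # integer MB, as shown in the log
gb = checkpoint_bytes / (1024 ** 3)    # fractional GB

print(f"{mb} MB, {gb:.1f} GB")  # prints: 3891 MB, 3.8 GB
```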
Is that likely an under-resourced hardware issue, or is the project not compatible with .gguf models?