jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

Running on CPU using llama.cpp #53

Closed: galleon closed this issue 10 months ago

galleon commented 11 months ago

Hi,

Posting here even though this is not related to the code itself.

Context: I have tried to use Chat-v0.3 directly from the checkpoints [code](); I just added eos_token_id=tokenizer.eos_token_id to the example to make it finish as expected.

I obtain an answer that I consider OK, even though it is made of three sentences (I have not looked into the details of how you generated the chat version. Any info available?)

Then I decided to move to llama.cpp making sure to update my version to get the fix for the issue you recently ran into.

I did generate the F32 version (which should be the same as the checkpoint). Here is the result I got with this command: ./main -m ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.gguf -p "Please answer in one sentence to this question: What is a Large Language Model?" --n-gpu-layers 0 --temp 0 --escape --seed 42 --color --n-predict -2

Do you know why it continues to generate after the EOS?

Then I moved to the Q5_K quantized version and got the following output

Completely AWOL, which makes me think I have done something wrong. Has anyone had similar issues?

Green-Sky commented 11 months ago

This looks about right. The input is expected to be formatted like:

<|im_start|>user
your prompt here<|im_end|>
<|im_start|>assistant

it also says "The dataset used is OpenAssistant/oasst_top1_2023-08-25 following the chatml format." in the model card

galleon commented 11 months ago

Maybe there is a misunderstanding on my side, but does the gguf format contain information about the prompt format, or should I provide the well-formed prompt myself in --prompt (which I did not do above)?

I would also expect llama.cpp to stop after (and remove?) the first generated <|im_end|>. Shouldn't it?

Finally, any reason why the Q5_K model is answering in Spanish? :-)

Thanks

Green-Sky commented 11 months ago

no, gguf does not contain any info on prompt creation (at least not yet).

with llama.cpp you can use the -i option together with --in-prefix "<|im_start|>user\n" and --in-suffix "<|im_end|>\n<|im_start|>assistant\n", as well as -r "<|im_start|>user" (--reverse-prompt) so it gives control back when it predicts another user turn.
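
e.g. something like this (untested sketch; model path and the --escape/--color flags copied from your earlier commands):

# interactive chat: everything you type gets wrapped in a ChatML user turn,
# and the reverse prompt hands control back when the model predicts a new user turn
./main -m ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.gguf \
  -i --escape --color \
  --in-prefix "<|im_start|>user\n" \
  --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
  --reverse-prompt "<|im_start|>user"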

VatsaDev commented 11 months ago

@galleon, you have to provide the prompt in that format yourself. I'm not sure if the problem still persists, but in #20 I mentioned how the v2 model can produce Spanish gibberish if prompted wrong.

galleon commented 11 months ago

@VatsaDev in my case the Spanish isn’t completely gibberish. Using Google translate gives something related to my prompt ;-)

VatsaDev commented 11 months ago

Ah, my experience is based on v2, and v0.3 is better. Also, Google Translate does its best to make the text mean something; it warps French like crazy!

but anyway @galleon, did the old format help?

galleon commented 11 months ago

@VatsaDev yes, running this command: ./main -m ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.Q5_K.gguf -p "<|im_start|>user\nProvide a one-sentence definition of a Large Language Model.<|im_end|>\n<|im_start|>assistant\n" --n-gpu-layers 0 --seed 42 --escape --color

led to correct answers for both F32 and Q5_K (F32 is better though - no surprise there at this stage)

However, when, as suggested by @Green-Sky, I try to move to something like: ./main -m ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.F32.gguf -i --in-prefix "<|im_start|>user\n" --in-suffix "<|im_end|>\n<|im_start|>assistant\n" --n-gpu-layers 0 --seed 42 --escape --color

I am facing this error:

GGML_ASSERT: llama.cpp:7756: false
[1]    48058 abort      ./main --model ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.F32.gguf

still investigating ...

Green-Sky commented 11 months ago

@galleon are you using a GPU-enabled build? F32 is not supported on the GPU; try converting to f16 or q8_0.
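
e.g. something like this (rough sketch with the stock llama.cpp tools; output file names are just placeholders):

# option 1: re-quantize the existing F32 GGUF to q8_0 with the quantize tool
./quantize ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.F32.gguf \
           ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.Q8_0.gguf q8_0

# option 2: regenerate an f16 GGUF directly from the original HF checkpoint
python convert.py path/to/TinyLlama-1.1B-Chat-v0.3 --outtype f16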

galleon commented 11 months ago

@Green-Sky I am working on an old 2020 13-inch MBP (Quad-Core Intel Core i7 with Intel Iris Plus Graphics 1536 MB) and it looks like Metal is not supported... but I have now rebuilt llama.cpp with make LLAMA_NO_METAL=1, and even though I keep the --n-gpu-layers option, it is ignored, as stated at the start of the llama.cpp output:

➜  llama.cpp git:(master) ✗ ./main -m ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.F32.gguf -i --in-prefix "<|im_start|>user\n" --in-suffix "<|im_end|>\n<|im_start|>assistant\n" --n-gpu-layers 0 --seed 42 --escape --color -t 3                                                          [🐍 llm_in_action]
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 1317 (79f34ab)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for x86_64-apple-darwin22.6.0
main: seed  = 42
Green-Sky commented 11 months ago

"old 2020"

lmao

"and it looks that Metal is not supported"

hm, yea, might be the case. the new unified memory m1/m2 chips are sooo much faster too.

keep in mind that f32 inference in llama.cpp is mostly untested, because most well-known models don't gain anything from higher precision, if they even exist in precision higher than f16, and experimentally we have seen that the difference between f16 and q8_0 is negligible.
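
if you want to verify that on your own files, the perplexity example can compare them, e.g. (sketch; the eval text file is just an example, such as the wikitext-2 test set):

# compare perplexity of two quantizations on the same evaluation text;
# near-identical scores mean the extra precision isn't buying anything
./perplexity -m TinyLlama-1.1B-Chat-v0.3.F16.gguf -f wiki.test.raw
./perplexity -m TinyLlama-1.1B-Chat-v0.3.Q8_0.gguf -f wiki.test.raw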