Closed galleon closed 10 months ago
This looks about right. The input is expected to be formatted like:
<|im_start|>user
your prompt here<|im_end|>
<|im_start|>assistant
it also says "The dataset used is OpenAssistant/oasst_top1_2023-08-25 following the chatml format." in the model card
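For reference, the ChatML wrapping described above can be sketched with a few string operations. This is just an illustration of the format, not an official helper from the model card:

```python
def chatml_prompt(user_message: str) -> str:
    """Wrap a user message in the ChatML format quoted above,
    leaving the assistant turn open for generation."""
    return (
        "<|im_start|>user\n"
        + user_message
        + "<|im_end|>\n"
        + "<|im_start|>assistant\n"
    )

prompt = chatml_prompt("your prompt here")
print(prompt)
```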
Maybe there is a misunderstanding on my side, but does the GGUF format contain information about the prompt format, or should I provide the well-formed prompt myself in --prompt (which I did not do above)?
I would also expect llama.cpp to stop after (and remove?) the first generated <|im_end|>. Shouldn't it?
Finally, any reason why the Q5_K model is answering in Spanish? :-)
Thanks
No, GGUF does not contain any info on prompt creation (at least not yet).
With llama.cpp you can use the -i option together with
--in-prefix "<|im_start|>user\n" --in-suffix "<|im_end|>\n<|im_start|>assistant\n"
as well as
-r "<|im_start|>user"
so it gives control back when the model predicts another user turn.
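Note that the literal \n in those flag values only becomes a real newline once escape processing kicks in (that is what --escape does in the commands in this thread; whether it also applies to the prefix/suffix values may depend on the llama.cpp version). The same expansion can be sketched in plain shell with printf '%b':

```shell
# Sketch: printf '%b' expands backslash escapes the way llama.cpp's
# --escape option does for the prompt text.
prefix='<|im_start|>user\n'
suffix='<|im_end|>\n<|im_start|>assistant\n'
printf '%b' "$prefix"   # prints the prefix with a real trailing newline
printf '%b' "$suffix"
```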
@galleon, you have to enter the format yourself. I'm not sure if the problem still persists, but in #20 I mentioned how the v2 model can produce Spanish gibberish if prompted wrong.
@VatsaDev in my case the Spanish isn't completely gibberish. Running it through Google Translate gives something related to my prompt ;-)
Ah, my experience is based on version two, and V3 is better. Google Translate also does its best to make the text mean something; it warps French like crazy!
but anyway @galleon, did the old format help?
@VatsaDev yes having this command:
./main -m ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.Q5_K.gguf -p "<|im_start|>user\nProvide a one-sentence definition of a Large Language Model.<|im_end|>\n<|im_start|>assistant\n" --n-gpu-layers 0 --seed 42 --escape --color
led to correct answers for both F32 and Q5_K (F32 is better, though; no surprise there at this stage).
When, as suggested by @Green-Sky, I try to move to something like:
./main -m ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.F32.gguf -i --in-prefix "<|im_start|>user\n" --in-suffix "<|im_end|>\n<|im_start|>assistant\n" --n-gpu-layers 0 --seed 42 --escape --color
I am facing this error:
GGML_ASSERT: llama.cpp:7756: false
[1] 48058 abort ./main --model ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.F32.gguf
still investigating ...
@galleon are you using a GPU-enabled build? Because F32 is not supported on GPU. Try converting to f16 or q8_0.
@Green-Sky I am working on an old 2020 13-inch MBP (Quad-Core Intel Core i7 with Intel Iris Plus Graphics 1536 MB) and it looks like Metal is not supported ... but I have now rebuilt llama.cpp with make LLAMA_NO_METAL=1
and (even though I keep it) the --n-gpu-layers
option is ignored, as stated at the start of llama.cpp's output:
➜ llama.cpp git:(master) ✗ ./main -m ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.F32.gguf -i --in-prefix "<|im_start|>user\n" --in-suffix "<|im_end|>\n<|im_start|>assistant\n" --n-gpu-layers 0 --seed 42 --escape --color -t 3
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
Log start
main: build = 1317 (79f34ab)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for x86_64-apple-darwin22.6.0
main: seed = 42
Re "old 2020": lmao
Re "and it looks that Metal is not supported": hm, yeah, might be the case. The new unified-memory M1/M2 chips are so much faster, too.
Keep in mind that F32 inference in llama.cpp is mostly untested, because most well-known models don't gain anything from higher precision, if they even exist in higher precision (than f16). Experimentally, we have seen that the difference between f16 and q8_0 is negligible.
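The precision point can be illustrated with the standard library alone: struct can round-trip a value through IEEE 754 half precision (format code "e"), and the relative error stays tiny for weight-sized magnitudes. This is only a toy illustration of why f32 rarely helps, not a model of llama.cpp's actual q8_0 quantization scheme:

```python
import struct

def to_f16_and_back(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

# A few made-up weight-sized values for illustration.
weights = [0.0123, -0.5, 0.999, -1.75, 0.0007]
errors = [abs(w - to_f16_and_back(w)) / abs(w) for w in weights]
print(max(errors))  # relative error well under 1e-3
```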
Hi,
Posting here even though this is not related to the code itself.
Context: I have tried to use Chat-v0.3 directly from the checkpoints [code]() and just added eos_token_id=tokenizer.eos_token_id to the example to make it finish as expected. I obtain an answer that I consider OK, even though it is made of three sentences. (I have not looked into the details of how you generated the chat version. Is any info available?)
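Conceptually, passing eos_token_id to generation just tells the sampler to stop as soon as that token appears. The effect on a token sequence can be sketched as follows (the token IDs here are hypothetical, not the real TinyLlama vocabulary):

```python
def truncate_at_eos(token_ids: list[int], eos_token_id: int) -> list[int]:
    """Drop everything from the first EOS token onward, mimicking
    what stopping generation at eos_token_id achieves."""
    if eos_token_id in token_ids:
        return token_ids[: token_ids.index(eos_token_id)]
    return token_ids

# Hypothetical IDs: 2 stands in for the EOS token here.
print(truncate_at_eos([15, 8, 42, 2, 99, 7], eos_token_id=2))  # [15, 8, 42]
```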
Then I decided to move to llama.cpp, making sure to update my version to get the fix for the issue you recently ran into. I generated the F32 version (which should be the same as the checkpoint). Here is the result I got with this CLI:
./main -m ~/.cache/llama.cpp/models/TinyLlama-1.1B-Chat-v0.3.gguf -p "Please answer in one sentence to this question: What is a Large Language Model?" --n-gpu-layers 0 --temp 0 --escape --seed 42 --color --n-predict -2
Do you know why it continues to generate after the EOS?
Then I moved to the Q5_K quantized version and got the following output:
Completely AWOL, which makes me think I have done something wrong. Has anyone had similar issues?