flatsiedatsie opened this issue 1 month ago
I agree it may be because the outdated version of Wllama bundles an older version of llama.cpp, since Llama 1B is available to try on https://github.ngxson.com/wllama/examples/main/dist/ and it works fine there. I've also tried the 3B model, and it's all good. Let us know if the update solved the issue!
Odd, it didn't solve it.
I tried re-downloading the model itself, but that didn't help.
Then I tried Firefox for comparison, and actually noticed the same error.
I'm attempting a non-chunked version of the model next.
https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked -> https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF/resolve/main/llama-3.2-1b-instruct-q4_k_m.gguf
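For reference, by "non-chunked version" I mean pointing Wllama directly at that single-file GGUF as a sanity check. A rough sketch of what that looks like (the constructor arguments and nPredict value are placeholders, not the exact code from my app):
import { Wllama } from "@wllama/wllama";

const wllama = new Wllama(/* wasm asset paths, as in the wllama examples */);

// Load the single-file GGUF instead of the chunked repo.
await wllama.loadModelFromUrl(
  "https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF/resolve/main/llama-3.2-1b-instruct-q4_k_m.gguf"
);

// Quick generation test to see whether the model produces any output.
const out = await wllama.createCompletion("Hello", { nPredict: 16 });
console.log(out);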
Bingo, that works.
I re-chunked the 1B using the very latest version of llama.cpp.
Now it loads, but only outputs a single word before giving this error:
(Looking back, this error may have just been my code trying to unload Wllama after inference was complete, and failing.)
Could you try these splits and confirm if they work? (Those are the ones I'm using without issues on Wllama v1.16.2)
We first need to find out if the problem is with the sampling config being passed, e.g.:
{ temp: 0 }
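(For context, a config like that would normally go into the sampling options of a completion call; a minimal sketch, assuming it ends up in createCompletion's sampling field:)
const out = await wllama.createCompletion("test prompt", {
  nPredict: 64,          // illustrative token budget
  sampling: { temp: 0 }, // temp 0 = deterministic, greedy-style sampling
});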
Here is the chunked model that only outputs one word, by the way: https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked/resolve/main/llama-3_2_it_1b_q4_0-00001-of-00004.gguf
Strange. Now I don't get any output.
I don't think it normally does that looping action. Could it be loading each chunk as if it were the whole model?
Setting the sampling to minimal worked!
I thought that allow_offline might be the issue. But after re-enabling it last, everything still works 0_0.
I keep 'allow_offline' enabled all the time. Is that a bad idea?
Not at all! I leave it always enabled too! :D
Setting the sampling to minimal worked!
Interesting! Did it work with https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked/resolve/main/llama-3_2_it_1b_q4_0-00001-of-00004.gguf ?
If so, we can conclude it's something specific to the config passed to Wllama? (If that's the case, have you found the specific config combination that caused the issue?)
I used to have these enabled all the time too, but I've removed them now.
//model_settings['n_seq_max'] = 1;
//model_settings['n_batch'] = 1024; //2048
But re-enabling them as a test had no (negative) effect.
I still have this setting in place:
model_settings['embeddings'] = false;
Should I remove that?
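For reference, these model_settings end up as the config object I pass when loading the model; a rough sketch (assuming loadModelFromUrl accepts these keys in its config, with modelUrl as a placeholder):
const model_settings: Record<string, any> = {};
// model_settings['n_seq_max'] = 1;   // currently commented out, as above
// model_settings['n_batch'] = 1024;  // previously 2048
model_settings['embeddings'] = false; // the setting still in place

// The settings object is passed as the second argument when loading the model.
await wllama.loadModelFromUrl(modelUrl, model_settings);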
I have a vague notion that .gguf files have template information embedded in them?
Currently I use Transformers.js to turn a conversation dictionary into a templated string, and then feed that into the AI model.
Is there a way that I can skip that step and feed a dictionary of a conversation into Wllama?
[
{
"role": "user",
"content": "How many R's are there in the word strawberry?"
},
{
"role": "assistant",
"content": "There are 2 R's in the word \"strawberry\"."
}
]
Aha! There is a function to get the Jinja template from the GGUF, and then Wllama uses a dependency on @huggingface/jinja to apply that template.
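(Concretely, something like this should print the raw Jinja template stored in the GGUF metadata; a sketch, assuming the model is already loaded and that getChatTemplate returns a nullish value when the model has none:)
// Read the chat template embedded in the GGUF metadata, if any.
const tmpl = wllama.getChatTemplate();
console.log(tmpl ?? "no chat template found in this GGUF");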
Does this error also happen when configuring Wllama with n_threads: 1? (forcing it to single-thread)
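(Roughly like this, assuming n_threads is accepted as part of the load-time config:)
// Force single-threaded inference while debugging.
await wllama.loadModelFromUrl(modelUrl, { n_threads: 1 });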
I have a vague notion that .gguf files have template information embedded in them?
Currently I use Transformers.js to turn a conversation dictionary into a templated string, and then feed that into the AI model.
Is there a way that I can skip that step and feed a dictionary of a conversation into Wllama?
[ { "role": "user", "content": "How many R's are there in the word strawberry?" }, { "role": "assistant", "content": "There are 2 R's in the word \"strawberry\"." } ]
There is, by using the @huggingface/jinja package (the same as Transformers.js uses).
Here's the same logic used in https://github.ngxson.com/wllama/examples/main/dist/:
import { Template } from "@huggingface/jinja";
import { Wllama } from "@wllama/wllama";

// Shape of a chat message, matching the conversation arrays above.
type Message = { role: string; content: string };

const wllama = new Wllama(/*...*/);
await wllama.loadModelFromUrl(/*...*/);
export const formatChat = async (wllama: Wllama, messages: Message[]) => {
const defaultChatTemplate =
"{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}";
const template = new Template(
wllama.getChatTemplate() ?? defaultChatTemplate,
);
return template.render({
messages,
bos_token: await wllama.detokenize([wllama.getBOS()]),
eos_token: await wllama.detokenize([wllama.getEOS()]),
add_generation_prompt: true,
});
};
const messages = [
{
"role": "user",
"content": "Hi!"
},
{
"role": "assistant",
"content": "Hello! How may I help you today?"
},
{
"role": "user",
"content": "How many R's are there in the word strawberry?"
},
]
const prompt = await formatChat(wllama, messages);
// <|im_start|>user
// Hi!<|im_end|>
// <|im_start|>assistant
// Hello! How may I help you today?<|im_end|>
// <|im_start|>user
// How many R's are there in the word strawberry?<|im_end|>
// <|im_start|>assistant
Oh wow, diving into your info I realized there is even an abstraction layer above Transformers.js. (Wait, no, it's just to use the API.)
I've implemented your templating approach, thank you! Much simpler than creating an entire Transformers.js instance.
This might be related to https://github.com/unslothai/unsloth/issues/1065 and https://github.com/unslothai/unsloth/issues/1062 - temporary fixes are provided for Unsloth finetuners, and, as confirmed with the Hugging Face team at https://github.com/ggerganov/llama.cpp/issues/9692, it's tokenizers causing the issues.
This problem is reported on the upstream repo: https://github.com/ggerganov/llama.cpp/issues/9692
I noticed this error when loading the Llama 1B and 3B models.
I'm updating Wllama now; hopefully that fixes it.