ngxson / wllama

WebAssembly binding for llama.cpp - Enabling in-browser LLM inference
https://huggingface.co/spaces/ngxson/wllama
MIT License

cannot find tokenizer merges in model file #120

Open · flatsiedatsie opened this issue 3 weeks ago

flatsiedatsie commented 3 weeks ago

Noticed this error loading the Llama 1B and 3B models.

[Screenshot 2024-09-27 at 22 57 33]

I'm updating Wllama now, hopefully that fixes it.

felladrin commented 3 weeks ago

I agree it may be because the outdated Wllama version bundles an older version of llama.cpp: Llama 1B is available to try on https://github.ngxson.com/wllama/examples/main/dist/ and works fine there. I've also tried the 3B model, and it's all good. Let us know if the update solves the issue!

flatsiedatsie commented 3 weeks ago

Odd, it didn't solve it.

I tried re-downloading the model itself, but that didn't help.

Then I tried Firefox for comparison, and actually noticed the same error.

[Screenshot 2024-09-28 at 00 58 45]

I'm attempting a non-chunked version of the model next.

https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked -> https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF/resolve/main/llama-3.2-1b-instruct-q4_k_m.gguf

flatsiedatsie commented 3 weeks ago

Bingo, the non-chunked version works.

flatsiedatsie commented 3 weeks ago

I re-chunked the 1B using the very latest version of llama.cpp.

Now it loads, but only outputs a single word before giving this error:

[Screenshot 2024-09-28 at 09 51 14]

(Looking back, this error may have just been my code trying to unload Wllama after inference was complete, and failing.)

felladrin commented 3 weeks ago

Could you try these splits and confirm if they work? (Those are the ones I'm using without issues on Wllama v1.16.2)

First, we need to find out whether the problem is with:

  1. The split files
  2. The wllama lib
  3. Your code around wllama (a config or sampling setting may be causing the problem, so first try setting the sampling to { temp: 0 }; see the sketch below)
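
For point 3, a minimal sketch of that test (assumptions: WASM_PATHS stands in for your existing wllama wasm asset mapping, the model URL is the non-chunked GGUF linked earlier in this thread, and createCompletion receives the sampling config in its options):

import { Wllama } from "@wllama/wllama";

// WASM_PATHS is a placeholder for your existing wllama wasm asset paths.
const wllama = new Wllama(WASM_PATHS);

// Non-chunked 1B model URL from earlier in this thread.
await wllama.loadModelFromUrl(
  "https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF/resolve/main/llama-3.2-1b-instruct-q4_k_m.gguf"
);

// temp: 0 means greedy decoding, i.e. the minimal sampling suggested above.
const output = await wllama.createCompletion("How many R's are in the word strawberry?", {
  nPredict: 64,
  sampling: { temp: 0 },
});
console.log(output);
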
flatsiedatsie commented 3 weeks ago

Here is the chunked model that only outputs one word by the way: https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked/resolve/main/llama-3_2_it_1b_q4_0-00001-of-00004.gguf

flatsiedatsie commented 3 weeks ago

Strange. Now I don't get any output.

[Screenshot 2024-09-28 at 17 49 40]
[Screenshot 2024-09-28 at 17 48 22]

I don't think it normally does that looping action. Could it be loading each chunk as if it were the complete model?

flatsiedatsie commented 3 weeks ago

Setting the sampling to minimal worked!

I thought that allow_offline might be the issue. But after re-enabling it last, everything still works 0_0.

I keep 'allow_offline' enabled all the time. Is that a bad idea?

felladrin commented 3 weeks ago

> I thought that allow_offline might be the issue. But after re-enabling it last, everything still works 0_0.
>
> I keep 'allow_offline' enabled all the time. Is that a bad idea?

Not at all! I leave it always enabled too! :D

> Setting the sampling to minimal worked!

Interesting! Did it work with https://huggingface.co/BoscoTheDog/llama_3_2_it_1b_q4_k_m_chunked/resolve/main/llama-3_2_it_1b_q4_0-00001-of-00004.gguf ?

If so, can we conclude it's something specific to the config passed to Wllama? Have you found the exact config combination that caused the issue?

flatsiedatsie commented 3 weeks ago

I used to have these settings enabled all the time too, but I've removed them now:

//model_settings['n_seq_max'] = 1;
//model_settings['n_batch'] = 1024; //2048

But re-enabling them as a test had no (negative) effect.

I still have this enabled:

model_settings['embeddings'] = false;

Should I remove that?
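
For reference, a sketch of how those settings are passed, assuming they go in as the second argument to loadModelFromUrl (MODEL_URL and the wllama instance are placeholders for your existing setup; only the options mentioned above are shown):

// Sketch: the model_settings object discussed above, passed at load time.
const model_settings = {
  // n_seq_max: 1,   // previously enabled, now removed
  // n_batch: 1024,  // previously enabled, now removed
  embeddings: false, // the option asked about above
};

await wllama.loadModelFromUrl(MODEL_URL, model_settings);
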

flatsiedatsie commented 3 weeks ago

I got the vague notion that .gguf files have template information within them?

Currently I use Transformers.js to turn a conversation dictionary into a templated string, and then feed that into the AI model.

Is there a way that I can skip that step and feed a dictionary of a conversation into Wllama?

[
    {
        "role": "user",
        "content": "How many R's are there in the word strawberry?"
    },
    {
        "role": "assistant",
        "content": "There are 2 R's in the word \"strawberry\"."
    }
]

Aha! There is a function to get the Jinja template from the GGUF, and Wllama uses the @huggingface/jinja dependency to apply that template.

flatsiedatsie commented 3 weeks ago

[Screenshot 2024-09-29 at 09 52 22]

felladrin commented 3 weeks ago

> [Screenshot 2024-09-29 at 09 52 22]

Does this error also happen when configuring Wllama with n_threads: 1? (forcing it to single-thread)
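
A one-line sketch of that test, assuming the same wllama instance and MODEL_URL placeholder as above, with n_threads passed alongside the other load-time options:

// Force single-threaded inference to see whether the error is thread-related.
await wllama.loadModelFromUrl(MODEL_URL, { n_threads: 1 });
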

felladrin commented 3 weeks ago

> I got the vague notion that .gguf files have template information within them?
>
> Currently I use Transformers.js to turn a conversation dictionary into a templated string, and then feed that into the AI model.
>
> Is there a way that I can skip that step and feed a dictionary of a conversation into Wllama?
>
> [
>     {
>         "role": "user",
>         "content": "How many R's are there in the word strawberry?"
>     },
>     {
>         "role": "assistant",
>         "content": "There are 2 R's in the word \"strawberry\"."
>     }
> ]

There is, by using the @huggingface/jinja package (the same one Transformers.js uses).

Here's the same logic used in https://github.ngxson.com/wllama/examples/main/dist/:

import { Template } from "@huggingface/jinja";
import { Wllama } from "@wllama/wllama";

// Minimal message shape used in this example.
type Message = { role: string; content: string };

const wllama = new Wllama(/*...*/);

await wllama.loadModelFromUrl(/*...*/);

export const formatChat = async (wllama: Wllama, messages: Message[]) => {
  const defaultChatTemplate =
    "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}";

  const template = new Template(
    wllama.getChatTemplate() ?? defaultChatTemplate,
  );

  return template.render({
    messages,
    bos_token: await wllama.detokenize([wllama.getBOS()]),
    eos_token: await wllama.detokenize([wllama.getEOS()]),
    add_generation_prompt: true,
  });
};

const messages = [
    {
        "role": "user",
        "content": "Hi!"
    },
    {
        "role": "assistant",
        "content": "Hello! How may I help you today?"
    },
    {
        "role": "user",
        "content": "How many R's are there in the word strawberry?"
    },
]

const prompt = await formatChat(wllama, messages);
// <|im_start|>user
// Hi!<|im_end|>
// <|im_start|>assistant
// Hello! How may I help you today?<|im_end|>
// <|im_start|>user
// How many R's are there in the word strawberry?<|im_end|>
// <|im_start|>assistant

flatsiedatsie commented 3 weeks ago

Oh wow, diving into your info I realized there is even an abstraction layer above Transformers.js.

(Wait, no, it's just to use the API.)

flatsiedatsie commented 3 weeks ago

I've implemented your templating approach, thank you! Much simpler than creating an entire Transformers.js instance.

danielhanchen commented 3 weeks ago

This might be related to https://github.com/unslothai/unsloth/issues/1065 and https://github.com/unslothai/unsloth/issues/1062 - temporary fixes are provided there for Unsloth finetuners, and I can confirm with the Hugging Face team at https://github.com/ggerganov/llama.cpp/issues/9692 that it's tokenizers causing the issues.

ngxson commented 3 weeks ago

This problem is reported on the upstream repo: https://github.com/ggerganov/llama.cpp/issues/9692