ngxson / wllama

WebAssembly binding for llama.cpp - Enabling on-browser LLM inference
https://huggingface.co/spaces/ngxson/wllama
MIT License

BitNet support #69

Closed: flatsiedatsie closed this issue 4 months ago

flatsiedatsie commented 5 months ago

I'm curious: if llama.cpp adds BitNet support (which looks imminent), will Wllama be able to run BitNet models too simply by upgrading the llama.cpp version? Or will there be more things that need to happen?

https://github.com/ggerganov/llama.cpp/pull/7931

ngxson commented 5 months ago

It totally depends on the llama.cpp side, but I suppose it will be transparent to us (much like how support for phi-3 was added).

flatsiedatsie commented 5 months ago

Any second now... ggerganov has already signed off on it.

I was thinking: it could be cool to immediately create a demo showcasing BitNet in the browser via Wllama. I could create one, but so far I haven't been able to compile Wllama myself. I'm trying to get that working again now, though.

flatsiedatsie commented 5 months ago

I've managed to get compilation working (I reinstalled Docker).

I then tried to use the very latest version of llama.cpp with BitNet support. Unfortunately, I ran into an error:

[screenshot: 2024-06-23 17:45:12]

and another attempt:

[screenshot: 2024-06-23 18:23:11]

and

[screenshot: 2024-06-23 17:45:55]

flatsiedatsie commented 5 months ago

Got a bit further. This model works fine in llama.cpp, but shows some strange behaviour in wllama. I'll try fiddling with some settings.

[screenshot: 2024-06-23 22:39:06]

flatsiedatsie commented 5 months ago

I've put a quick (still broken) demo here: https://github.com/flatsiedatsie/bitnet_in_the_browser/
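
For context, the core of the demo boils down to something like this (a minimal sketch; the model URL and WASM config paths are placeholders, and the loadModelFromUrl / createCompletion calls are assumed to match the wllama README):

```ts
import { Wllama } from '@wllama/wllama';

// Placeholder: paths to the wllama WASM/JS build files,
// filled in as described in the wllama README.
const WASM_CONFIG_PATHS = {};

// Placeholder: URL of a BitNet GGUF conversion hosted on Hugging Face.
const MODEL_URL = 'https://huggingface.co/.../bitnet_b1_58-xl_q8_0.gguf';

async function runDemo(): Promise<void> {
  const wllama = new Wllama(WASM_CONFIG_PATHS);
  await wllama.loadModelFromUrl(MODEL_URL);

  // Base BitNet models are not chat-tuned, so just let the model
  // continue a story prompt instead of expecting an assistant reply.
  const output = await wllama.createCompletion('Once upon a time', {
    nPredict: 64,
    sampling: { temp: 0.7, top_k: 40, top_p: 0.9 },
  });
  console.log(output);
}

runDemo();
```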

felladrin commented 5 months ago

Keep pushing it, @flatsiedatsie!

ngxson commented 5 months ago

Very cool @flatsiedatsie

Just my personal POV: maybe we first make sure it runs absolutely stably on native llama.cpp, then we can see what the problem is when bringing it to wllama. That way we can narrow down the scope when searching for the bug.

> Got a bit further. This model works fine in llama.cpp, but shows some strange behaviour in wllama. I'll try fiddling with some settings.

The generation itself seems to work fine. I suspect it's something to do with special tokens and the chat template.
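
To illustrate what I mean (the Llama-2-chat template below is only an example of what a chat template looks like; the actual special tokens, if any, depend entirely on how the model was trained):

```ts
// A raw prompt for a base model: the model simply continues the text.
const basePrompt = 'Once upon a time';

// A chat-style prompt wrapped in the special tokens of a chat template
// (Llama-2-chat shown purely as an illustration; a model that was never
// instruction-tuned has no template and will not respond like an assistant).
const chatPrompt = '<s>[INST] Tell me a short story. [/INST]';
```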

ngxson commented 5 months ago

So, it turns out the available model is not instruction-tuned, so we can't expect it to be usable for chat. See: https://huggingface.co/BoscoTheDog/bitnet_b1_58-xl_q8_0_gguf/blob/main/tokenizer_config.json
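
A quick way to check this for any Hugging Face repo is to fetch its tokenizer_config.json and look for a chat_template field (the sketch below uses the raw-file form of the link above; the field name is the standard HF convention):

```ts
// Fetch the tokenizer config and check whether the repo ships a chat template.
const url =
  'https://huggingface.co/BoscoTheDog/bitnet_b1_58-xl_q8_0_gguf/resolve/main/tokenizer_config.json';

const config = await (await fetch(url)).json();
if (typeof config.chat_template === 'string') {
  console.log('chat_template found:', config.chat_template);
} else {
  console.log('No chat_template: likely a base (non-instruction-tuned) model.');
}
```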

At this point, it seems to be working at the inference level. There's not much we can do to fix the chat issue (which would require training / fine-tuning the model). I'll release a new version of wllama with up-to-date source code so we're ready when new models appear.

Also note that mul_mat is not (yet) optimized for BitNet, so we can't expect a good level of performance out of llama.cpp.

flatsiedatsie commented 5 months ago

Here's another specialized version of the model, straight from the horse's mouth:

https://huggingface.co/imi2/test/resolve/refs%2Fpr%2F10/bitnet_b1_58-3B-Q1_3-1.63bpw.gguf

via https://www.reddit.com/r/LocalLLaMA/comments/1dnbf6s/33b_bitnet_test_on_1gb_ram_retro_handheld/ (worth a read)

I couldn't get that to run on my Mac yesterday, though, probably because up to an hour ago it wasn't optimized for ARM yet. Things move so fast...

There's also the 700M BitNet model (named "large"), which in Q1_3 takes only 171 MiB and is useful if you're very low on RAM. Its speed is also more usable on low-end devices (getting close to 7 tokens per second on 4 ARM Cortex-A53 cores (phone), and 7.5 tok/s with 4 Cortex-A72 cores (Raspberry Pi 4)).

flatsiedatsie commented 5 months ago

I found a BitNet model that, in theory, is instruction-tuned.

Running it results in the same gobbledygook though.

[screenshot: 2024-06-26 17:54:21]

I'm uploading the Q16 .gguf version I made here.

I also tested it with llama.cpp and it runs, although it seems to have a fascination with cow emojis.

[screenshot: 2024-06-26 17:59:44]
[screenshot: 2024-06-26 17:58:01]

This almost seemed like an answer at first:

[screenshot: 2024-06-26 18:07:30]

flatsiedatsie commented 5 months ago

Ah, this may explain the cow theme:

[screenshot: 2024-06-26 18:19:57]

And the name "Bessie" seems to be a common output for this model too.

flatsiedatsie commented 4 months ago

I must admit I still don't quite grasp why it doesn't work.

I've kept trying with various models, but they all output... repetitive strangeness, often large amounts of newlines.

Even if they are just base models, feeding them "Once upon a time" should result in something of a story, right? And since it does seem to work with the llama.cpp binary, what could be different about the wasm version?
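
One way to narrow this down (a sketch, assuming wllama exposes the tokenize() helper listed in its API docs) is to check whether a BOS token ends up at the start of the tokenized prompt, since the native llama.cpp CLI normally prepends one itself:

```ts
import { Wllama } from '@wllama/wllama';

// Assumed: `wllama` is an instance with the BitNet GGUF already loaded,
// as in the earlier loading sketch.
declare const wllama: Wllama;

// If the first id is not the model's BOS token (commonly "<s>"),
// generation can diverge badly from a native llama.cpp run.
const tokens = await wllama.tokenize('Once upon a time');
console.log(tokens);
```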

ngxson commented 4 months ago

IMO the simplest way is to test whether the same prompt works with the original HF transformers Python library. After all, just because someone uploads a model doesn't mean it will work.

flatsiedatsie commented 4 months ago

OMG I FIGURED IT OUT!

All I had to do was add <s> before the prompt!
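
In other words, roughly (a sketch assuming the same createCompletion call as before; the "<s>" string has to match the model's actual BOS token):

```ts
import { Wllama } from '@wllama/wllama';

// Assumed: `wllama` is an instance with the BitNet GGUF already loaded.
declare const wllama: Wllama;

// Prepend the BOS token text to the prompt; the llama.cpp CLI normally
// adds BOS itself, which seems to be the difference observed here.
const output = await wllama.createCompletion('<s>Once upon a time', {
  nPredict: 128,
  sampling: { temp: 0.7, top_p: 0.9 },
});
console.log(output);
```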

[screenshot: 2024-07-07 20:31:11]

felladrin commented 4 months ago

That's great news @flatsiedatsie!

Now we just need to find a good instruction-tuned one. Keep us posted! 🎉

flatsiedatsie commented 4 months ago

Thanks! Will do!

flatsiedatsie commented 4 months ago

This error could be caused by my "work offline" modification, but I thought I'd share it:

[screenshot: 2024-07-07 21:55:32]