Closed · flatsiedatsie closed this issue 4 months ago

I'm curious: if Llama.cpp adds BitNet support (which looks imminent), will Wllama be able to run BitNet models too by simply upgrading the Llama.cpp version? Or will there be more things that need to happen?

https://github.com/ggerganov/llama.cpp/pull/7931
It totally depends on the llama.cpp side, but I suppose it will be transparent to us (much like how support for Phi-3 was added).
Any second now... ggerganov has already signed off on it.
I was thinking: it could be cool to immediately create a demo showcasing BitNet in the browser via Wllama. I could create one, but so far I haven't been able to compile Wllama myself. I'm trying to get that working again now, though.
I've managed to get compilation working (I reinstalled Docker).
I then tried to use the very latest version of Llama.cpp with BitNet support. Unfortunately I ran into an error:
and another attempt:
and:
Got a bit further. This model works fine on llama.cpp, but has some strange behaviour in wllama. But I'll try fiddling with some settings.
I've put a quick (still broken) demo here: https://github.com/flatsiedatsie/bitnet_in_the_browser/
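For anyone who wants to poke at it, the core of such a demo boils down to something like the sketch below. The calls (loadModelFromUrl, createCompletion) are the wllama API as I remember it from the README; the config-path keys, option names and the model URL are placeholders and may differ between wllama versions.

```ts
// Rough sketch: load a BitNet GGUF in the browser with wllama and run a plain
// completion. Treat config-path keys and option names as placeholders.
import { Wllama } from '@wllama/wllama';

const CONFIG_PATHS = {
  'single-thread/wllama.js': './esm/single-thread/wllama.js',
  'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
  'multi-thread/wllama.js': './esm/multi-thread/wllama.js',
  'multi-thread/wllama.wasm': './esm/multi-thread/wllama.wasm',
  'multi-thread/wllama.worker.mjs': './esm/multi-thread/wllama.worker.mjs',
};

// Placeholder URL: point this at one of the BitNet GGUFs linked in this thread.
const MODEL_URL = 'https://example.com/path/to/bitnet-model.gguf';

async function main() {
  const wllama = new Wllama(CONFIG_PATHS);
  await wllama.loadModelFromUrl(MODEL_URL);

  // Plain text completion, since the available BitNet models are base models.
  const output = await wllama.createCompletion('Once upon a time', {
    nPredict: 128,
    sampling: { temp: 0.7, top_p: 0.9 },
  });
  console.log(output);
}

main().catch(console.error);
```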
Keep pushing it, @flatsiedatsie!
Very cool @flatsiedatsie
Just my personal POV: maybe we should first make sure that it runs absolutely stably on native llama.cpp, then see what the problem is when bringing it to wllama. That way we can narrow down the scope when searching for a bug.
> Got a bit further. This model works fine on llama.cpp, but has some strange behaviour in wllama. But I'll try fiddling with some settings.
The generation seems to work fine. I suspect it's maybe something to do with special tokens and the chat template.
So, turns out the available model is not instruction-tuned, so we can't expect it to be usable for chat. See: https://huggingface.co/BoscoTheDog/bitnet_b1_58-xl_q8_0_gguf/blob/main/tokenizer_config.json
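For reference, a quick way to check this kind of thing yourself: an instruction/chat model normally ships a chat_template field in its tokenizer_config.json, while a base model usually doesn't. A small sketch that can be pasted into the browser console (using Hugging Face's /resolve/ path, which serves the raw JSON):

```ts
// Fetch the tokenizer config referenced above and look for a chat template.
const url =
  'https://huggingface.co/BoscoTheDog/bitnet_b1_58-xl_q8_0_gguf/resolve/main/tokenizer_config.json';

fetch(url)
  .then((res) => res.json())
  .then((cfg) => {
    console.log(
      'chat_template' in cfg
        ? 'chat template found: instruction/chat use is plausible'
        : 'no chat template: likely a base model, chat prompts will misbehave'
    );
  });
```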
At this point, it seems to be working at the inference level. There's not much we can do to fix the chat issue (which requires training / fine-tuning the model). I'll release a new version of wllama with up-to-date source code so we're ready when new models appear.
Also note that mul_mat is not (yet) optimized for BitNet, so we can't expect a good level of performance out of llama.cpp.
Here's another specialized version of the model, straight from the horse's mouth:
https://huggingface.co/imi2/test/resolve/refs%2Fpr%2F10/bitnet_b1_58-3B-Q1_3-1.63bpw.gguf
via https://www.reddit.com/r/LocalLLaMA/comments/1dnbf6s/33b_bitnet_test_on_1gb_ram_retro_handheld/ (worth a read)
I couldn't get that to run on my Mac yesterday though, probably because up until an hour ago it wasn't optimized for ARM yet. Things move so fast...
There's also the 700M BitNet model (named "large"), which in Q1_3 takes only 171 MiB, useful if you're very low on RAM. Its speed is also more usable on low-end devices: close to 7 tokens per second on 4 ARM Cortex-A53 cores (a phone), and 7.5 tok/s on 4 Cortex-A72 cores (a Raspberry Pi 4).
I found a BitNet model that, in theory, is instruction-tuned.
Running it results in the same gobbledygook though.
I'm uploading the Q16 .gguf version I made here.
I also tested it with llama.cpp and it runs, although it seems to have a fascination with cow emojis.
This almost seemed like an answer at first:
Ah, this may explain the cow theme:
And the name "Bessie" Seems to be a good output for this model too.
I must admit I still don't quite grasp why it doesn't work.
I've kept trying with various models, but they all output... repetitive strangeness, often with large amounts of newlines.
Even if they are just base models, feeding them "Once upon a time" should result in something of a story, right? And since it does seem to work with the llama.cpp binary, what could be different about the wasm version?
IMO the simplest way is to test whether the same prompt works with the original HF transformers Python library. After all, just because someone uploads a model doesn't mean it will work.
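Another cheap check, on the wllama side rather than via transformers: dump the token IDs the wasm build produces for the prompt and compare them with what the native llama.cpp binary reports for the same text. A sketch, assuming wllama exposes a low-level tokenize() method (double-check the version you're on):

```ts
import { Wllama } from '@wllama/wllama';

// Print the token IDs the wasm build produces for a prompt, so they can be
// compared with the native llama.cpp tokenization of the same text.
async function dumpTokens(wllama: Wllama, prompt: string): Promise<void> {
  const tokens = await wllama.tokenize(prompt);
  console.log(`wllama token IDs for "${prompt}":`, tokens);
  // If the native side shows a leading BOS token (e.g. ID 1 for <s>) that is
  // missing here, that alone can produce the repetitive output described above.
}
```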
OMG I FIGURED IT OUT!
All I had to do was add <s> before the prompt!
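In wllama terms, the fix amounts to prepending the BOS token to the prompt string before calling createCompletion. A sketch, with the same assumed API as in the earlier loading example:

```ts
import { Wllama } from '@wllama/wllama';

// The base BitNet models need the BOS token; prepending "<s>" to the prompt is
// what turned the repetitive output into a coherent continuation.
async function completeWithBos(wllama: Wllama, prompt: string): Promise<string> {
  return wllama.createCompletion('<s>' + prompt, {
    nPredict: 128,
    sampling: { temp: 0.7 },
  });
}

// Usage: completeWithBos(wllama, 'Once upon a time')
```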
That's great news @flatsiedatsie!
Now we just need to find a good instruction-tuned one. Keep us posted! 🎉
Thanks! Will do!
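Once an instruction-tuned BitNet GGUF does show up, the only extra step should be wrapping the user message in that model's own chat template before handing it to createCompletion. Purely as an illustration, a hypothetical Llama-2-style wrapper (not the template of any model linked above; the real one would come from the model's tokenizer_config.json):

```ts
import { Wllama } from '@wllama/wllama';

// Hypothetical example only: a Llama-2-style chat template. Replace this with
// the actual chat_template of whatever instruction-tuned BitNet model appears.
function formatChatPrompt(systemMsg: string, userMsg: string): string {
  return `<s>[INST] <<SYS>>\n${systemMsg}\n<</SYS>>\n\n${userMsg} [/INST]`;
}

async function chat(wllama: Wllama, userMsg: string): Promise<string> {
  const prompt = formatChatPrompt('You are a helpful assistant.', userMsg);
  return wllama.createCompletion(prompt, { nPredict: 256 });
}

// Usage: chat(wllama, 'Tell me a short story about a cow named Bessie.')
```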
This error could be caused by my "work offline" modification, but I thought I'd share it: