turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Some Yi-34b models can't produce spaces. One I just quantized does. Regression? #353

Closed tau0-deltav closed 4 months ago

tau0-deltav commented 4 months ago

Writing while sleepy, but I've listed two models below and one of them is malfunctioning, so I hope the most important stuff is communicated clearly. I didn't see what the original bug that 0.0.13.post1 was supposed to fix looked like; I don't suppose it looked like this? Spot how the spaces disappear after the second paragraph and never show up at all in the token breakdown.

(screenshot: RPMERGE-1) This is https://huggingface.co/brucethemoose/Yi-34B-200K-RPMerge-exl2-40bpw/tree/main. I think this is notable because this model used to work, and now it doesn't. Given all the trouble people report with tokens from 34B models in the first place, it would be less of a surprise if it had simply never worked.

Meanwhile, I (shamelessly) quantized https://huggingface.co/dreamgen/opus-v1-34b today. (screenshot: no0001_26-02-no01-07h01m38s) Despite it being a fish out of water, and despite dreamgen themselves saying that GGUF is borked (albeit for the Mistral-sized one) and that their tokenizer may be broken, we get both the ▁ spaces included in the tokenisation of the input text (thanks for that, by the way; I wasn't feeling inspired) and the generated text after the second paragraph.

You might note that Bruce uploaded a new tokenizer.model more recently to his own repos for his Yi merge quants, but I think that addressed a problem with text-generation-webui? Previously these models worked fine without one, and including it changes nothing (it's the same hash as the base model's anyway).

I am using exllamav2-0.0.14+cu121-cp311-cp311-linux_x86_64.whl with the latest stable NVIDIA/CUDA/Linux everything: Torch 2.2.0 and whatever drivers a 3090 on up-to-date Arch Linux ought to have. I spotted the problem in Tavern with a pre-0.0.14 self-build (`cd exllamav2 && git pull && pip install .`, possibly with `-e`). The issue has persisted through restarts and kernel upgrades, though I've not replaced the venv yet (my internet is slow, my experiences with pip negative, and my sloth unfathomably vast).

My feeling is that the issue is here, given the patch history:

Opus has a vocab size of 64000.

Bruce's two Yi merges that I tested (I also managed to lose a CodeLlama model to fuse3 today; Phind was going to be my last check) have a vocab size of 64002.

(screenshot: no0001_26-02-no01-07h07m06s) :thinking:
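For what it's worth, the vocab sizes above are just read from each quant's config.json; a quick way to check them (the directory names here are examples):

```python
import json

# Example directories; point these at the actual quant folders.
for repo in ("opus-v1-34b-exl2", "Yi-34B-200K-RPMerge-exl2-40bpw"):
    with open(f"{repo}/config.json") as f:
        print(repo, json.load(f)["vocab_size"])  # prints 64000 vs 64002
```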

LMK if it's not immediately reproducible, and don't try too hard: if it's not already broken for you, then I expect I'm the one who broke it. I'm a little spooked that there haven't been any bug reports for this since it showed up days ago, given that breaking computers is my hobby, but I'm pretty sure I've ruled out my own computer-horror in this case. If it would be helpful, I can give it a go with an older wheel or with changes to the config files. Peace.

(For real, don't chase this one too far; my computer's basically haunted.)

turboderp commented 4 months ago

I don't think it has to do with the vocab size, but there was a change to the decoding logic in 0.0.14 to accommodate Qwen, which for some reason is extremely (!) inefficient when using the decode function from HF Tokenizers. The change should also have worked with other models, and I've not experienced this issue with the 30 or so models I usually test on, so I'm curious to see what the deal is.

It definitely has to do with how the leading spaces are encoded. I note that token 648 decodes to `it` in the screenshot, but according to the tokenizer.json file it should be `▁it`. Now, the tokenizer should auto-detect the `▁` prefix and replace it with a space, and I don't see any way it could replace it with an empty string.
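To be concrete about what "should" happen: roughly, with HF Tokenizers (the token ID is the one from the screenshot, and the file path is just an example), the raw piece looks like this:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # example path inside the quant directory

piece = tok.id_to_token(648)            # expected: "▁it"
text = piece.replace("\u2581", " ")     # the "▁" metaspace marker should become a real space
print(repr(piece), repr(text))          # '▁it' and ' it'; never 'it' with the space dropped
```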

One thought is that the tokenizer.model file might be incorrectly built, somehow (?), but you could try removing that file and then the tokenizer should fall back on using tokenizer.json instead.
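If it helps, a minimal round-trip check along these lines should show whether the fallback behaves (exact API details may vary slightly by version, and the model path is an example):

```python
# Run once with tokenizer.model in place, then rename it away and run again
# so the tokenizer falls back to tokenizer.json.
from exllamav2 import ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/Yi-34B-200K-RPMerge-exl2-40bpw"  # example path
config.prepare()

tokenizer = ExLlamaV2Tokenizer(config)
ids = tokenizer.encode("Spot how the spaces disappear after the second paragraph.")
print(tokenizer.decode(ids))  # the spaces should survive the round trip
```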

Anyway, thanks for pointing this out. Even if it works without the tokenizer.model file, it's meant to be able to work with either, so I'll try to reproduce it (in a few hours) and find out what the deal is.

tau0-deltav commented 4 months ago

(Read the next comment first, it's more important.) -ed

> Anyway, thanks for pointing this out. Even if it works without the tokenizer.model file, it's meant to be able to work with either, so I'll try to reproduce it (in a few hours) and find out what the deal is.

You're most welcome. I wish I could be more help, but instead it looks like I'm misleading you, because:

Nope. The merged Yi model in question doesn't work either way, tokenizer.model or not. That file is, for our purposes, a red herring. I tried adding it yesterday as a fix because I noticed it was newer than the rest of the RPMerge repo. But the model originally worked out of the box with tabby/exui, without it.

Here's another model with the same merge scheme by the same author, but a LoneStriker quant. It has exactly the same problem: you'll observe the 64002 vocab size and the absent (irrelevant?) tokenizer.model. This did work until sometime before 0.0.14, but I'm not ruling out 0.0.13.post2 and post1 when I say that.

The tokenizer.model isn't relevant here except that it may be why there haven't been more bug reports: most people use the ooba _HF loader (which is preferred), which perhaps(?) works around this by using tokenizer.model.

Also worth noting is that the tokenizer.model of my working dreamgen/opus quant and the one Bruce provided for RPmerge checksum the same: they're identical. Here's where it came from.
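A quick way to compare the two files byte-for-byte, for anyone following along (the paths are examples):

```python
import hashlib

def sha256(path: str) -> str:
    """Return the SHA-256 hex digest of a file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Example paths; identical digests mean identical tokenizer.model files.
print(sha256("opus-v1-34b-exl2/tokenizer.model"))
print(sha256("Yi-34B-200K-RPMerge-exl2-40bpw/tokenizer.model"))
```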

Curiously, the opus-v1-34b finetune explicitly adds new functional instruct tokens. Or tries to; I haven't tested that yet. It does at least have spaces.

I'm going to try testing some things (editing the config, older wheels), I think. I encourage you to peruse the repos just to glance at their .json files.

Looking back on the comments, it's a bit clearer now with RPMerge: the tokenizer.model was a fix for the ooba exl2_HF loader, because all of the _HF loaders need a tokenizer.model (though IIRC it will generate one from a .json?). I don't use ooba, which is why I showed the issue in exui. This exact repo did not need the tokenizer.model, and it is not fixed by its presence for the native-sampler servers.

These are all the text files in my working quant (including a misleading calibration file, but I doubt that matters): opus-v1-34b-b4h8-EXL2.tar.gz. Omitted is the tokenizer.model with SHA256 386c49cf943d71aa110361135338c50e38beeff0a66593480421f37b319e1a39, which is the only one I've seen.

You can get an equivalent 'broken despite no changes since first upload' set of .json files from the LoneStriker DARE Megamerge V8 quant.

tau0-deltav commented 4 months ago
turboderp commented 4 months ago

The error seems to be related to normalization rules. Normalization is evil, and apparently there are some Yi tokenizer models which switch the rules up a bit so you can't extract token strings with the usual method. But I've added a workaround in the latest commit which seems to work for now.
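To sketch the general idea (an illustration of the kind of workaround, not necessarily the exact code in the commit): the usual method decodes each token ID on its own, and some normalization rules eat the leading metaspace in that case. Decoding the token after a fixed, ordinary anchor token and taking the difference keeps the leading space intact:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # example path

def piece_naive(token_id: int) -> str:
    # Usual method: decode the token on its own. Under some normalization
    # rules the leading space gets stripped here.
    return tok.decode([token_id])

def piece_robust(token_id: int, anchor_id: int) -> str:
    # Workaround sketch: decode an ordinary anchor token alone and together
    # with the target token, then take the difference. A leading space
    # contributed by the target token survives normalization this way.
    anchor = tok.decode([anchor_id])
    both = tok.decode([anchor_id, token_id])
    return both[len(anchor):]
```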

tau0-deltav commented 4 months ago

I'm not sure anyone understands the Yi tokenizer.

Thank you, turboderp, I'll `pip install exllamav2/` now. I'll note for the record that I've ended up on Torch 2.2.1 since last night, in case I'm back again. Hope not!

tau0-deltav commented 4 months ago

Yeah, this appears fixed; as fixed as Yi can be, at any rate. Prose is emitted, neurons are activated. Monkey is pleased.

Thanks again.