turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License
2.68k stars 214 forks

Very poor output quality #47

Open calebmor460 opened 1 year ago

calebmor460 commented 1 year ago

I have noticed that while it massively increases inference speed, it massively decreases output quality: instruct models become very obstinate and give completely irrelevant responses, words become misspelled, it repeats lines over and over, and it sometimes spams Chinese characters.

turboderp commented 1 year ago

I haven't seen this at all. What model are you using? And what settings?

calebmor460 commented 1 year ago

Tried on Chronos 13B, WizardLM 13B, and Pygmalion 7B. I used temperatures between 0.5 and 1 and a context length of 2048. Lower temperatures do seem to wrangle it into behaving a little more, but I have to lower the temperature so much that the output is too “dry” to be useful. However, using the same settings and models with normal GPTQ yields satisfactory results (albeit at unsatisfactory speed).

turboderp commented 1 year ago

And just to be clear, is this in ExLlama's web UI or in Ooba?

calebmor460 commented 1 year ago

Occam's fork of KoboldAI that allows using exllama.

Using GPTQ, said fork behaves normally.

Panchovix commented 1 year ago

Not OP, but for context, the Kobold fork is here if you want to check it, turbo.

https://github.com/0cc4m/KoboldAI/tree/4bit-plugin (KoboldAI implementation to support GPTQ and exllama)

https://github.com/0cc4m/exllama (exllama fork on transformers branch, which builds exllama to work on Kobold)

They added Kobold samplers to that exllama fork.

So it seems these samplers are added:

[image: screenshot of the sampler settings]

(I'm not sure about rep pen slope though)

turboderp commented 1 year ago

Okay. I really have enough work cut out for me with this, but I guess I should try installing Kobold at some point to see how they're using it. I would assume they're just taking the logits and passing them to the same samplers they use for other models, and that should just work. But there are some peculiarities to keep in mind, specifically regarding the cache, and that "context tokens" slider looks a little suspect. But idk.

calebmor460 commented 1 year ago

You think maybe the code wasn't hooked up to the context correctly and it's actually running with an incredibly low context size?

turboderp commented 1 year ago

I'm not sure what that slider does, but if it truncates the cache that would definitely lead to degenerate output since the position embeddings for cached entries would be wrong. But, looking at the Transformers wrapper they added I think it's just an issue with how the cache is being passed around. It has to stay in sync with the sequence for every forward pass.

E.g. if the model generates an EOS token, and their generator doesn't add that to the running sequence, it has to be removed from the cache. Or something similar along those lines. The cache being out of sync is the kind of thing which might leave it working poorly without crashing. But I'd have to install it and run it in a debugger to make sure. Which I will. After doing some other stuff first.
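To illustrate the invariant, here is a minimal sketch of a generation loop with a sanity check. It assumes the generator, cache and tokenizer are set up as in the scripts further down this thread, and it assumes ExLlamaCache exposes a current_seq_len attribute (treat that as an assumption, not documented API):

generator.gen_begin(ids)                      # resets the cache and re-evaluates the prompt
for _ in range(max_new_tokens):
    token = generator.gen_single_token()      # appends to generator.sequence and grows the cache
    # The cache holds keys/values for every token except the one about to be fed next, so it
    # should stay exactly one token behind the running sequence. If a frontend drops a token
    # (e.g. a generated EOS) from its sequence, the matching cache entry has to go too, or the
    # position embeddings on the next forward pass will be wrong.
    assert cache.current_seq_len == generator.sequence.shape[-1] - 1
    if token.item() == tokenizer.eos_token_id:
        break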

calebmor460 commented 1 year ago

Alright then, thank you for taking a look at it

0cc4m commented 1 year ago

@turboderp Apologies for this, this should have gone to me directly.

I do use the KoboldAI samplers, here's the code if you're interested. It seems to work the first time or times you generate, but breaks afterwards. I'm not yet sure why. I do call generator.gen_begin(gen_in), which resets the cache as far as I know.

turboderp commented 1 year ago

Yes, using the KoboldAI samplers is the obvious choice for integrating into Kobold, so that's great. There's nothing special about the logits, after all. In fact you should just be able to bypass ExLlamaGenerator altogether and call the forward pass directly.

I'm going to install the 4-bit branch and have a play with it later today. But I don't see anything immediately wrong with how you're using it. gen_begin() should indeed reset the cache (gen_begin_reuse() should work as well, but much faster in some cases), and you're appending every token produced by the forward pass so the cache should stay in sync with the sequence.

I'll have a look though. It shouldn't be too hard to spot if the cache and the sequence go out of sync somehow.
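For reference, a rough sketch of that kind of direct integration, with a stand-in external sampler where Kobold's samplers would go. It assumes model, cache and tokenizer are constructed as in the test script further down this thread, that model.forward(ids, cache) returns logits and appends the new positions to the cache, and that ExLlamaCache exposes current_seq_len; exact signatures may differ between versions:

import torch

ids = tokenizer.encode("USER: Write a haiku about llamas.\nASSISTANT:")
cache.current_seq_len = 0                      # start from an empty cache
logits = model.forward(ids, cache)             # prefill: the cache now covers the whole prompt

for _ in range(128):
    probs = torch.softmax(logits[0, -1].float(), dim = -1)
    token = torch.multinomial(probs, 1)        # any external sampler can be dropped in here
    if token.item() == tokenizer.eos_token_id: break
    logits = model.forward(token.unsqueeze(0), cache)   # feed only the new token; cache stays in sync
    ids = torch.cat((ids, token.unsqueeze(0)), dim = -1)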

0cc4m commented 1 year ago

It's not yet that user-friendly to install: you need to clone the branch, run install_requirements.sh, and then install the exllama package into the conda env with ./commandline.sh, pip install git+https://github.com/0cc4m/exllama. Then you can run it with ./play.sh

calebmor460 commented 1 year ago

I can confirm my issue is no longer present after Occam's latest commit to his KoboldAI fork, thank you very much for your help.

0cc4m commented 1 year ago

But... I didn't fix anything yet.

turboderp commented 1 year ago

Me neither. I'm still struggling to get it to load a model. :)

0cc4m commented 1 year ago

@turboderp Let me know if you need help.

blauzim commented 1 year ago

I'm seeing a similar degradation in output quality. It used to match AutoGPTQ output quite closely, but the latest releases seem to be producing different results. I can get back the previous quality by setting

ExLlamaConfig.fused_attn = False

Hope this helps chase things down.
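For reference, the flag lives on the config object, so a minimal sketch of applying the workaround looks like this (paths are placeholders):

from model import ExLlama, ExLlamaConfig

config = ExLlamaConfig("models/Nous-Hermes-13B-GPTQ/config.json")        # placeholder path
config.model_path = "models/Nous-Hermes-13B-GPTQ/model.safetensors"      # placeholder path
config.fused_attn = False           # fall back to the regular (non-fused) attention path
model = ExLlama(config)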

turboderp commented 1 year ago

Well, it's up and running. I was just using a model that didn't have any gptq_bits key in its config and I got stuck on why it wasn't being recognized. Kind of a lot going on in aiserver.py. Maybe you should refactor to less than 10k lines? ;) But it's fine now.

I had to skip the call to tpool.execute() in generate(), just calling model.core_generate() directly in order to debug in PyCharm, but I don't see that having any side effects in this case.

I'm just not seeing anything amiss. It's correctly resetting the cache on each pass, then generating one token at a time and the cache grows as it should, staying exactly one token behind the sequence, and there really isn't much else happening.

The output also looks reasonable. Just trying with 7B Llama, but with the storywriter preset it is telling me a very cute little story that doesn't seem to be degenerating with multiple passes. It does the thing that small models like to do where it starts repeating itself, but you can throw it off by adding in "Until suddenly..." or some such, and all that behaves as I'd expect.

If I swap out gen_begin with gen_begin_reuse, it even seems to be correctly reusing the cache and only re-evaluating the prompt from the first changed token, to further show that it's working. I'm not sure how useful that feature is in Kobold since you're not truncating the sequence in larger steps, so it would only accelerate things until the context is filled up. And prompt eval is really fast already, so idk.

But all in all... I can't find anything wrong at the moment.

turboderp commented 1 year ago

The fused attention step is mathematically equivalent to the regular attention, but there might be slight differences related to numerical precision. Maybe some of the sampling methods are extremely sensitive to that?

It would help if I could reproduce it. Exactly what model and settings are you using to make this happen?

blauzim commented 1 year ago

Here's an adjusted snippet of the code, nothing too complicated. llama is a Python class which executes a prompt. I've had the same issue with multiple different models from TheBloke. It might just be a user error in how I'm using the exllama code. I've set up my code to run with either exllama, AutoGPTQ, GPTQ-for-LLaMa, or llama.cpp, so I've been comparing them and noticed this difference/issue.

llama.model_path = "models/Nous-Hermes-13B-GPTQ"
llama.tokenizer_model_path = llama.model_path + "/tokenizer.model"
llama.model_config_path = llama.model_path + "/config.json"
llama.model_safetensors_path = llama.model_path + "/" + [x for x in os.listdir(llama.model_path) if x.endswith('.safetensors')][0]
llama.config = ExLlamaConfig(llama.model_config_path)
llama.config.model_path = llama.model_safetensors_path
# llama.config.fused_attn = False
llama.config.max_seq_len = 2048
llama.model = ExLlama(llama.config)
llama.cache = ExLlamaCache(llama.model)
llama.tokenizer = ExLlamaTokenizer(llama.tokenizer_model_path)
llama.generator = ExLlamaGenerator(llama.model, llama.tokenizer, llama.cache)
llama.generator.settings.token_repetition_penalty_max = 1.2

with torch.no_grad():
    # torch.manual_seed(42)
    llama.generator.end_beam_search()

    ids = llama.generator.tokenizer.encode(prompt)
    #llama.generator.gen_begin(ids)
    llama.generator.gen_begin_reuse(ids)

    for i in range(request.max_tokens):
        token = llama.generator.gen_single_token()
        llama.generator.gen_prune_left
        if token.item() == llama.generator.tokenizer.eos_token_id: break
        for eos_token in stopping_criteria_list :
            if llama.generator.sequence_ends_with(eos_token) :
                break
    generated_ids = llama.generator.sequence[0][len(ids[0]):]
    generated_text = llama.generator.tokenizer.decode(generated_ids)

turboderp commented 1 year ago

I'll have to try and see if I can reproduce it. One thing that stands out is the call to gen_prune_left(), which I haven't looked at in ages. I think it's buggy when called during a beam search. Even aside from that, calling it in the generation loop would continually reset the cache, so performance would suffer a lot. Maybe it's just a copy/paste error?

Hermes is a model I haven't tested, though. I have found some finetunes to be strangely sensitive to rounding errors. I'll have to check that one out I guess.

turboderp commented 1 year ago

I wrote a quick little script to try and spot any difference in the output between fused and regular attention:

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import torch
import os, glob

torch.set_grad_enabled(False)
torch.cuda._lazy_init()

model_directory =  "/mnt/str/models/_test_models/TheBloke_GPT4All-13B-snoozy-GPTQ/"
# model_directory =  "/mnt/str/models/llama-13b-4bit-128g/"

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]

# Create config, model, tokenizer, generator

config = ExLlamaConfig(model_config_path)
config.model_path = model_path
model = ExLlama(config)
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(tokenizer_path)
generator = ExLlamaGenerator(model, tokenizer, cache)
generator.disallow_tokens([tokenizer.eos_token_id])
generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.6
generator.settings.top_p = 0.5

# Build a growing prompt

print ("")
print ("------------------- Regular attention --------------------")
print ("")

config.fused_attn = False
prompt = "Once upon a time,"
gen_tokens = 128
torch.manual_seed(69420)
for i in range(5): prompt = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(prompt)

print ("")
print ("------------------- Fused attention --------------------")
print ("")

config.fused_attn = True
prompt = "Once upon a time,"
gen_tokens = 128
torch.manual_seed(69420)
for i in range(5): prompt = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(prompt)

This seems to consistently produce roughly the same output.

Now, I say roughly, but it's important to note that even with a fixed seed the implementation is always ever so slightly non-deterministic, which comes down to floating-point addition being non-associative and CUDA providing no guarantees about the order in which threads are launched. The difference is always small, but it's made a little larger by the use of FP16, where some other implementations use FP32, at least for intermediate results.
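To make the non-associativity point concrete, here is a tiny standalone demonstration (not exllama code):

import torch

a = torch.tensor(1024.0, dtype = torch.float16)
b = torch.tensor(0.5, dtype = torch.float16)
c = torch.tensor(0.5, dtype = torch.float16)

print((a + b) + c)    # tensor(1024., dtype=torch.float16) -- each 0.5 is rounded away at 1024
print(a + (b + c))    # tensor(1025., dtype=torch.float16) -- 0.5 + 0.5 survives as 1.0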

It's larger still in the fused attention, because at the very end I've optimized away the addition of the residual connection by just doing the last matmul straight on top of the residual state. Mathematically that's the same thing, but it does change the order of additions quite a bit, which can make the rounding behavior diverge a little more in the end.

Still, the differences are small in any case, and even though the generation happens in multiple steps, I'm just not seeing much divergence. And both are staying coherent, although that Hermes model really likes to write song lyrics for some reason. But it seems equally likely to do that with or without fused attention.

blauzim commented 1 year ago

Thanks, I can run the sample code you provided and it works cleanly. So it must be an issue in the code I'm using / how exllama is being called. The code is trying to be general between all the various GPTQ implementations, so it might have some cruft causing issues. I'll do more testing and see if I can find out why.

blauzim commented 1 year ago

Did some further digging. It seems to be related to creating the generator and tokenizer objects inside the "llama" class. When they're created at the top level it works, but when the exllama objects are created in a class it shows the fused-attention difference. Could it be some scoping issue? For the sample below it generates different creative texts, but when used for instruction following it produces very bad results when fused attention is on.

output :

Injected compiler path: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64

------------------- With Fused Attention == True version --------------------

Once upon a time, the only way to get your hands on new music was by waiting for it to come out or finding an underground tape trading scene. Nowadays you can stream and download songs instantly from anywhere in world with just few clicks of mouse button! The internet has also made sharing information about bands much easier than before – through social media sites like Facebook & Twitter as well blogs that cater specifically towards independent musicians (like this one). This makes discoverability so important because now anyone who wants access to their favorite band’s latest single without having any connection within industry gatekeepers such us record labels A&R people

Injected compiler path: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64

------------------- With Fused Attention == False version --------------------

Once upon a time, the only way to get your hands on new music was by waiting for it to be released and then going out to buy […] Filed Under: Entertainment Tagged With: Apple Music, Beats 1 Radio Station

code :

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import torch
import os, glob

torch.set_grad_enabled(False)
torch.cuda._lazy_init()

######  set this to change from / to fused attention
_use_fused_attention = False

print ("")
print ("------------------- With Fused Attention == " + str(_use_fused_attention) + " version --------------------")
print ("")
model_directory =  "../../llama/nous-hermes-13b-gptq"
class llama_ex :
    model_directory : str | None = None
    tokenizer_model_path : str | None = None
    model_config_path : str | None = None
    model_safetensor_path : str | None = None
    n_ctx : int = 2048
    config : ExLlamaConfig | None = None
    model : ExLlama | None = None
    cache : ExLlamaCache | None = None
    tokenizer : ExLlamaTokenizer | None = None
    generator : ExLlamaGenerator | None = None

    def __init__(self, *args, **kwargs):
        for this_param in list(set(dir(self)) & set(kwargs.keys())) :
            setattr(self, this_param, kwargs[this_param])
        self.model_tokenizer_path = os.path.join(self.model_directory, "tokenizer.model")
        self.model_config_path = os.path.join(self.model_directory, "config.json")
        self.model_safetensors_path = os.path.join(self.model_directory, [x for x in os.listdir(self.model_directory) if x.endswith('.safetensors')][0])
        self.config = ExLlamaConfig(self.model_config_path)
        self.config.model_path = self.model_safetensors_path
        self.config.fused_attn = _use_fused_attention
        self.model = ExLlama(self.config)
        self.cache = ExLlamaCache(self.model)
        self.tokenizer = ExLlamaTokenizer(self.model_tokenizer_path)
        self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)

llama = llama_ex(model_directory = model_directory)

prompt = "Once upon a time,"
llama.generator.settings.token_repetition_penalty_max = 1.5
llama.generator.settings.temperature = 0.5
llama.generator.settings.top_p = 0.1
llama.generator.settings.top_k = 40
gen_tokens = 128
torch.manual_seed(69420)

generated_text = llama.generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(generated_text)

calebmor460 commented 1 year ago

So does this mean the fix has been rolled into the code, and if so, what files do I replace?

turboderp commented 1 year ago

There isn't a fix, no, because I haven't been able to reproduce the problem yet. I'm working on a thorough perplexity test to run with all the different possible code paths, which should highlight if there are any significant differences in how the model evaluates depending on tuning parameters.

I know there are some numerical differences, at least, and it's possible that this divergence is just the result of the model ending up at a "tipping point" and then going down one path or another based on some small shift in the probabilities. That's not the same as poor output quality, though. There isn't a "correct" choice for any one token. So unless something is actually breaking and resulting in a broken probability distribution, what you really want is to avoid those tipping points in the first place.

I'll know more once these tests are set up. In the meantime you could try the new typical sampling feature, which does seem to produce more consistent results overall.
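For example, on the generator settings (the attribute name typical is an assumption based on the feature name; check generator.py for the exact field):

generator.settings.temperature = 0.7
generator.settings.typical = 0.25        # assumed name for the typical-sampling threshold; 0 would disable it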

calebmor460 commented 1 year ago

I will try that when I get a chance to, thank you

QM60 commented 1 year ago

For what it's worth, I've noticed output quality issues as well in Kobold, which I assumed were related to the sampling swap. However, I noticed similar issues with Ooba's very recent exllama support, which doesn't touch exllama's native sampling.

One revealing thing. I was using Wizard-Vicuna-30b, which uses </s> as part of its prompt format. I noticed that I got "</s>" (as in the literal string, not the EOS token) creeping into the output, which never happened with normal transformers. This suggests that exllama is not interpreting </s> as a special token. If it doesn't check special_tokens_map.json, that would explain some things. In addition, I had issues with very early/jarring EOS, and contraction fumbling (emitting words like can'm, don've, etc.) which is normally only an issue with GPTQ models that don't use desc_act. Neither happened with regular GPTQ. The early stopping may be a symptom of incorrect interpretation of </s> in the prompt, but I'm not sure if that's plausible for contraction fumbling.

turboderp commented 1 year ago

Kobold doesn't use ExLlama's sampling, only logits from the model. Ooba does use the native sampling, though, as well as ExLlama's tokenizer which is just a straight SentencePiece instance reading the model file directly. I'll have to dig into the Transformers tokenizer to see if it does something special.

The special tokens map shouldn't lead to you seeing "</s>" in the output, especially when the file that defines that string isn't being read. I can look into some ways to take special_tokens_map.json into account, but it's going to be a little tricky when you have models on HF where that file looks like this:

{
  "bos_token": "</s>",
  "eos_token": "</s>",
  "pad_token": "[PAD]",
  "unk_token": "</s>"
}

The contractions are interesting, at least. Seems too oddly specific to not be a tokenizer issue, but I'm not sure what to make of it. I'll try to see if I can reproduce it. Have you seen it in Kobold too or just in Ooba?

QM60 commented 1 year ago

Seen it in both, but it's happening constantly in Ooba, every other reply. It's very weird. It manifests in a few ways: just forgetting to finish (doesn'), finishing with a weird token (can'the), or cutting off the whole generation at a contraction (doesn'<EOS>). Again, the only other time I saw this was with 128g CUDA models without act-order, but it was rarer, and I assumed it was just quantization error. Fascinating that issues can manifest like this.

For the emitting </s> issue, I suspect this is happening because it's interpreted as a normal string (in the prompt from the chat history), causing the model to assume it should end generations with it. GPTQ parses it as EOS. This might be causing some issues, but I tried removing the EOS tokens from the prompt entirely, and the contraction glitches are still there. Weird. I hear you on the weird model configs, although for models that expect EOS in the prompt (like those trained on vicuna 1.1 formats) I should hope they didn't do that.

For what it's worth, the initial exllama branch in Kobold (which was very early, before most of your optimizations, or even support for non-groupsize models) didn't have any generation bugs at all that I could detect.

Panchovix commented 1 year ago

Sorry for the question here, but the only samplers missing now are tfs (tail free sampling) and top_a, right?

turboderp commented 1 year ago

@QM60 : Are you seeing it with different models or just a particular one? Is there one that does it more than others?

turboderp commented 1 year ago

Okay, so I did a lot of in-depth testing and I did discover a bug in the handling of the cache during fused attention. It was a little extra sneaky because it doesn't manifest on 7B, and I've been testing a bunch on 7B because it's faster, and just validating on 13B and 33B, but not thoroughly enough to notice the difference.

Anyway, I don't know if this fixes all the issues, but it definitely improves the output on paper.

EyeDeck commented 1 year ago

Interesting, I was just investigating the "contraction disease" issue last night, because I could swear it only started showing up after I pulled a few days ago, so I was trying different commits to narrow down exactly when it started. I was going to spend more time on it today, but it looks like 1ef63db7861ad87e99241a929a8c04e16457b7b3 did indeed fix it.

For whatever little it's worth anymore, the narrowest I'd gotten it (before realizing I was dysfunctionally tired, heh), was

3c8699434fb24d1a2bfdd29454b04fa320546135 (older) ![image](https://github.com/turboderp/exllama/assets/2722970/d4e019c0-0ebd-48a3-9de2-a97ef6638e34)
896da5d3b59252cba40aea6818621b2fbc77fbf1 ![image](https://github.com/turboderp/exllama/assets/2722970/f8116b0f-3e9f-474d-a9ed-278bedcfd837) ![image](https://github.com/turboderp/exllama/assets/2722970/1cb31e4c-6338-4aaa-a7da-3adaa70461c5) ![image](https://github.com/turboderp/exllama/assets/2722970/0dfb7f3b-4765-4890-ac9d-4663f2cbae58)
dd63e0734b7df5fcbd86d30ad82a582da25a3a73 (latest for comparison) ![image](https://github.com/turboderp/exllama/assets/2722970/e8a5d1b3-a480-4c38-9eae-a5d4497d6b4d) ![image](https://github.com/turboderp/exllama/assets/2722970/07cde05d-32a7-4d59-96fc-6df171681920) ![image](https://github.com/turboderp/exllama/assets/2722970/41fba2aa-dac0-4ce9-addc-e1064c2e95e7)

So it's very plausible that b65d774c1bd4fbf23405e9f97e2e58da8109543b introduced it.

QM60 commented 1 year ago

Anyway, I don't know if this fixes all the issues, but it definitely improves the output on paper.

Massive improvement for me, the contraction issues are gone. The only remaining issue is the </s> one. Which is interesting, because sentencepiece is deliberately designed NOT to read any control tokens (including EOS) from a normal text stream - which makes complete sense! But the normal llama tokenizer does parse </s> as one token, and vicuna-1.1 and derived models rely on seeing EOS in the prompt. Empirically, for Wizard-Vicuna-30b, omitting it from the prompt is not too bad, but makes it prone to emit very short or even empty replies fairly frequently.

Source on vicuna 1.1 prompt format, where they explicitly suggest using </s>: https://github.com/lm-sys/FastChat/blob/7ae721fa3c881e1e24cf181305d127a316acd463/docs/vicuna_weights_version.md#example-prompt-weight-v11

Honestly, I could see the argument that llama's behavior is a bug, but that's what it is. No API lets you inject token IDs into a prompt, so people are used to embedding control tokens into text. I'm not sure if using plain sentencepiece instead of LlamaTokenizer will cause other issues, though. Transformers seemed to have multiple tokenization bugs for Llama, which suggests something about it might be tricky.
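As an illustration of that kind of workaround, here is a hypothetical helper (not part of exllama) that splits the prompt on the literal "</s>" and splices in the EOS token id manually. It assumes ExLlamaTokenizer.encode returns a (1, n) tensor and that eos_token_id is exposed, as seen elsewhere in this thread:

import torch

def encode_with_eos(tokenizer, text):
    ids = []
    pieces = text.split("</s>")
    for i, piece in enumerate(pieces):
        if piece:
            ids += tokenizer.encode(piece).view(-1).tolist()
        if i < len(pieces) - 1:
            ids.append(tokenizer.eos_token_id)
    return torch.tensor([ids], dtype = torch.long)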

QM60 commented 1 year ago

Alright, I celebrated too fast. Contractions are okay, but there is still something off about the output from exllama. I suspect it's a sampling issue, perhaps related to sampling order. To identify the differences, I recommend using the sphinx-moth preset from Ooba, which has a very high temperature mediated by strict top-p and top-k. I think this is more susceptible to whatever the difference is.

Here is a simple comparison using Ooba with a fixed seed: llama 30B (plain) quantized to 4-bit with act-order and no groupsize. Settings included. The difference in sanity speaks for itself, although I'm kinda fond of the exllama version's energy.

exllama comparison.txt

turboderp commented 1 year ago

Well, the model is still FP16, regardless, whereas GPTQ-for-LLaMa can also be FP32 depending on how you use it. So it is more susceptible to numerical instability, even if it has no meaningful impact on perplexity.

I would question if sampling with a high temperature is really a good way to get more "varied" output anyway, compared to typical or tail-free sampling, or some such. It makes a lot of assumptions about how the model is supposed to act way outside of the conditions it was trained in, with some inherent assumptions about numerical accuracy. So getting it to work just right by tweaking magic numbers is a balancing act. You could try running without fused attention and MLP, or maybe try lowering the top-p threshold to see if that produces more equivalent output by cutting out some more of the noise on the tail-end of the probability distribution.

I'll have a look anyway a little later. There might be issues with the order of operations. We'll see. But generally speaking, to produce exactly the same output as other implementations in the extremes, I'd have to emulate those other implementations much more carefully, maybe even down to timings in the kernels to get CUDA threads to launch in the same order.

And of course FP32 would eat up a lot more VRAM. I have other ideas for improving accuracy, though, building on the LoRA support I'm currently working on.

QM60 commented 1 year ago

I was skeptical that FP32 would explain that kind of a difference in output, so I looked into it, and it's just a sampling difference. For exllama, top_p sampling inherits the base probabilities, from before top_k was applied. That means the same value of top_p is potentially a lot more strict in transformers, where probabilities are normalized at each step.

To confirm, I was able to get the same subjective level of output quality simply by adding one line after if top_p > 0.0: in generator.py to rescale probabilities.

top_probs = top_probs / torch.sum(top_probs, dim = -1)

This could explain some output differences reported here.
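An illustration of the difference (not the actual generator.py code): apply top-k, then top-p, with and without renormalizing in between. Without the renormalization the truncated probabilities sum to less than 1, so the same top_p cutoff keeps more tokens, i.e. it is effectively less strict:

import torch

def truncate(probs, top_k, top_p, renormalize_between):
    probs, idx = torch.sort(probs, descending = True)
    probs, idx = probs[:top_k], idx[:top_k]
    if renormalize_between:
        probs = probs / probs.sum()
    keep = torch.cumsum(probs, dim = -1) - probs < top_p   # tokens whose preceding mass is below top_p
    probs, idx = probs[keep], idx[keep]
    return idx, probs / probs.sum()

probs = torch.softmax(torch.randn(32000) / 0.7, dim = -1)
print(len(truncate(probs, 40, 0.5, renormalize_between = False)[0]))   # keeps more (or equally many) tokens
print(len(truncate(probs, 40, 0.5, renormalize_between = True)[0]))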

Panchovix commented 1 year ago

I was skeptical that FP32 would explain that kind of a difference in output, so I looked into it, and it's just a sampling difference. For exllama, top_p sampling inherits the base probabilities, from before top_k was applied. That means the same value of top_p is potentially a lot more strict in transformers, where probabilities are normalized at each step.

To confirm, I was able to get the same subjective level of output quality simply by adding one line after if top_p > 0.0: in generator.py to rescale probabilities.

top_probs = top_probs / torch.sum(top_probs, dim = -1)

This could explain some output differences reported here.

Just tried this and indeed gets the same level of output quality. Nice catch!

turboderp commented 1 year ago

Well... I'm not fond of making the sampling parameters interdependent in this way. But I guess it doesn't matter all that much since it's kind of all trial-and-error anyway, getting those parameters just right for subjectively satisfying output. So I've pushed an update to normalize the distribution after each sampler.

Larryvrh commented 1 year ago

I wonder whether it would be possible to utilize the original decoding pipeline from Hugging Face Transformers to get more "aligned" output text, given that the logits are most likely similar?

turboderp commented 1 year ago

It would be possible, I assume. KoboldAI does a similar thing to plug ExLlama's logits into the same samplers used for other implementations. I think it's much too limiting, though.

QM60 commented 1 year ago

Man... it's better, but I swear something's still off. I switched to Kobold just to rule out the tokenizer and samplers (which are the same there), but exllama still has seizures every so often. A few highlights:

This doesn't seem likely to be a sampler issue, and I never saw things like that from GPTQ with the same model/settings. It's much better than before, but I'd still view the current forward pass with a bit of suspicion atm...

turboderp commented 1 year ago

Are these still with extremely high temperature?

turboderp commented 1 year ago

Suppressing the EOS token might be an instance where the model is extra sensitive to noise, or rounding errors from the FP16 math. If the model is 99.9% sure it wants to emit an EOS token, picking from the remaining 0.1% instead might be asking for trouble.

I'm thinking maybe it's more correct to lower the temperature by some amount proportional to the likelihood of the EOS token, or any other token that gets masked out. Maybe the distribution just ends up being really flat otherwise.

It's interesting that I've never really seen this myself, though.

QM60 commented 1 year ago

For what it's worth, I've never seen Kobold suppress EOS when called via its API. It doesn't do that for me; I have generations stop before max tokens have been reached quite often. And my results didn't come from "super high temperature" either; I believe it was about 0.7 with tfs 0.9. But still, sphinx-moth is a normal, popular preset despite the high temp, and as strict as it is with top-k and top-p, tiny FP16 rounding errors aren't likely to cause these kinds of issues.

Banning EOS can cause gibberish, although usually of a different kind. But the thing is, 90% of gens feel completely normal, and the seizures are sudden and occur at random places in the middle of the text, not at natural stopping points. (The contraction issue is a clear example of that.)

I have no idea how to debug this, though, if the results are nondeterministic due to concurrency. Except maybe fuzzing and comparing logits with transformers output within a tolerance. But that would only detect the bugs, not find them.

turboderp commented 1 year ago

Fuzzing is probably the way to go. Perplexity tests aren't revealing anything out of the ordinary, but then they wouldn't if it's an intermittent thing that just corrupts a few numbers under some special conditions. That wouldn't affect the average over thousands of tokens.

I think I'll set up a script to run the same inference in ExLlama and GPTQ-for-LLaMa and look for anywhere the logits deviate considerably between the two. That should at least let me reproduce the error.
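Something along these lines, where exllama_logits and gptq_logits are hypothetical wrappers around the two forward passes and the tolerance is arbitrary:

import torch

def compare_logits(exllama_logits, gptq_logits, input_ids, tol = 0.5):
    a = exllama_logits(input_ids).float()            # expected shape: (1, seq_len, vocab)
    b = gptq_logits(input_ids).float()
    diff = (a - b).abs().max(dim = -1).values        # worst-case deviation per position
    for pos in (diff > tol).nonzero(as_tuple = True)[1].tolist():
        print(f"position {pos}: max logit difference {diff[0, pos].item():.3f}")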

QM60 commented 1 year ago

For what it's worth, so far, ooba's exllama_hf adapter is giving me flawless output, or at least, indistinguishable from transformers. Not terribly surprising since I think it should have identical tokenizer and sampler behavior.

In summary, what I noticed:

The forward pass might be perfectly fine after all.

turboderp commented 1 year ago

Well, the forward pass is mutating all the time, and there definitely were some issues with it that have been resolved.

The comparison between the HF samplers and the ones in ExLlama's generator is kind of apples-to-oranges. Without some concrete examples of generations that go "wrong" in ExLlama when they shouldn't, it's really hard to do anything about it. All I get out of that is basically "I tried strumming my bass the same way I strum my guitar, and it doesn't sound the same, so I guess my bass is broken." There could absolutely be bugs in the implementation, but it's impossible to find them based on a subjective sense of something being off.

I could just keep making it more and more identical to Transformers, but the whole point of the project wasn't to create a Transformers-compatible plugin for Ooba and Kobold, it was to build an alternative platform for experimenting with techniques that don't fit well in Transformers.

The tokenizer still shouldn't behave differently, though. With regards to BOS tokens, yes, but that's a separate issue of figuring out what to do about all the incorrect tokenizer configs floating around on HF.

QM60 commented 1 year ago

Was not complaining about samplers, just noting my experiences. I don't expect exllama's built-in samplers to be identical at all! If it needs new presets, that's fine. Swapping in known samplers is a very useful way to control for differences, though. And that's important, since it suggests that any complaints you're getting probably aren't due to bugs in the forward pass; at least, not anymore. That's good!

The tokenizer definitely does behave differently though, because it's just wrapping sentencepiece, which explicitly says it doesn't parse special tokens like </s> (by design). HF's tokenizer does parse them. That's not wrong per se. It might even be desirable for someone writing their own backend, since they could insert separator tokens manually, and know that users can't inject them. But it's different!