turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Very poor output quality #47

Open calebmor460 opened 1 year ago

calebmor460 commented 1 year ago

I have noticed that while it massively increases inference speed, it massively decreases the quality of the outputs: instruct models become very obstinate and give completely irrelevant responses, words get misspelled, lines repeat over and over, and it sometimes spams Chinese characters.

turboderp commented 1 year ago

I'm not taking it as complaints, don't get me wrong. I also don't want to doubt people when they say the output is bad; in this case, for instance, there was a bug causing the forward pass to run slightly incorrectly, and it's good that I found it. And I don't blame people for not having any other way to determine whether the output is off than comparing it to other implementations, because I don't either. It's a black box, after all. I just sometimes wish the feedback came more in the form of: "here's the output I got, here's (exactly) how I got it, and here's why I think it's wrong."

But yeah, it's not to be confrontational or anything, I just wish there was a better way of communicating that I don't view Transformers as the standard. Maybe I just need an FAQ. :)

As for the tokenizer, I do plan to add support for special tokens. For models that rely on them it gets messy otherwise: the tokens have to be inserted after encoding and stripped out again when decoding.
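For concreteness, a minimal sketch of the kind of manual handling described above, assuming a standard Llama SentencePiece tokenizer; the BOS/EOS IDs (1 and 2) and the filename are assumptions, not values taken from exllama:

```python
# A sketch of manual special-token handling around a SentencePiece tokenizer.
# BOS_ID/EOS_ID follow the usual Llama convention (1 and 2); treat them and
# the model filename as assumptions.
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor(model_file="tokenizer.model")
BOS_ID, EOS_ID = 1, 2

def encode_with_specials(text: str) -> list[int]:
    # The special tokens are spliced in after encoding the plain text...
    return [BOS_ID] + sp.encode(text) + [EOS_ID]

def decode_without_specials(ids: list[int]) -> str:
    # ...and stripped out again before decoding.
    return sp.decode([t for t in ids if t not in (BOS_ID, EOS_ID)])
```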

turboderp commented 1 year ago

So... how do things stand with the special tokens?

Having studied it a bit, I get the impression that the authors of SentencePiece go out of their way to explain that control symbols are categorically invalid inputs to the encoder. Meanwhile, Transformers implements a very elaborate workaround so they can be encoded anyway. I imagine those two teams don't like each other very much.

Anyway, it wouldn't be too difficult to emulate what Transformers does, but it would be kind of messy so I'm wondering how many models actually include control symbols in their prompt format. Is it unique to Wizard-Vicuna, or is it more common than that?
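For what it's worth, the emulation could look roughly like the sketch below: split the prompt on the control-symbol strings, encode the plain pieces, and splice the control IDs in directly. The SPECIALS mapping here is a placeholder, not the actual Transformers logic:

```python
import re
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor(model_file="tokenizer.model")

# Placeholder mapping of control-symbol strings to token IDs; a real
# implementation would pull these from the model's tokenizer/config files.
SPECIALS = {"<s>": 1, "</s>": 2}

def encode_with_control_symbols(prompt: str) -> list[int]:
    # Split the prompt on the control-symbol strings, encode the plain pieces,
    # and splice the control IDs in directly, since the SentencePiece encoder
    # won't accept the symbols themselves.
    pattern = "(" + "|".join(re.escape(s) for s in SPECIALS) + ")"
    ids: list[int] = []
    for piece in re.split(pattern, prompt):
        if piece in SPECIALS:
            ids.append(SPECIALS[piece])
        elif piece:
            ids.extend(sp.encode(piece))
    return ids
```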

oobabooga commented 1 year ago

OpenAssistant also has special tokens like <|endoftext|> that were manually added to the tokenizer. See here: https://huggingface.co/OpenAssistant/oasst-rlhf-2-llama-30b-7k-steps-xor/blob/main/oasst-rlhf-2-llama-30b-7k-steps-xor/added_tokens.json
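For reference, added_tokens.json is just a string-to-ID map layered on top of the base 32000-token Llama vocabulary. A rough sketch of consuming such a file; the example names and IDs in the comment are illustrative, not copied from the linked file:

```python
import json

# added_tokens.json maps token strings to IDs appended after the base
# 32000-entry Llama vocabulary, e.g. {"<|endoftext|>": 32002, ...}
# (exact names/IDs vary per model).
with open("added_tokens.json") as f:
    added_tokens = json.load(f)

# Fold them into whatever special-token table the encoder consults:
SPECIALS = {"<s>": 1, "</s>": 2}
SPECIALS.update(added_tokens)
```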

turboderp commented 1 year ago

I see. That's an XOR release, but I assume some of the full releases I'm looking at are true enough to the original. It doesn't look like the tokenizer model is changed at all, so it's really all just spread across four different config files, contradictions and all. Transformers is quite the framework...

Anyway, this makes me wonder if text-generation-webui makes any attempt at sanitizing user input, or if that's maybe just me overthinking things.
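On the sanitization question, one minimal approach (a sketch only, not what text-generation-webui actually does) would be to strip literal special-token strings out of user text before it is placed into the prompt template:

```python
import re

# Hypothetical list of control-symbol strings used by a prompt format.
SPECIAL_STRINGS = ["<s>", "</s>", "<|endoftext|>", "<|prompter|>", "<|assistant|>"]

def sanitize_user_input(text: str) -> str:
    # Remove literal special-token strings so a user can't inject sequence
    # boundaries or role switches into the assembled prompt.
    pattern = "|".join(re.escape(s) for s in SPECIAL_STRINGS)
    return re.sub(pattern, "", text)
```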

pi6am commented 1 year ago

I believe I have a lead on this part of the issue:

Kobold's exllama = random seizures/outbursts, as mentioned

I managed to reproduce the problem with logging enabled and observed the following generation:

    id        gen    RepetitionP       TopP    softmax    token
 29892    24.8125        22.5628    22.5628          1    [,]
  1183    18.3281        16.7737       -inf          0    [ she]
   310    15.3672        14.1284       -inf          0    [ of]
   322    15.0312        13.7184       -inf          0    [ and]
 20265    1.68555        1.68555       -inf          0    [ Bened]
Selected: 20265 [ Bened]

Despite the comma being the only token with non-zero probability, KAI selected the Bened token instead. The cause is a bug in torch.multinomial that has reportedly been fixed upstream but not yet released (i.e., it is still present in PyTorch 2.0.1). This bug sometimes causes the multinomial function to select items with zero weight.
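A quick way to probe for this behaviour on a given PyTorch build is to hammer torch.multinomial with a distribution that has a single non-zero entry and count bad draws. This is only a sketch; depending on how the bug manifests (e.g. only on certain devices or shapes), it may not trigger even on an affected version:

```python
import torch

# Probe for the sampling bug: all probability mass on one token, as in the
# log above. On a fixed PyTorch build this should print 0.
probs = torch.zeros(32000)
probs[29892] = 1.0

bad_draws = 0
for _ in range(10000):
    idx = torch.multinomial(probs, num_samples=1).item()
    if probs[idx] == 0:
        bad_draws += 1
print("zero-probability tokens sampled:", bad_draws)
```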

I worked around this bug in the KoboldAI exllama backend by checking the selected tokens and resampling whenever a zero-probability token is chosen. I've verified with logging that this avoids the issue, and in testing it seems to have solved all the problems with poor output quality.
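A minimal sketch of that kind of rejection loop (not the actual KoboldAI patch) could look like this:

```python
import torch

def sample_without_zeros(probs: torch.Tensor, max_tries: int = 8) -> int:
    # probs: 1-D tensor of token probabilities after top-p/repetition filtering.
    # Resample whenever the drawn token has zero probability; fall back to
    # argmax if we somehow keep getting bad draws.
    for _ in range(max_tries):
        idx = int(torch.multinomial(probs, num_samples=1).item())
        if probs[idx] > 0:
            return idx
    return int(probs.argmax().item())
```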