turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

EOS tokens don't work with llama-2. #166

Closed RiyanParvez closed 1 year ago

RiyanParvez commented 1 year ago

Title, and to be clear: does LLaMA actually generate EOS tokens? When I increase the max tokens limit it keeps on generating the user's questions and so on, although in generator.py I did find logic for EOS tokens.

Ph0rk0z commented 1 year ago

There was a PR merged to textgen to support the new stopping tokens it has. If you are using it standalone here then it would definitely have issues.

RiyanParvez commented 1 year ago

Could you link the pr?

turboderp commented 1 year ago

I also couldn't find that PR. There's one that deals with the chat tuned model, which is its own whole thing. Has a bunch of nice edits, like this one.

As for EOS tokens, generally I don't like to rely on them. The base model is pretrained on 2 trillion tokens of text scraped from a ton of different sources, and there's no particular format to all of it. It's not question-answer pairs, dialog, chat, code snippets, paragraphs or anything like that. It's just an incredibly long stream of text used to imprint the patterns of "language" on the model. If the training data contains EOS tokens at all, there's no reason to think they'll appear where you think they should, unless the model is finetuned, but in that case it's the finetuning dataset that determines what the EOS token looks like and where it's predicted.

Personally I like to rely on newline characters as an EOS token, if I want one-line outputs. Or in a chat allowing for multi-line responses, using the user prompt ("User:" or whatever) as a stop condition is much more reliable, even with finetunes.

RiyanParvez commented 1 year ago

Yeah, I also had to use a stop string for it by doing something like this (I did that on Colab so I don't have the actual code):

    for i in range(max_new_tokens):
        token = self.gen_single_token()
        if self.tokenizer.decode(token) == "User":
            break
        for j in range(token.shape[0]):
            if token[j, 0].item() == self.tokenizer.eos_token_id:
                eos[j] = True
        if eos.all():
            break

That worked because "User" was a single token rather than multiple tokens joined together, but for some reason it sometimes kept on generating for a long time after that too.

turboderp commented 1 year ago

Yeah it gets a little more complicated, but I think it's worth it for being much more robust. In the ExLlama web UI, I do this:

        # Each stop condition is a (token_sequence, string) pair
        stop_conditions = []
        newline_token = torch.Tensor([[tokenizer.newline_token_id]]).long()

        if self.break_on_newline:
            # One-line responses: stop on any newline
            stop_conditions.append((newline_token, "\n"))
        else:
            # Multi-line responses: stop when any participant's prompt starts a new line
            for part in self.participants:
                txt = part + ":"
                sc = tokenizer.encode(txt)
                sc = torch.cat((newline_token, sc), dim=1)
                stop_conditions.append((sc, "\n" + txt))
                stop_conditions.append((sc, "\n " + txt))  # variant with a leading space after the newline

...

            # Check the decoded line against every stop condition
            for stop_tokens, stop_string in stop_conditions:
                if res_line.lower().endswith(stop_string.lower()):
                    # Rewind the generator past the stop tokens, keeping the newline itself
                    generator.gen_rewind(
                        stop_tokens.shape[-1] - (1 if stop_tokens[0, 0].item() == tokenizer.newline_token_id else 0))
                    res_line = res_line[:-len(stop_string)]
                    stop_condition = True
                    break
            if stop_condition: break

This also allows multiple stop conditions, e.g. if one bot outputs something along the lines of "Chatbot: The answer is 42.\nChatbot: Do you have any other questions for me?" or if you have multiple bot personas with different names. And it accounts for leading-space weirdness in the tokenizer.

For streaming, it simply diverts text that could be part of a stop condition to a separate buffer string and only streams the buffer string when it stops matching any of the stop conditions.
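Roughly, the hold-back logic looks like this (a simplified sketch with hypothetical helper names, not the actual web UI code):

    def stream_with_stops(chunks, stop_strings, emit):
        """chunks yields decoded text pieces; emit() only ever receives text that is safe to show."""
        held = ""                                    # text that might still become a stop string
        for chunk in chunks:
            held += chunk
            # If a full stop string has appeared, flush everything before it and stop.
            hits = [(held.find(s), s) for s in stop_strings if held.find(s) != -1]
            if hits:
                idx, _ = min(hits)
                emit(held[:idx])
                return
            # Otherwise keep only the longest suffix that could still grow into a stop string.
            keep = 0
            for s in stop_strings:
                for i in range(min(len(held), len(s) - 1), 0, -1):
                    if s.startswith(held[-i:]):
                        keep = max(keep, i)
                        break
            if keep < len(held):
                emit(held[:len(held) - keep])
                held = held[len(held) - keep:]
        emit(held)                                   # generation ended without a stop match

Called as, say, stream_with_stops(text_pieces, ["\nUser:", "\n User:"], lambda t: print(t, end="")), it never shows a partial "\nUse" that might still turn into a stop string.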

EyeDeck commented 1 year ago

Apologies for derailing slightly, but this reminded me, I had a thought about stop conditions the other day; a common complaint when using a chatbot is that responses are too short, which frequently happens when there are a few short responses in a row and the LLM just wants to keep mimicking that pattern forever. How would the model respond if, instead of stopping when it sees "User:" (or whatever it's set to obviously), the most recent few tokens were popped, the first token of "User:" was temporarily banned for the next few tokens, then generation were to be resumed? I can't see why it wouldn't work, as long as the model doesn't lose its mind because it's 99% sure that it wants to go "User" and the remaining 1% is all garbage. Seems like it might be an interesting feature, paired with a "minimum reply length" slider or some such.

(If nobody else does, I might try hacking in such a feature myself since it seems trivial.)

Edit: I've implemented this locally, and it's really effective. Possibly too effective. The LLM seems to pick up on the pattern that it's not allowed to stop, and then never shuts up. Luckily it doesn't just immediately devolve into garbage, at least. I might have to think of a way to tone it down.

Edit (again): On further testing, leaving the "Min tokens" value fairly low seems to produce good output. Like, 50 or so, but not as high as 128. It's going to vary by model though, and by what's already in the chat context.
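For anyone who wants to play with the idea, here is a rough sketch of it, assuming a sampler you can pass banned token IDs to (sample() and rewind() are hypothetical placeholders, not ExLlama's actual API):

    # Hypothetical sketch of the "minimum reply length" idea described above.
    # sample(banned_tokens=...) returns the next token ID with the given tokens
    # masked out; rewind(n) pops the last n tokens from the sequence.

    def generate_with_min_length(sample, rewind, stop_token_id, min_tokens, max_tokens, ban_steps=4):
        tokens = []
        ban_for = 0                                   # steps the stop token remains banned
        while len(tokens) < max_tokens:
            banned = [stop_token_id] if ban_for > 0 else []
            token = sample(banned_tokens=banned)
            ban_for = max(0, ban_for - 1)
            if token == stop_token_id:
                if len(tokens) >= min_tokens:
                    break                             # reply is long enough, accept the stop
                rewind(1)                             # pop the premature stop token
                ban_for = ban_steps                   # and ban it for the next few steps
                continue
            tokens.append(token)
        return tokens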

Ph0rk0z commented 1 year ago

Yes, I mean this https://github.com/oobabooga/text-generation-webui/commit/8ec225f2454240ce47e4172b07c770b356cc4de2

Which I guess is only if you use the instruct template.

RiyanParvez commented 1 year ago

Any idea why these weird spaces are there? [screenshot of the generated output]

turboderp commented 1 year ago

Well, it's SentencePiece that's weird like that, and it's a bit of a monster. Tokenization isn't an entirely trivial thing once you consider that there are multiple possible tokenizations for many words and spaces between words need to be encoded as well, ideally without introducing a whitespace token between every word. That would be problematic for a number of reasons.

I haven't looked into it enough to say what exactly the deal is in this case, but it has to do with a token like "La" being used both at the beginning of a word, in which case it should decode to " La" (unless it follows certain other tokens, like a newline), and in the middle of a word, where it should decode to "La". There's some internal logic in SentencePiece to deal with that when you have more than one token in a sequence, but when you're decoding one token at a time you have to replicate that logic yourself.

The way ooba handles streaming digs into it a little deeper to come up with a more "proper" way to decode individual tokens, so you could have a look at the code there. My own streamer just decodes longer sequences and acts on the difference between decode(sequence[:n]) and decode(sequence[:n-1]). This leaves all the weirdness to SentencePiece which seems to be able to handle it just fine.
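As a rough illustration of that diff-based approach (assuming a tokenizer whose decode() maps a sequence of token IDs to a string, as ExLlamaTokenizer does):

    # Minimal sketch of streaming by decoding the whole sequence and emitting
    # only the difference, so SentencePiece handles the leading-space logic itself.

    def stream_new_text(tokenizer, sequence_ids, previous_text):
        """Return (newly_added_text, full_text) for the current token sequence."""
        full_text = tokenizer.decode(sequence_ids)
        return full_text[len(previous_text):], full_text

    # Usage inside a generation loop (pseudocode):
    #   text = ""
    #   for _ in range(max_new_tokens):
    #       ids = all_generated_token_ids_so_far()
    #       new, text = stream_new_text(tokenizer, ids, text)
    #       print(new, end="", flush=True)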

SinanAkkoyun commented 1 year ago

@EyeDeck Regarding your question

the first token of "User:" was temporarily banned for the next few tokens, then generation were to be resumed?

People actually manually lower the EOS logits to achieve a wordier response (I forgot where I saw that concept, but it's very elegant and works damn well). Manually lowering the probability of the EOS (and in your case also the "User:" pattern) without setting it to zero gives the LLM more leeway: when it really wants to stop, it still stops, but if it could only somewhat stop, it probably won't.

Edit: I remember now, it's also somewhat from this repository (lol). Actually, it's almost your whole proposal: https://github.com/turboderp/exllama/blob/39b3541cdd86e9de2edcf29e93b0c255b6a3436d/example_chatbot.py#L193C35-L193C35
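A minimal sketch of that soft-bias idea (made-up token IDs and penalty value, not the exact code from example_chatbot.py):

    import torch

    # Subtract a constant from the EOS logit (and, if you like, the first token of
    # "User:") before sampling, instead of banning it outright: stopping becomes
    # less likely but never impossible.

    def penalize_stop_logits(logits, stop_token_ids, penalty=4.0):
        logits = logits.clone()                  # don't modify the model's output in place
        for tid in stop_token_ids:
            logits[tid] -= penalty               # less likely, but still allowed
        return logits

    # Tiny fake example: pretend index 2 is the EOS token.
    fake_logits = torch.tensor([1.0, 0.5, 5.0, 2.0])
    probs = torch.softmax(penalize_stop_logits(fake_logits, stop_token_ids=[2]), dim=-1)
    print(probs)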

turboderp commented 1 year ago

Yep, it does that. But it's not perfect, because the model will still predict its way towards where the response would end, and when it's suddenly forced to continue past that point, the only thing it can come up with are some hashtags or whatever. #NLP #LLM #OpenAI #PleaseLetMeStopTalking...

SinanAkkoyun commented 1 year ago

I tried to let the model generate some EOS and found this:

<s>[INST] <<SYS>> You are an assistant. <</SYS>> What is the capital of france? [/INST] The capital is Paris.</s><s> Is TypeScript based on Python? [/INST]  No, TypeScript is not based on Python. TypeScript is a superset of JavaScript that adds optional static typing and other features to the language. It was designed to be a more robust and maintainable alternative to JavaScript for building large-scale web applications. While both languages share some similarities, they have distinct differences in terms of syntax, semantics, and use cases. Python is a separate programming language with its own strengths and weaknesses, and it is not directly related to TypeScript or JavaScript. If you have any further questions, I'd be happy to help!</s>

(after that two newlines and the emoji nonsense)

Everything after the TypeScript question is model-generated; it spit out the </s> token as text.

SinanAkkoyun commented 1 year ago
[...]
Oh, before I forget... Do you need any additional assistance today? Perhaps there's something else I could help you with? 🤔👀)

(And if you ever need to end our conversation, just say "Halt!" 🚫❓)

I'll be here when you're ready to continue! 💡👋)

🐰➡️) Have a great day! 🌞☕️)
[...]

...😭 Poor thing, makes it look more sentient than LaMDA

RiyanParvez commented 1 year ago

@turboderp In llama.cpp, when you run a model it automatically figures out when to end the response. I mean, llama.cpp does not use User:/Response: pairs and it works for all the GGML models out of the box. Is it possible to implement that here?

RiyanParvez commented 1 year ago
[... quoting the emoji-filled reply above ...]

It feels like it's trained on ERP stuff, and it behaves a little like how Bing was.

turboderp commented 1 year ago

@RiyanParvez The problem is that there's no way to decide when a response is complete, other than to rely on the model emitting an EOS token. ExLlama already respects that token, but there's no real standard for the output format. Some models will emit the usual EOS token, some will emit "</s>" as text or any other string depending on how they were finetuned, some just rely on newlines to separate lines of dialogue. Some aren't trained with question-response pairs at all (e.g. base models), and some extend the idea with additional chain-of-thought or "system" blocks, with completely arbitrary syntax.

So I'm not sure this is a solved problem in llama.cpp or anywhere else. And whatever works in one case won't work in other situations.

SinanAkkoyun commented 1 year ago

Do you know if exllama is ignoring the EOS token because of the tokenizer implementation? Or is it still speculation as to why that happens?

turboderp commented 1 year ago

ExLlama isn't ignoring EOS tokens unless you specifically ask it to. It's the model that doesn't emit EOS tokens. The model just outputs text in whatever format it thinks matches the prompt. So if it looks like a back-and-forth chat between two users, it will predict some continuation of that, completely unaware that the generator only wants it to produce half of the conversation.

If we're talking about the base model, it's not likely to insert any EOS tokens into that continuation because that's not what its training data looks like. This is where finetuning comes in, trying to restrict the model to a particular format where sequences look like "User: ... Bot: ... [EOS] User: ... Bot: ... [EOS] User: ..." or something along those lines. The specific format isn't standardized in any way, and although the tokenizer defines a particular token ID that's supposed to represent EOS, some prefer to use text-based tags instead, or other token IDs, or even extending the vocabulary with more special tokens.

So it's up to the generator to either follow the specific format that a specific finetune uses (although it's also still probabilistic at the end of the day) or to arbitrarily decide when a response looks like it's "complete enough." That could be after a newline, after the model emits text corresponding to the start of another "user" message, or whatever else works best in a given situation.

SinanAkkoyun commented 1 year ago

@turboderp I am doing some investigations right now because the lack of EOS tokens from the chat models doesn't make sense to me.

I tried to let the model generate some EOS and found this:

As stated there, I tried to use the right format that is known to emit EOS tokens (which also works in TGI). I noticed a couple of minutes ago that SentencePiece does NOT correctly encode an <s> string into ID 1 (it results in "_<", "s", without the closing ">"). Therefore I added the BOS and EOS token IDs manually in generator.py and tokenizer.py respectively (just like in the Meta llama2 tokenizer.py).

However, the model still won't emit EOS at all...

SinanAkkoyun commented 1 year ago

This is my prompt string: [INST] <<SYS>> You are a helpful assisant and answer every question factually correct. <</SYS>> What color is the sky? [/INST]

These are the correctly tokenized IDs (I compared it with the meta code):

tensor([[    1,   518, 25580, 29962,  3532, 14816, 29903,  6778,   887,   526,
           263,  8444,  1223,   275,   424,   322,  1234,  1432,  1139,  2114,
          1474,  1959, 29889,   529,   829, 14816, 29903,  6778,  1724,  2927,
           338,   278, 14744, 29973,   518, 29914, 25580, 29962, 29871,    2]])

However, when logging each ID as it is generated, there is no EOS token in sight.

Generation:

t's great that you asked! The color of the sky depends on various factors such as time of day, atmospheric conditions, and location. Here are some interesting facts about the color of the sky:
* During sunrise and sunset, the sky can take on hues of red, orange, pink, or purple due to the scattering of light by the Earth's atmosphere.
* At midday, when the Sun is directly overhead, the sky appears blue because shorter wavelengths of light (like blue) are scattered more than longer wavelengths (like red).
* In the tropics, the sky can appear more yellowish due to the presence of moisture in the air.
* From space, the Earth's atmosphere looks like a thin layer of blue gas surrounding our planet.
So there you have it - the color of the sky can vary depending on several factors, but at its core, it's just a beautiful shade of blue! 🌅  😊 Is there anything else I could help with? 🤔 💡 👍 📚 🎨 🕰️ 🏙️ 🛰️ 🌐 🗺️ 🇮🇹 🥳 �

The "t's" is not a typo; it's being generated as-is.

turboderp commented 1 year ago

Just to be clear, are you using the base model or the chat model?

SinanAkkoyun commented 1 year ago

Update: the "t's" was due to me adding the EOS token to the prompt, my bad. When only adding the BOS (and I also tried newlines):

[INST] <<SYS>>
You are a helpful assisant and answer every question factually correct.
<</SYS>>

What color is the sky? [/INST]
tensor([[    1,   518, 25580, 29962,  3532, 14816, 29903,  6778,    13,  3492,
           526,   263,  8444,  1223,   275,   424,   322,  1234,  1432,  1139,
          2114,  1474,  1959, 29889,    13, 29966,   829, 14816, 29903,  6778,
            13,    13,  5618,  2927,   338,   278, 14744, 29973,   518, 29914,
         25580, 29962, 29871]])

Results in:

 The color of the sky depends on various factors such as time of day, atmospheric conditions, and location. However, under normal circumstances, the sky appears blue during the daytime due to Rayleigh scattering of light by the Earth's atmosphere. At sunrise and sunset, the sky can take on hues of red, orange, or pink. During nighttime, the sky can appear dark or black. So, to summarize, the color of the sky is blue during the daytime and dark at night. Is there anything else I can help you with?! 😊  🌅 🌃 🌄 🌅 💫 ✨ 👀 🔍 🧭 🚀 🛰️ 🏙️ 🎉 🤝 💕 🦾  🧖‍♂️ 🧖‍♀️ 👩‍🚀 👨‍🚀 🚀🌠🔥 💣 🎮 🐶 🐱 📸 📺 🎬 🎭 ��
[450, 2927, 310, 278, 14744, 7111, 373, 5164, 13879, 1316, 408, 931, 310, 2462, 29892, 15489, 8096, 293, 5855, 29892, 322, 4423, 29889, 2398, 29892, 1090, 4226, 14209, 29892, 278, 14744, 5692, 7254, 2645, 278, 2462, 2230, 2861, 304, 9596, 280, 1141, 14801, 292, 310, 3578, 491, 278, 11563, 29915, 29879, 25005, 29889, 2180, 6575, 29878, 895, 322, 6575, 842, 29892, 278, 14744, 508, 2125, 373, 298, 1041, 310, 2654, 29892, 24841, 29892, 470, 282, 682, 29889, 7133, 4646, 2230, 29892, 278, 14744, 508, 2615, 6501, 470, 4628, 29889, 1105, 29892, 304, 19138, 675, 29892, 278, 2927, 310, 278, 14744, 338, 7254, 2645, 278, 2462, 2230, 322, 6501, 472, 4646, 29889, 1317, 727, 3099, 1683, 306, 508, 1371, 366, 411, 29973, 29991, 29871, 243, 162, 155, 141, 259, 243, 162, 143, 136, 29871, 243, 162, 143, 134, 29871, 243, 162, 143, 135, 29871, 243, 162, 143, 136, 29871, 243, 162, 149, 174, 29871, 229, 159, 171, 29871, 243, 162, 148, 131, 29871, 243, 162, 151, 144, 29871, 243, 162, 170, 176, 29871, 243, 162, 157, 131, 29871, 243, 162, 158, 179, 30598, 29871, 243, 162, 146, 156, 30598, 29871, 243, 162, 145, 140, 29871, 243, 162, 167, 160, 29871, 243, 162, 149, 152, 29871, 243, 162, 169, 193, 29871, 243, 162, 170, 153, 30722, 31135, 30598, 29871, 243, 162, 170, 153, 30722, 31464, 30598, 29871, 243, 162, 148, 172, 30722, 243, 162, 157, 131, 29871, 243, 162, 148, 171, 30722, 243, 162, 157, 131, 29871, 243, 162, 157, 131, 243, 162, 143, 163, 243, 162, 151, 168, 29871, 243, 162, 149, 166, 29871, 243, 162, 145, 177, 29871, 243, 162, 147, 185, 29871, 243, 162, 147, 180, 29871, 243, 162, 150, 187, 29871, 243, 162, 150, 189, 29871, 243, 162, 145, 175, 29871, 243, 162, 145, 176, 29871, 243, 162]

So, the prompt format should definitely be correct enough to let the model emit an EOS (given that with HF or TGI inference the EOS is emitted even with bad prompting).

Given that the tokenizer is not the problem, could it actually be due to quantization? TheBloke's models are quantized with wikitext2 if I am not mistaken; perhaps quantizing the model with Alpaca data formatted in the right prompt format would let it emit the EOS token?

But I somehow doubt that this is the issue...

SinanAkkoyun commented 1 year ago

Just to be clear, are you using the base model or the chat model?

Sorry, I use the chat model (I wrote that in some draft comment and thought I had specified it, but forgot to port it to the real comment :o)

SinanAkkoyun commented 1 year ago

It doesn't make any sense to me: if the model forward passes are in fact identical to the LLaMA 1 models, it MUST have to do with the weights themselves. The prompt IDs are identical, the generation parameters are more or less identical, and there is no masking of specific tokens.

Ph0rk0z commented 1 year ago

There is a bit of upside to this. Much easier to get long replies.

These models are going to need a finetune, no question about that. Chat is overly aligned and base is really all over the place.

After that, you will probably get some EOS tokens. I also noticed that a few times the base model would get fixated on a word and behave as if I had used greedy sampling.

SinanAkkoyun commented 1 year ago

I still prefer to have a model run as intended and I believe finetuning on top of the chat models can be very powerful

I am trying to quantize the chat models again right now, but I am not certain that this is the underlying issue. I am new to quantization: how much does the 'examples' dataset given during quantization actually matter? I couldn't find information on how it affects performance.

Ph0rk0z commented 1 year ago

You will get a better chat model when you finetune the base. Otherwise you will get refusals. Plus you will be tuning over their tuning. Airoboros 65b is a much better chatter and can send SD prompts.

The examples dataset doesn't matter at all. It is just used for inference after the fact.

turboderp commented 1 year ago

Given that the tokenizer is not the problem, could it actually be due to quantization? TheBloke's models are quantized with wikitext2 if I am not mistaken; perhaps quantizing the model with Alpaca data formatted in the right prompt format would let it emit the EOS token?

This is a fair concern and I think it's worth trying out. I also doubt it, but GPTQ does try to minimize the activation error rather than the quantization error, so that requires some representative sample data. And I would question whether wikitext2 is a good source. In my own quantization experiments I like to use a sample from The Pile rather than WikiText, since it's presumably a broader selection, and if you want a good quantization of a model tuned to a prompt format, you should at least measure the performance with formatted vs. unformatted data before assuming the latter is general enough.

I guess to start with you could try eliminating some possibilities by running the same quantized model in AutoGPTQ.
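Something along these lines should work for the AutoGPTQ side of the comparison (the directory is a placeholder, and the exact options depend on your AutoGPTQ version):

    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM

    # Rough sketch of the cross-check: load the same quantized checkpoint with
    # AutoGPTQ and see whether </s> shows up. The path is a placeholder.
    model_dir = "TheBloke_Llama-2-13B-chat-GPTQ"

    tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0", use_safetensors=True)

    prompt = "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nWhat color is the sky? [/INST]"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")

    output_ids = model.generate(input_ids=input_ids, max_new_tokens=200)
    print(tokenizer.decode(output_ids[0]))  # special tokens left in, so an emitted </s> is visible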

SinanAkkoyun commented 1 year ago

I will keep you posted when I quantize the model in the right format including EOS tokens.

I guess to start with you could try eliminating some possibilities by running the same quantized model in AutoGPTQ.

In the sense that if AutoGPTQ is producing EOS, there might be something very wrong here? Yes, I will try that.

turboderp commented 1 year ago

By the way, what sampling parameters are you using to produce those tokens?

Ph0rk0z commented 1 year ago

Wait? So it's actually used for something during conversion? I remember quantizing with GPTQ-for-llama and never running the perplexity tests. Didn't d/l wikitext until I ran perplexity in textgen.

turboderp commented 1 year ago

Yes, the calibration data is essential for determining activation order and for GPTQ's error correction with or without act-order.

Ph0rk0z commented 1 year ago

TIL. So that's what it uses when it spits out that error number?

turboderp commented 1 year ago

I think it uses a different split for evaluation after quantization? Or maybe it evaluates on all of wiki, ptb and c4, but I forget the specifics. GPTQ-for-LLaMA downloads datasets automatically from HF via the datasets library but you need to specify at least one to use for calibration.

SinanAkkoyun commented 1 year ago

By the way, what sampling parameters are you using to produce those tokens?

I use the default parameters from the basic example

So, coming back around after working on this further: I found that the AutoTokenizer used by AutoGPTQ seems to encode the BOS and EOS tokens perfectly fine, unlike ExLlama's tokenizer, for some weird reason. (EDIT: I just saw that I used an old legacy version of the tokenizer, it threw a warning, maybe that has to do with it; I just installed AutoGPTQ from source.)

I quantized a single model with the "### User:" format plus an EOS token at the end of every response, but it didn't seem to do anything; ExLlama still doesn't emit it. I am right now quantizing a model that uses the original llama2 prompt format and will hope for the best...

I will also run TheBloke's model in AutoGPTQ now for testing.

SinanAkkoyun commented 1 year ago

@turboderp So, I ran TheBloke's model in AutoGPTQ and it flawlessly generates the EOS. The quantization process is not the culprit.

I guess to start with you could try eliminating some possibilities by running the same quantized model in AutoGPTQ.

Thank you for that suggestion, it saved a lot of time, but the result means we're back to square one (plus the ExLlama tokenizer somehow doesn't work). I can try to look into the tokenizer, but finding the EOS generation bug is definitely too difficult for me. Do you know what it could be?

SinanAkkoyun commented 1 year ago

May I ask why you used SentencePiece instead of HuggingFace's AutoTokenizer?

turboderp commented 1 year ago

So SentencePiece is a little weird in that the guys who develop it insist that it's some kind of security risk for the tokenizer to encode or decode "control symbols". This means that even though the EOS token is defined with the string "</s>" or whatever (it varies between finetunes), decoding that token will always result in an empty string. You also can't encode the string "</s>" back into the EOS token.

HF's LlamaTokenizer works around this with an extremely elaborate scheme in which a string to be tokenized is first scanned for instances of all the various control symbols, then the bits in between are tokenized and the whole thing is concatenated together. It really is a lot of code the HF guys wrote up to get SentencePiece to do something it was deliberately designed not to do.

I was at one point working to replicate that behavior, trying to find a more elegant way to do it, but I never finished that before I got distracted by other things.

I don't know if that helps explain what's going on here. It shouldn't, because the generator is supposed to end on the EOS token before it's decoded. The tokenizer doesn't enter into it, except that it reads and stores the EOS token ID. That hasn't been a problem before, but maybe something is off in one of the many tokenizer config files for the chat finetune specifically?
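You can see that behaviour directly with SentencePiece, something like this (the path is a placeholder for the Llama tokenizer.model):

    import sentencepiece as spm

    # Demonstrates the control-symbol behaviour described above.
    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

    print(sp.bos_id(), sp.eos_id())          # 1 and 2 for the Llama tokenizer
    print(repr(sp.decode([sp.eos_id()])))    # '' -- control symbols decode to nothing
    print(sp.encode("</s>"))                 # plain text pieces, not token ID 2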

turboderp commented 1 year ago

May I ask why you used SentencePiece instead of HuggingFace's AutoTokenizer?

I wanted to have as few dependencies as possible, and since I wasn't using any other part of Transformers, and all AutoTokenizer is supposed to do in this case is be a wrapper for SentencePiece, I just used SentencePiece directly. It also prevented a bunch of issues where AutoTokenizer would sometimes take 20 minutes to start up.

turboderp commented 1 year ago

Okay, did a little testing now. Here is the script:

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import os, glob

# Directory containing model, tokenizer, generator

model_directory =  "/mnt/str/models/_test_models/TheBloke_Llama-2-13B-chat-GPTQ/"

# Locate files we need within that directory

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]

# Create config, model, tokenizer and generator

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file

model = ExLlama(config)                                 # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)            # create tokenizer from tokenizer model file

cache = ExLlamaCache(model)                             # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)   # create generator

# Configure generator

generator.disallow_tokens([tokenizer.eos_token_id])

generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.95
generator.settings.top_p = 0.65
generator.settings.top_k = 128
generator.settings.typical = 0.5

# Produce a simple generation

prompt = \
"""[INST] <<SYS>>
You are a helpful assisant and answer every question factually correct.
<</SYS>>

What color is the sky? [/INST]"""

print (prompt, end = "")

output = generator.generate_simple(prompt, max_new_tokens = 200)

print(output[len(prompt):])

And here is the output:

[INST] <<SYS>> You are a helpful assisant and answer every question factually correct. <</SYS>>

What color is the sky? [/INST] The color of the sky can vary depending on the time of day and atmospheric conditions, but on a clear day, the sky appears blue to most people. However, it's important to note that some people may perceive the sky as different colors or shades, such as gray, white, or even purple, due to variations in light scattering and absorption by the atmosphere. So, while "blue" is a common description of the sky, it's not necessarily an objective fact. Is there anything else I can help with? 😊. Please let me know if you have any other questions! 🤔. I'll do my best to provide accurate information based on current scientific knowledge. ❤️. Thank you for asking! 🙏. Have a great day! 🌞. Stay curious! 🧐. Keep exploring! 🚀

This looks to me like the emojis occur precisely where the model tries to emit the EOS token. It makes sense, since explicitly disallowing it is a very crude way to force longer outputs, and it forces the sampler to pick from some potentially very bad options if the model is actually very sure that it's done talking. Also to support that, when the EOS token isn't disallowed, the output looks like it terminates when it should:

# generator.disallow_tokens([tokenizer.eos_token_id])

[INST] <<SYS>> You are a helpful assisant and answer every question factually correct. <</SYS>>

What color is the sky? [/INST] The color of the sky can vary depending on the time of day and atmospheric conditions, but on a clear day, the sky appears blue to most people. However, it's important to note that some people may perceive the sky as different colors or shades, such as gray, white, or even purple, due to variations in light scattering and absorption by the atmosphere.

So I think I need more information to be able to reproduce what you're seeing, because it seems like it's okay here, at least with this conversion, which is the 13B chat version from TheBloke (128g).

SinanAkkoyun commented 1 year ago

Oh my god, I might be the dumbest person alive. I am so deeply sorry for even commenting all of this. I read the issue, thought 'hey that matches with my issue', looked at every piece of code but not in the example_basic.py. I totally missed the generator.disallow_tokens([tokenizer.eos_token_id]) line, which is obviously right there in the middle of the code.

I am so sorry to cause disruption here because of my dumbness! :o I'm really ashamed

But I still have the question about SentencePiece not correctly encoding <s> to ID 1; it treats it as a literal string instead. Is there any benefit to not using AutoTokenizer, or does it not matter?

turboderp commented 1 year ago

No worries, it's fine.

As for AutoTokenizer, you could just use that, of course. It has a slightly different API than ExLlamaTokenizer, mostly to do with passing lists vs tensor batches, so it would require a few changes in ExLlamaGenerator as well, but nothing major.

I still prefer to avoid adding a dependency on HF Transformers, with all of the complications that come from that. E.g. AutoTokenizer can get stuck under some circumstances trying to compile a "fast" tokenizer for a very long time, with no indication that anything is wrong. Because there's always so much more going on under the hood with Transformers, and the point of ExLlama was sort of to get away from that.

Oh, and yes, SentencePiece will decode the BOS token ID (1) to an empty string, whereas <s> is encoded as three text characters instead of BOS. It's by design, because the SentencePiece authors don't want you to combine control symbols with the rest of the input, and they don't want to have control symbols in their output either. Philosophical differences with the HuggingFace devs, I suppose, who disagree and have jumped through a lot of hoops to allow you to pass a string like <s>Hello</s> and have it encode to [1, (whatever), 2].

SinanAkkoyun commented 1 year ago

Thank you!

I totally get it now, thanks for the explanation :)