turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Increased context length with NTK Rope Scaling #158

Open juanps90 opened 1 year ago

juanps90 commented 1 year ago

I am having bad quality results with prompts longer than 2048 tokens with a LoRA trained with alpaca_lora_4bit.

These are the settings I am using:

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file
config.alpha_value = 2
config.max_seq_len = 4096

config.gpu_peer_fix = True
config.set_auto_map("10,24")

I tried higher values of alpha_value and max_seq_len, and also adjusted temperature and similar sampling settings, but it still fails. With this configuration, short sequences work fine with the LoRA, but longer sequences just output garbage, so it seems to be an issue with the extended context.

Panchovix commented 1 year ago

Above 2048 tokens you shouldn't have any issues up to ~3400 context with static NTK RoPE scaling; I use it like that on 65B at least.

As for the LoRA itself, I'm not sure. I have been using NTK scaling on base models and on an NTK-finetuned model (https://huggingface.co/bhenrym14/airoboros-33b-gpt4-1.4.1-NTK-16384-GPTQ)

Based on some tests at least, it should be like this.

[Chart: perplexity comparison of linear vs. NTK scaling across increasing context lengths]

Note that the "Linear" value is inverted on exllama (embedding compression, compress_pos_emb = 1/n)

juanps90 commented 1 year ago

I am using Neko-Institute-of-Science_LLaMA-30B-4bit-128g, which has no context-scaling training at all. As I understand it, NTK RoPE scaling does not require any finetuning, unlike SuperHOT.

Am I setting the NTK RoPE parameters correctly?

Update: Loading the LoRA with this model and switching from alpha_value to compress_pos_emb works a LOT better.

EyeDeck commented 1 year ago

I think you need to call config.calculate_rotary_embedding_base() with the current way RoPE NTK scaling is implemented for the settings to properly take effect. Make sure config.alpha_value is already set when you do.
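For reference, a minimal sketch of that ordering, assuming exllama's usual "from model import ExLlamaConfig" import and that calculate_rotary_embedding_base() derives the RoPE base from alpha_value (the formula in the comment reflects my understanding and may differ in detail from the actual code):

from model import ExLlamaConfig                         # exllama's config class

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file

config.alpha_value = 2                                  # set alpha first...
config.max_seq_len = 4096
config.calculate_rotary_embedding_base()                # ...then recompute the RoPE base,
                                                        # roughly base *= alpha ** (d / (d - 2)),
                                                        # with d = head dimension (128 for LLaMA)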

juanps90 commented 1 year ago

> I think you need to call config.calculate_rotary_embedding_base() with the current way RoPE NTK scaling is implemented for the settings to properly take effect. Make sure config.alpha_value is already set when you do.

Thanks a lot! Works wonders with the stock 30B LLaMA model!

juanps90 commented 1 year ago

I'm having a weird issue where it just skips or adds digits to numbers. For example, if there's a phone number in the prompt, the generated text may add another digit to it, or maybe skip one of the digits.

It's also displaying for example $1.6280 when it should display $1.628

Has anyone noticed this? The generated text looks solid but the numbers seem to be garbled.

Single- or double-digit numbers seem fine.

EyeDeck commented 1 year ago

I've seen that effect while running a linear-scaled LoRA (SuperHOT or Airoboros 8k or 16k) with the wrong compress_pos_emb value. If it's set to anything other than what it was trained on (typically 4 for 8k or 8 for 16k) it causes brain damage, which is usually fairly subtle except when numbers are involved, and then it almost always screws them up. Haven't seen that happen with NTK scaling though.
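For example, a hedged sketch of what loading a linear-scaled 8k finetune would look like; the factor of 4 is just the 8192/2048 ratio mentioned above, and the field names follow the config snippets earlier in this thread:

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file

config.compress_pos_emb = 4                             # must match the finetune's linear factor (8192 / 2048)
config.max_seq_len = 8192                               # the context length the finetune targets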

juanps90 commented 1 year ago

> I've seen that effect while running a linear-scaled LoRA (SuperHOT or Airoboros 8k or 16k) with the wrong compress_pos_emb value. If it's set to anything other than what it was trained on (typically 4 for 8k or 8 for 16k) it causes brain damage, which is usually fairly subtle except when numbers are involved, and then it almost always screws them up. Haven't seen that happen with NTK scaling though.

Thank you. I am using Neko LLaMA 30B with a LoRA trained on it. Using only alpha_value and no compress_pos_emb.

The results with NTK appear to be much better than PI, though it's having issues with numbers. Will try a different model and check the code once I'm back home.

juanps90 commented 1 year ago

Well, LLaMA v2 13B GPTQ from The-Bloke goes NUTS after I do:

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file

config.alpha_value = 2
config.compress_pos_emb = 1
config.max_seq_len = 8192
config.calculate_rotary_embedding_base()

If alpha_value = 1 and max_seq_len = 4096 (model's native length), the outputs are perfect with the LoRA applied.

EyeDeck commented 1 year ago

NTKv1 alpha=2 won't get you 2x context; try something like alpha=2.6. I picked that number arbitrarily and tested that it works; there's almost definitely a more optimal value >2 and <2.6, but you'd have to find it by trial and error.

NTK-by-parts (which the Transformers devs had proposed to shorthand as NTKv2, though that may now mean dynamic NTK-by-parts, which is what Transformers ultimately implemented) is supposed to correct this, so that a scaling value of 2 = 2x context, 4 = 4x, and so on. Anecdotally, I tried implementing the non-dynamic version in ExLlama: the code ran and perplexity tested a little lower, but actual generation was definitely a little off, with weird token repetition and other issues I never managed to debug. Turboderp is working on it though, and I'm much more confident in his ability than mine. #174
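To make the non-linearity concrete, here's a small sketch using the NTKv1-style base adjustment assumed above (base * alpha ** (d / (d - 2)), head dimension 128 for LLaMA); how much usable context each alpha actually buys still has to be found empirically:

base = 10000.0
head_dim = 128

for alpha in (1, 2, 2.6, 4):
    adjusted = base * alpha ** (head_dim / (head_dim - 2))
    print(f"alpha={alpha:<4} -> rotary base ~ {adjusted:,.0f}")

# alpha=2 roughly doubles the base, but that does not translate into exactly
# 2x usable context, which is why values like 2.6 end up being found by trial and error.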

juanps90 commented 1 year ago

I understand that alpha=2 should still allow for 4.5k or 5k token length (which it was failing to do), right? Also, I wonder what the relationship between alpha_value and max_seq_len is? Can you just do max_seq_len=8192 with alpha_value=2.6 or alpha_value=2.7 or any other number just like that?

I was under the impression (probably getting confused with cpe) that alpha_value * native_max_seq_len = max_seq_len (even if it would go off the rails with fewer tokens than max_seq_len), but from your message it seems that a value like 2.6 will also work, just with a usable maximum context length somewhere below max_seq_len?

EyeDeck commented 1 year ago

> I was under the impression (probably getting confused with cpe) that alpha_value * native_max_seq_len = max_seq_len

See the chart in https://github.com/turboderp/exllama/issues/158#issuecomment-1637195097: compress_pos_emb corresponds to "Linear", except it's inverted (1/n), and alpha_value corresponds to "NTK"; probably multiply everything by 2 for LLaMA v2. Also, I'm pretty sure that chart compared the same SuperHOT 8k finetune (or it might've been 13B 16k?) for all the "Linear" lines against a regular LLaMA v1 2k model with NTK scaling applied for the "NTK" lines. That isn't a fair comparison, for two reasons: one, linear-scaled finetunes only work properly with the same linear value they were finetuned on, despite what perplexity metrics say; and two, it's possible to finetune for NTK too.

Anyway I've only tested this specific quant (LLaMA 2 13B, no finetune, 4-bit 128g act-order), and with alpha_value=2, it seems to be good until ~6800, then starts devolving into incoherence and eventually noise. Not sure what's up if alpha_value=2 doesn't even get to 4.5k for you.

Also, not sure whether you've trained new LoRAs or not, but keep in mind that LoRAs made for LLaMA v1 aren't compatible with v2, since it's a complete retrain; at best they'll do nothing, at worst they'll cause brain damage.

Yes, you can run with whatever numbers you want: max_seq_len just controls some memory allocation and when to start throwing out old context, while alpha_value controls how far you can go before the model goes nuts. So, e.g., alpha_value=2 with max_seq_len=2048 is exactly the same as alpha_value=2 with max_seq_len=4096 between 0 and 2048 tokens; after that, max_seq_len=2048 starts truncating the oldest tokens, while max_seq_len=4096 waits twice as long before truncating, and so on.
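A rough sketch of the truncation behaviour described above (not exllama's actual cache code), just to illustrate that max_seq_len only bounds how much context is kept, while alpha_value only changes the RoPE scaling:

def truncate_context(token_ids, max_seq_len):
    # Keep only the most recent max_seq_len tokens; older ones fall off.
    if len(token_ids) <= max_seq_len:
        return token_ids
    return token_ids[-max_seq_len:]

# With alpha_value=2, generation is identical up to 2048 tokens whether max_seq_len
# is 2048 or 4096; the larger value just delays when the oldest tokens are dropped.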

juanps90 commented 1 year ago

Thank you for your reply. Yes, the LoRA is freshly trained on v2 and works great up to 4k.

Have you tried using a LoRA with NTK and exllama?

EyeDeck commented 1 year ago

Well, I just tried loading this LoRA (first LLaMA 2 LoRA I could find on HF) on top of this quant, using an alpha value of 6 and max_seq_len of 16384. Then I gave it the first 9 pages of The Hobbit (11899 tokens) and let it go for 6800 tokens, up to a total token count of 18699, where of course towards the end the first few thousand tokens had fallen off. Here's the output (with some linebreaks and a === inserted after generation, to separate the original text). I can't vouch for the quality of the text, but it's definitely coherent.