vadi2 closed this issue 1 year ago
For that model, you'd launch with `-cpe 4 -l 8192` (or `--compress_pos_emb 4 --length 8192`), possibly reducing length if you're VRAM-limited and start OOMing once context has grown enough.
Some instructions that are going around say you can use e.g. `-cpe 2 -l 4096` (e.g. for 33B on 24GB VRAM, which OOMs around 3400-3600 tokens anyway), but you shouldn't do that: `cpe` should be set equal to whatever the finetune was done with, or the model will act stupid, regardless of what `length` is limited to.
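To make the "must match the finetune" point concrete: linear scaling just divides the position indices fed into RoPE by the compression factor, so a mismatched factor puts every token at positions the finetune never saw. A minimal sketch of the idea (illustrative only, not exllama's actual code; the function names are made up):

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies for one attention head."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rope_angles_linear(seq_len: int, head_dim: int, compress_pos_emb: float = 1.0) -> torch.Tensor:
    """Linear ('compress_pos_emb') scaling: positions are divided by the
    scale factor, so an 8192-token sequence looks like positions 0..2047
    to the model. The factor has to match the finetune (e.g. 4 for a
    SuperHOT 8k model)."""
    inv_freq = rope_frequencies(head_dim)
    positions = torch.arange(seq_len).float() / compress_pos_emb
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)

# e.g. angles for a 4x-compressed 8k context with 128-dim heads (LLaMA-like)
angles = rope_angles_linear(8192, 128, compress_pos_emb=4.0)
```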
You can also try NTK scaling by setting e.g. `-a 4` (`--alpha 4`; use it with `--length`, and instead of `-cpe`); for that you do not want one of the older linear-scaling finetunes, just use a regular 2k model. Without a finetune, `-a 2` will go up to ~3400 length before exploding, `-a 4` will do ~5500, `-a 8` will do ~9000, and beyond that I don't know.
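For comparison with the linear sketch above, NTK-aware scaling keeps integer positions and instead enlarges the RoPE base, which mostly stretches the low-frequency dimensions and leaves the high-frequency ones close to stock; that's roughly why an untuned 2k model tolerates it up to the lengths listed above. A hedged sketch of the original alpha formulation (the `alpha ** (dim / (dim - 2))` exponent is the commonly circulated form; again, not exllama's actual code):

```python
import torch

def rope_angles_ntk(seq_len: int, head_dim: int, alpha: float = 1.0,
                    base: float = 10000.0) -> torch.Tensor:
    """NTK-aware scaling: keep integer positions but raise the RoPE base by
    alpha ** (head_dim / (head_dim - 2)), which spreads out the low-frequency
    dimensions while barely touching the high-frequency ones."""
    ntk_base = base * alpha ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (ntk_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)

# e.g. the -a 4 / ~5500-token case from above, with 128-dim heads
angles = rope_angles_ntk(5500, 128, alpha=4.0)
```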
Finetuning also helps with NTK scaling, but it's not strictly necessary like it is for linear scaling. At the time of writing, the only NTK finetunes I know of are a SuperHOT 8k LoRA and Airoboros 16k, but this tech has only been around for a few days, so expect more of those in the future.
Also, these "-a
vs context" numbers will probably be outdated soon, as exllama still uses a version of NTK scaling as it was originally introduced, which has already been significantly improved upon here, and someone will probably port these changes here sooner or later.
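For the curious, the later refinements mostly change how the base is chosen; one widely discussed variant ("dynamic" NTK) derives it from the current sequence length instead of a fixed alpha. A rough sketch of that idea, assuming a native 2048-token window (not something exllama ships as of this writing):

```python
def dynamic_ntk_base(seq_len: int, head_dim: int, max_pos: int = 2048,
                     scaling_factor: float = 1.0, base: float = 10000.0) -> float:
    """'Dynamic' NTK sketch: leave the base alone while the prompt fits in the
    native window, then grow it with the current length so the effective alpha
    tracks how far past the 2k window the context has gone."""
    if seq_len <= max_pos:
        return base
    adjusted = (scaling_factor * seq_len / max_pos) - (scaling_factor - 1)
    return base * adjusted ** (head_dim / (head_dim - 2))
```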
Another vote for NTK. No tuning and better perplexity. Covers basically the reasonable range for a 30b in 24gb, even at the lowest setting.
Thanks! The larger context size with `-cpe`/`-l` worked.
What model would you recommend that I try with NTK?
Will close the issue as we got the larger context size working.
> What model would you recommend that I try with NTK?
I'm not particularly knowledgeable on this, but every model I've tried with NTK scaling isn't notably different from the same model without it, except that scaled models are more prone to hallucination until you get a larger chunk of context going. Personally I like stock LLaMA the most overall, but the same applies at regular 2k context (acknowledging that this seems to be a minority opinion and instruct-tuned models are more popular; of those, my favorite has been WizardLM-33B-V1.0-Uncensored).
From reading https://huggingface.co/TheBloke/wizard-vicuna-13B-SuperHOT-8K-GPTQ I get the sense that exllama supports context sizes greater than 2k - as part of oobabooga, anyhow.
When I load the model in the webui, the sequence length shown is 2048, and trying a prompt larger than 2048 tokens causes the Python process to peg one CPU core indefinitely.
Is a greater context size supported in the exllama webui as well?