vadi2 closed this issue 1 year ago
For that model, you'd launch with `-cpe 4 -l 8192` (or `--compress_pos_emb 4 --length 8192`), possibly reducing length if you're VRAM-limited and start OOMing once context has grown enough.
Some instructions that are going around say you can use e.g. `-cpe 2 -l 4096` (e.g. for 33B on 24GB VRAM, which OOMs around 3400-3600 tokens anyway), but you shouldn't do that: `cpe` should be set equal to whatever the finetune was done with, or the model will act stupid, regardless of what `length` is limited to.
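To make the "must match the finetune" point concrete: linear scaling just divides the position indices fed into RoPE by the compression factor, so a mismatched factor puts every token at positions the finetune never saw. A minimal sketch of the idea (illustrative only, not exllama's actual code; the function names are made up):

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies for one attention head."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rope_angles_linear(seq_len: int, head_dim: int, compress_pos_emb: float = 1.0) -> torch.Tensor:
    """Linear ('compress_pos_emb') scaling: positions are divided by the
    scale factor, so an 8192-token sequence looks like positions 0..2047
    to the model. The factor has to match the finetune (e.g. 4 for a
    SuperHOT 8k model)."""
    inv_freq = rope_frequencies(head_dim)
    positions = torch.arange(seq_len).float() / compress_pos_emb
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)

# e.g. angles for a 4x-compressed 8k context with 128-dim heads (LLaMA-like)
angles = rope_angles_linear(8192, 128, compress_pos_emb=4.0)
```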
You can also try NTK scaling by setting e.g. `-a 4` (`--alpha 4`; use it with `--length`, and instead of `-cpe`); for that you do not want one of the older linear-scaling finetunes, just use a regular 2k model. Without a finetune, `-a 2` will go up to ~3400 length before exploding, `-a 4` will do ~5500, `-a 8` will do ~9000, and beyond that I don't know.
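For comparison with the linear sketch above, NTK-aware scaling keeps integer positions and instead enlarges the RoPE base, which mostly stretches the low-frequency dimensions and leaves the high-frequency ones close to stock; that's roughly why an untuned 2k model tolerates it up to the lengths listed above. A hedged sketch of the original alpha formulation (the `alpha ** (dim / (dim - 2))` exponent is the commonly circulated form; again, not exllama's actual code):

```python
import torch

def rope_angles_ntk(seq_len: int, head_dim: int, alpha: float = 1.0,
                    base: float = 10000.0) -> torch.Tensor:
    """NTK-aware scaling: keep integer positions but raise the RoPE base by
    alpha ** (head_dim / (head_dim - 2)), which spreads out the low-frequency
    dimensions while barely touching the high-frequency ones."""
    ntk_base = base * alpha ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (ntk_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)

# e.g. the -a 4 / ~5500-token case from above, with 128-dim heads
angles = rope_angles_ntk(5500, 128, alpha=4.0)
```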
Finetuning also helps with NTK scaling, but it's not strictly necessary like it is for linear scaling. At the time of writing, the only NTK finetunes I know of are a SuperHOT 8k LoRA and Airoboros 16k, but this tech has only been around for a few days, so expect more of those in the future.
Also, these "-a
vs context" numbers will probably be outdated soon, as exllama still uses a version of NTK scaling as it was originally introduced, which has already been significantly improved upon here, and someone will probably port these changes here sooner or later.
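For the curious, the later refinements mostly change how the base is chosen; one widely discussed variant ("dynamic" NTK) derives it from the current sequence length instead of a fixed alpha. A rough sketch of that idea, assuming a native 2048-token window (not something exllama ships as of this writing):

```python
def dynamic_ntk_base(seq_len: int, head_dim: int, max_pos: int = 2048,
                     scaling_factor: float = 1.0, base: float = 10000.0) -> float:
    """'Dynamic' NTK sketch: leave the base alone while the prompt fits in the
    native window, then grow it with the current length so the effective alpha
    tracks how far past the 2k window the context has gone."""
    if seq_len <= max_pos:
        return base
    adjusted = (scaling_factor * seq_len / max_pos) - (scaling_factor - 1)
    return base * adjusted ** (head_dim / (head_dim - 2))
```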
Another vote for NTK. No tuning and better perplexity. Covers basically the reasonable range for a 30b in 24gb, even at the lowest setting.
Thanks! The larger context size with `-cpe`/`-l` worked.
What model would you recommend that I try with NTK?
Will close the issue as we got the larger context size working.
> What model would you recommend that I try with NTK?
I'm not particularly knowledgeable on this, but every model I've tried with NTK scaling isn't notably different from the same model without it, except that scaled models are more prone to hallucination until you get a larger chunk of context going. Personally I like stock LLaMA the most overall, but the same applies at regular 2k context (acknowledging that this seems to be a minority opinion and instruct-tuned models are more popular; of those, my favorite has been WizardLM-33B-V1.0-Uncensored).
From reading https://huggingface.co/TheBloke/wizard-vicuna-13B-SuperHOT-8K-GPTQ I get the sense that exllama supports context sizes greater than 2k - as part of oobabooga, anyhow.
When I load the model in the webui, the sequence length shown is 2048, and trying a prompt larger than 2048 tokens causes the Python process to peg one CPU core indefinitely.
Is a greater context size supported in the exllama webui as well?