oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Set n_ctx for llama.cpp models when loading/reloading #1872

Closed digiwombat closed 1 year ago

digiwombat commented 1 year ago

Currently, n_ctx is locked to 2048, but with people starting to experiment with ALiBi models (BluemoonRP, MPT whenever that gets sorted out properly), RedPajama talking about Hyena, and StableLM potentially aiming for 4k context, the ability to bump the context size for llama.cpp models is going to be very useful going forward, especially since most of those models are likely to be run on CPU by people on typical consumer hardware.

I also think the expected behavior is that whatever context limit I set in the UI should be passed through to the inference backend. Requiring a model reload when the setting changes is fine, but the value should be passed through when a ggml model is loaded. A "reload model on context size change" setting could be nice to have if there's a clean spot for it, assuming it would be useful for more than just ggml files. Maybe instead of a checkbox, just a convenient button that pops up after the value changes to cue people to reload the model, since knowing when the user is done adjusting the context size is hard and reloading is fairly heavy.
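As a rough sketch of what passing the UI value through might look like: the `n_ctx` keyword is the real llama-cpp-python constructor parameter, but the `load_ggml_model` helper and the model path below are just illustrative assumptions, not the webui's actual loader code.

```python
from llama_cpp import Llama

def load_ggml_model(model_path: str, n_ctx: int = 2048) -> Llama:
    """Load a ggml model, forwarding the context size chosen in the UI.

    Hypothetical helper: the point is that n_ctx is forwarded to the
    Llama constructor instead of being hard-coded to 2048.
    """
    return Llama(model_path=model_path, n_ctx=n_ctx)

# e.g. reload with a larger context window for an ALiBi model
model = load_ggml_model("models/bluemoonrp-13b.ggml.bin", n_ctx=4096)
```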

Likewise, I think --n_ctx should be a flag that can be set by people who want to automate loading larger-context models from sh/bat scripts.
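Something along these lines, assuming an argparse-based launcher; the flag name follows this issue, but the parser wiring here is only a sketch, not the webui's actual argument handling:

```python
import argparse

parser = argparse.ArgumentParser()
# Proposed flag: lets sh/bat scripts pick the context size at launch time.
parser.add_argument("--n_ctx", type=int, default=2048,
                    help="Context size passed to llama.cpp when loading ggml models.")
args = parser.parse_args()

# Later, when the ggml model is loaded:
# model = Llama(model_path=..., n_ctx=args.n_ctx)
```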

LaaZa commented 1 year ago

Correct me if I'm wrong, but doesn't textgen use llama.cpp specifically via llama-cpp-python, which is only for LLaMA models? It might still be useful to be able to change this value, though.

digiwombat commented 1 year ago

I'm pretty sure there was recently a merge for llama.cpp that added loading for GPT NeoX models, though I may have misread that somewhere and it might not be in yet. Either way, it's clear that supporting GPT NeoX in llama.cpp isn't being treated as a separate-project sort of thing.

Discussion here: https://github.com/ggerganov/llama.cpp/issues/1063
And MPT here: https://github.com/ggerganov/llama.cpp/issues/1333

And BluemoonRP 13B is a LLaMA model with ALiBi support baked in (or however that should be phrased), so it is loadable and usable today in base llama.cpp without changes.