turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Increase context length #322

Closed virentakia closed 2 weeks ago

virentakia commented 5 months ago

The inference speed is amazing - excellent work.

Is it possible to increase the context length of models?

Using the "Solar" model - https://huggingface.co/bartowski/Nous-Hermes-2-SOLAR-10.7B-exl2/tree/8_0 - with the following config:

{
  "_name_or_path": "upstage/SOLAR-10.7B-v1.0",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 48,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.37.0.dev0",
  "use_cache": false,
  "vocab_size": 32002
}

What options are available for extending the context, and are there any examples showing how to use them?

DocShotgun commented 5 months ago

There's no magic solution to increasing context length beyond what the model was designed for. You can load the model at double its native context length and keep it coherent by setting RoPE alpha to roughly 2.63. However, the further you stretch a model past the context length it was trained for, the more its quality will degrade.
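
For reference, here is a minimal sketch of how that could look with the exllamav2 Python API. The model path is a placeholder, and the config attribute names reflect my reading of the loader settings, so double-check against the scripts in the repo's examples directory:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

# Placeholder path; point this at the downloaded EXL2 model directory
config = ExLlamaV2Config()
config.model_dir = "/path/to/Nous-Hermes-2-SOLAR-10.7B-exl2"
config.prepare()

# Request double the native 4096-token context and apply NTK-style RoPE
# (alpha) scaling, using the ~2.63 value suggested above
config.max_seq_len = 8192
config.scale_alpha_value = 2.63

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)   # cache is sized to config.max_seq_len
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

I believe the bundled example scripts expose the same settings as command-line flags (context length and RoPE alpha), so you can experiment without writing code; check the examples directory for the exact flag names. And keep the caveat above in mind: the larger the stretch, the more quality degrades.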