ShahZ181 opened this issue 1 year ago
In my experience airoboros doesn't use compress_pos_emb, and I found the best perplexity was obtained using alpha 2.7; the default 100k base gave worse results when I ran it as a LoRA.
8k context has worked for me like that.
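Roughly what that looks like when loading (a sketch only; the paths are placeholders and 2.7 is just the value that worked best for me):

# Sketch of the config for airoboros (placeholder paths, illustrative values)
config = ExLlamaConfig(model_config_path)   # config.json for the airoboros GPTQ model
config.model_path = model_path              # path to the quantized weights
config.compress_pos_emb = 1.0               # no positional compression for airoboros
config.alpha_value = 2.7                    # NTK alpha that gave the best perplexity in my runs
model = ExLlama(config)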
What params do I need to change to support longer context? I'd appreciate it if someone could point me to where in the docs I can find this out.
I'm getting:
RuntimeError: start (2048) + length (1265) exceeds dimension size (2048).
when I run:
import torch

# imports as in the exllama repo's example scripts
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# Create config, model, tokenizer and generator
config = ExLlamaConfig(model_config_path) # create config from config.json
config.model_path = model_path # supply path to model weights file
model = ExLlama(config) # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path) # create tokenizer from tokenizer model file
cache = ExLlamaCache(model) # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache) # create generator
# Configure generator
generator.disallow_tokens([tokenizer.eos_token_id])
generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.05
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
system_prompt="You are a helpful assistant. You are an expert on summarisation."
user_prompt="Provide a three bullet summary of the above content."
prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{escaped_transcript}\n\n{user_prompt.strip()} {E_INST}\n\n"
# # Modify the prompt to ask for a summary of escaped_transcript
# prompt = f"Please summarize the following text:\n\n{escaped_transcript}\n\n"
print(prompt, end="")
# Produce a simple generation
output = generator.generate_simple(prompt, max_new_tokens=200)
# Print the generated summary, omitting the prompt
print(output[len(prompt):])
torch.cuda.empty_cache()
There are a couple of parameters in the config (ExLlamaConfig) related to context length:

max_seq_len is the main one. It should just be 16384 for a 16k model, if you have the VRAM to hold a cache of that size.

max_input_len and max_attention_size are used to limit the number of tokens forwarded through the model at once. The input to model.forward is transparently chunked, so this is just a tradeoff between VRAM and speed. If you have VRAM to spare and you want more tokens per second for prompt processing, you could consider increasing it.

compress_pos_emb is the RoPE scaling factor. You would only change this for a model that's finetuned to use a particular value.

alpha_value sets the "NTK" RoPE base, related to what Meta now calls "theta" for CodeLlama. ExLlama should automatically read the correct theta value from the config file, so you shouldn't need to change this.

use_flash_attn_2 may help performance on very long prompts. It hasn't been performing well at all in my tests, but if you're doing things like summaries on 16k inputs, it could be worth trying.
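As a rough sketch (the paths are placeholders and the values are illustrative; adjust them to your model and VRAM), these settings go on the config before the model and cache are created:

# Rough sketch: apply context settings before creating the model
from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig(model_config_path)   # config.json of the 16k model
config.model_path = model_path              # path to the GPTQ weights file
config.max_seq_len = 16384                  # main setting: the cache is sized from this
config.compress_pos_emb = 1.0               # RoPE scaling; only change if the finetune expects it
config.alpha_value = 1.0                    # NTK alpha; usually left alone
# config.max_input_len = 4096               # optional: bigger forward chunks = more VRAM, faster prompt processing

model = ExLlama(config)                     # weights load with the new context settings
cache = ExLlamaCache(model)                 # VRAM used by the cache scales with max_seq_len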
Has anyone gotten 16k context length with CodeLlama or Llama 2? I have tried multiple models, but they all start producing gibberish once the context window gets past 4096. I am using exllama and I changed all the settings that should be necessary, but it still doesn't work.
I am running python3 data_new.py -d /home/shahrukh/Documents/vicuana-7/models/TheBloke_Airoboros-c34B-2.1-GPTQ -gs 13,13,13,0 --compress_pos_emb 2 -alpha 4 -l 8000
However, when using either CodeLlama or Llama 2, the context window cannot be increased past 4096. Does anyone know why that might be?