turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Codellama 16K context length? #261

Open ShahZ181 opened 1 year ago

ShahZ181 commented 1 year ago

Has anyone gotten 16k context length with CodeLlama or Llama 2? I have tried multiple models, but they all start producing gibberish once the context window gets past 4096. I am using exllama and changed all the settings that should be necessary, but it still doesn't work.

I am running python3 data_new.py -d /home/shahrukh/Documents/vicuana-7/models/TheBloke_Airoboros-c34B-2.1-GPTQ -gs 13,13,13,0 --compress_pos_emb 2 -alpha 4 -l 8000

However, when using either CodeLlama or Llama 2, the context window cannot be increased past 4096. Does anyone know why that might be?
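
For reference, here is roughly what those command-line flags translate to when loading the model from Python, as a minimal sketch only. It assumes the ExLlamaConfig fields max_seq_len, compress_pos_emb and alpha_value and the set_auto_map() helper from the exllama repo, and the paths are placeholders:

# Sketch only: programmatic equivalent of the command-line flags above
# (field names assumed from the exllama repo; paths are placeholders).
from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig("/path/to/model/config.json")    # placeholder path to config.json
config.model_path = "/path/to/model/model.safetensors"  # placeholder path to the weights
config.max_seq_len = 8000          # -l 8000: sequence length the cache is sized for
config.compress_pos_emb = 2.0      # --compress_pos_emb 2: linear RoPE position scaling
config.alpha_value = 4.0           # -alpha 4: NTK RoPE alpha scaling
config.set_auto_map("13,13,13,0")  # -gs 13,13,13,0: VRAM split across GPUs

model = ExLlama(config)
cache = ExLlamaCache(model)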

Ph0rk0z commented 1 year ago

In my experience, Airoboros doesn't use compress_pos_emb, and I found the best perplexity was obtained using alpha 2.7. The default 100k base gave lower results when I ran it as a LoRA.

8k has worked for me like that.
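
In config terms that suggestion would look roughly like the sketch below, assuming the same ExLlamaConfig fields as above; the exact alpha that works best will vary by model:

# Sketch only: NTK alpha scaling without linear position compression.
config.compress_pos_emb = 1.0  # leave positional embeddings unscaled
config.alpha_value = 2.7       # alpha reported above to give the best perplexity
config.max_seq_len = 8192      # size the cache for the longer context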

RonanKMcGovern commented 1 year ago

What params do I need to change to support longer context? I'd appreciate it if someone could point me to where in the docs I can find this.

I'm getting:

RuntimeError: start (2048) + length (1265) exceeds dimension size (2048).

when I run:

# Imports from the exllama repo (the script is assumed to live alongside model.py, tokenizer.py and generator.py)
import torch
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# model_config_path, model_path, tokenizer_path and escaped_transcript are defined earlier in the script

# Create config, model, tokenizer and generator
config = ExLlamaConfig(model_config_path)  # create config from config.json
config.model_path = model_path  # supply path to model weights file

model = ExLlama(config)  # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)  # create tokenizer from tokenizer model file

cache = ExLlamaCache(model)  # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)  # create generator

# Configure generator
generator.disallow_tokens([tokenizer.eos_token_id])
generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.05
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

system_prompt="You are a helpful assistant. You are an expert on summarisation."
user_prompt="Provide a three bullet summary of the above content."

prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{escaped_transcript}\n\n{user_prompt.strip()} {E_INST}\n\n"

# # Modify the prompt to ask for a summary of escaped_transcript
# prompt = f"Please summarize the following text:\n\n{escaped_transcript}\n\n"

print(prompt, end="")

# Produce a simple generation
output = generator.generate_simple(prompt, max_new_tokens=200)

# Print the generated summary, omitting the prompt
print(output[len(prompt):])
torch.cuda.empty_cache()
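
For what it's worth, that traceback looks like the cache was allocated for the default 2048-token limit. Below is a minimal sketch of raising it before the model and cache are created, again assuming the max_seq_len and compress_pos_emb fields from the exllama repo:

# Sketch only: set context-length fields on the config *before* creating
# ExLlama and ExLlamaCache, since the cache tensors are sized at construction.
config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.max_seq_len = 4096      # must cover the prompt plus max_new_tokens
config.compress_pos_emb = 2.0  # scale RoPE if the base model was trained at 2048

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)
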
turboderp commented 1 year ago

There are a couple of parameters in the config (ExLlamaConfig) related to context length: