marella / ctransformers

Python bindings for Transformer models implemented in C/C++ using the GGML library.
MIT License

Something wrong with a generator #180

Closed: yukiarimo closed this issue 7 months ago

yukiarimo commented 7 months ago

My first approach

from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    config["server"]["models_dir"] + config["server"]["default_model_file"],
    model_type='llama2',
    max_new_tokens=config["ai"]["max_new_tokens"],
    context_length=config["ai"]["context_length"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    seed=config["ai"]["seed"],
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    stop=config["ai"]["stop"],
    batch_size=config["ai"]["batch_size"],
    gpu_layers=config["ai"]["gpu_layers"]
)

print("TOKENS: ", len(model.tokenize(new_history)))
# A lot

new_history_crop = model.tokenize(new_history)
# Take only allowed length - 3 elements from the end
new_history_crop = new_history_crop[-(config["ai"]["context_length"] - 3):]
print("CONTEXT LENGTH: ", -(config["ai"]["context_length"] - 3))

# This will be 509 (allowed 512)

print(len(new_history_crop))
response = model(model.detokenize(new_history_crop), stream=False)

But the generation results in these errors:

Number of tokens (513) exceeded maximum context length (512).
Number of tokens (514) exceeded maximum context length (512).
Number of tokens (515) exceeded maximum context length (512).
Number of tokens (516) exceeded maximum context length (512).
Number of tokens (517) exceeded maximum context length (512).
Number of tokens (518) exceeded maximum context length (512).
Number of tokens (519) exceeded maximum context length (512).
Number of tokens (520) exceeded maximum context length (512).

...and so on.

Question: Why?

My second approach

# new_history_crop is a list of 509 tokens

response = model.generate(
    tokens=new_history_crop,
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    batch_size=config["ai"]["batch_size"],
    threads=config["ai"]["threads"],
)

response = model.detokenize(list(response))

And this works! But there are two problems:

1. It's slower.
2. It doesn't support all of the parameters from the first approach (e.g. max_new_tokens and stop).
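
As a side note, model.generate() yields token ids one at a time, so the output can also be detokenized incrementally instead of being collected with list(response) first. A rough sketch, reusing the same model and new_history_crop as above:

response_text = ""
for token in model.generate(
    tokens=new_history_crop,
    temperature=config["ai"]["temperature"],
    threads=config["ai"]["threads"],
):
    piece = model.detokenize([token])  # decode each token id as it arrives
    response_text += piece
    print(piece, end="", flush=True)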

Please help me fix this and/or explain why this happens.

yukiarimo commented 7 months ago

I found the answer myself:

  1. The second approach is not worth using.
  2. The context length budget covers both the model's input and its output: the prompt tokens plus the newly generated tokens must together stay within context_length, which is why a 509-token prompt still overflows a 512-token context once a few tokens have been generated.
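
A minimal sketch of cropping the prompt so it leaves room for the output, reusing the config, model, and new_history names from above (assuming max_new_tokens < context_length):

# The prompt may use at most context_length - max_new_tokens tokens,
# because generated tokens also count against the context window.
prompt_budget = config["ai"]["context_length"] - config["ai"]["max_new_tokens"]

new_history_crop = model.tokenize(new_history)[-prompt_budget:]
response = model(model.detokenize(new_history_crop), stream=False)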