marella / ctransformers

Python bindings for Transformer models implemented in C/C++ using the GGML library.
MIT License

Something wrong with a generator #180

Closed: yukiarimo closed this issue 7 months ago

yukiarimo commented 7 months ago

My first approach

from ctransformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    config["server"]["models_dir"] + config["server"]["default_model_file"],
    model_type='llama2',
    max_new_tokens=config["ai"]["max_new_tokens"],
    context_length=config["ai"]["context_length"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    seed=config["ai"]["seed"],
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    stop=config["ai"]["stop"],
    batch_size=config["ai"]["batch_size"],
    gpu_layers=config["ai"]["gpu_layers"]
)

print("TOKENS: ", len(model.tokenize(new_history)))
# A lot

new_history_crop = model.tokenize(new_history)
# Take only allowed length - 3 elements from the end
new_history_crop = new_history_crop[-(config["ai"]["context_length"] - 3):]
print("CONTEXT LENGTH: ", -(config["ai"]["context_length"] - 3))

# This will be 509 (allowed 512)

print(len(new_history_crop))
response = model(model.detokenize(new_history_crop), stream=False)

But the generation results in these errors:

Number of tokens (513) exceeded maximum context length (512).
Number of tokens (514) exceeded maximum context length (512).
Number of tokens (515) exceeded maximum context length (512).
Number of tokens (516) exceeded maximum context length (512).
Number of tokens (517) exceeded maximum context length (512).
Number of tokens (518) exceeded maximum context length (512).
Number of tokens (519) exceeded maximum context length (512).
Number of tokens (520) exceeded maximum context length (512).

...and so on.

Question: Why?

My second approach

# new_history_crop is a list of 509 tokens

response = model.generate(
    tokens=new_history_crop,
    top_k=config["ai"]["top_k"],
    top_p=config["ai"]["top_p"],
    temperature=config["ai"]["temperature"],
    repetition_penalty=config["ai"]["repetition_penalty"],
    last_n_tokens=config["ai"]["last_n_tokens"],
    batch_size=config["ai"]["batch_size"],
    threads=config["ai"]["threads"],
)

response = model.detokenize(list(response))

And this works! But there are two problems:

1. It's slower.
2. It doesn't support all of the parameters from the first approach (e.g. max_new_tokens and stop).
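
As a side note, model.generate() yields token ids one at a time, so the output can also be detokenized incrementally instead of being collected with list(response) first. A rough sketch, reusing the same model and new_history_crop as above:

response_text = ""
for token in model.generate(
    tokens=new_history_crop,
    temperature=config["ai"]["temperature"],
    threads=config["ai"]["threads"],
):
    piece = model.detokenize([token])  # decode each token id as it arrives
    response_text += piece
    print(piece, end="", flush=True)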

Please help me fix this and/or explain why this happens.

yukiarimo commented 7 months ago

I found the answer myself:

  1. The second approach is not worth using.
  2. The context length budget covers both the model's input and its output: the prompt tokens plus the newly generated tokens must together stay within context_length, which is why a 509-token prompt still overflows a 512-token context once a few tokens have been generated.
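
A minimal sketch of cropping the prompt so it leaves room for the output, reusing the config, model, and new_history names from above (assuming max_new_tokens < context_length):

# The prompt may use at most context_length - max_new_tokens tokens,
# because generated tokens also count against the context window.
prompt_budget = config["ai"]["context_length"] - config["ai"]["max_new_tokens"]

new_history_crop = model.tokenize(new_history)[-prompt_budget:]
response = model(model.detokenize(new_history_crop), stream=False)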