marella / ctransformers

Python bindings for Transformer models implemented in C/C++ using the GGML library.

How to handle the token limitation for an LLM response? #164

Open phoenixthinker opened 8 months ago

phoenixthinker commented 8 months ago

Hi,

When the LLM generates a long answer that exceeds 512 tokens, the program starts printing warnings like this:

WARNING:ctransformers:Number of tokens (513) exceeded maximum context length (512)
WARNING:ctransformers:Number of tokens (514) exceeded maximum context length (512)
...

In my use case, for example, I am using a 7B LLM for a Q&A application. The LLM response (the generated answer) is always longer than 512 tokens. Can anyone suggest a solution or show some simple code to handle this problem? Thanks.
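For context, a minimal sketch of the kind of default setup that produces these warnings. The model name, file, and prompt below are placeholders (not taken from this thread); the point is that when no context_length is specified, the effective context length here is 512 (as the warning shows), so any generation that runs past it logs the message above.

```python
from ctransformers import AutoModelForCausalLM

# Placeholder model, loaded without specifying context_length,
# so the effective 512-token context applies.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    model_file="llama-2-7b-chat.Q4_K_M.gguf",
    model_type="llama",
)

# A long answer will eventually run past the 512-token context and trigger:
# WARNING:ctransformers:Number of tokens (513) exceeded maximum context length (512)
text = llm("Explain the transformer architecture in detail:", max_new_tokens=1024)
print(text)
```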

phoenixthinker commented 8 months ago

config = {
    'max_new_tokens': 2048,
    'context_length': 8192,  # <------ Solved by adding this line
    'repetition_penalty': 1.1,
    'temperature': 0.1,
    'top_k': 50,
    'top_p': 0.9,
    'stream': True,  # streaming per word/token
    'threads': int(os.cpu_count() / 2),  # adjust for your CPU
}
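Reusing the config dict above, here is a minimal sketch of one way such a dict is typically consumed; the model name and file are placeholders. ctransformers' from_pretrained accepts these keys as keyword arguments, so the dict can be unpacked into the call. (If the dict is instead intended for LangChain's CTransformers wrapper, it is usually passed there as config=config.)

```python
from ctransformers import AutoModelForCausalLM

# `config` is the dict defined just above (it uses os.cpu_count(),
# so `import os` is needed where it is defined).
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",   # placeholder model
    model_file="llama-2-7b-chat.Q4_K_M.gguf",
    model_type="llama",
    **config,
)

# Streaming generation: with stream=True the call yields tokens as they
# are generated (explicit here; the config's 'stream': True is expected
# to act as the default).
for token in llm("Question: ...\nAnswer:", stream=True):
    print(token, end="", flush=True)
```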

hdnh2006 commented 6 months ago

config = {
    'max_new_tokens': 2048,
    'context_length': 8192,  # <------ Solved by adding this line
    'repetition_penalty': 1.1,
    'temperature': 0.1,
    'top_k': 50,
    'top_p': 0.9,
    'stream': True,  # streaming per word/token
    'threads': int(os.cpu_count() / 2),  # adjust for your CPU
}

Did you modify it like this?

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/openchat_3.5-GGUF",
    model_file="openchat_3.5.Q5_K_M.gguf",
    model_type="mistral",
    gpu_layers=0,
    max_new_tokens=1024,
    context_length=8192,
)

How do you pass these args?

Thanks
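For reference, a minimal sketch of where these settings can go, based on ctransformers' documented parameters (the prompt is a placeholder): context_length is a load-time setting passed to from_pretrained, while sampling parameters such as max_new_tokens, temperature, top_k, top_p, and repetition_penalty can either be passed at load time the same way or overridden on each call.

```python
from ctransformers import AutoModelForCausalLM

# Load-time settings: context_length (and gpu_layers) belong here.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/openchat_3.5-GGUF",
    model_file="openchat_3.5.Q5_K_M.gguf",
    model_type="mistral",
    gpu_layers=0,
    context_length=8192,
)

# Generation settings can also be supplied per call.
text = llm(
    "Q: Why does the sky appear blue?\nA:",  # placeholder prompt
    max_new_tokens=2048,
    temperature=0.1,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,
)
print(text)
```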