nvms / wingman

Your pair programming wingman. Supports OpenAI, Anthropic, or any LLM on your local inference server.
https://marketplace.visualstudio.com/items?itemName=nvms.ai-wingman
ISC License

Context window and max_tokens management #17

Open synw opened 1 year ago

synw commented 1 year ago

Running the "Write unit tests" command with a local Llama 2 model, I get an error message because of the default max_tokens param of 1000:

llama_predict: error: prompt is too long (1133 tokens, max 1020)

I would like to be able to set the context window size of the model (4096 tokens for Llama 2). That way the max_tokens param value could be calculated automatically using the llama-tokenizer-js lib:

import llamaTokenizer from 'llama-tokenizer-js';

// Tokens left for the completion = context window minus the prompt's token count.
const promptNTokens = llamaTokenizer.encode(prompt).length;
const maxTokens = model_context_window_size_param - promptNTokens;

[Edit]: a different tokenizer would be needed for OpenAI; this one only works for local Llama models.
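
A rough sketch of how both backends could be handled, assuming js-tiktoken (or any GPT-compatible tokenizer) for the OpenAI side; maxTokensFor and contextWindowSize are hypothetical names standing in for the proposed setting:

import llamaTokenizer from 'llama-tokenizer-js';
import { encodingForModel } from 'js-tiktoken'; // assumed tokenizer for OpenAI models

// Hypothetical helper: contextWindowSize would come from the new extension setting.
function maxTokensFor(prompt: string, contextWindowSize: number, provider: 'local' | 'openai'): number {
  const promptNTokens =
    provider === 'local'
      ? llamaTokenizer.encode(prompt).length
      : encodingForModel('gpt-3.5-turbo').encode(prompt).length;

  // Whatever is left in the window goes to the completion; clamp at zero.
  return Math.max(contextWindowSize - promptNTokens, 0);
}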