Hugging Face Transformers 4.34, which is quite new, has support for "chat templates" and can also tell you the size of a chat in tokens.
However, a lot of models don't have the required chat templates (yet?), and getting the chat templates for some models (e.g., Llama 2) requires special permission even when a derived quantized model is not behind a signup wall.
Use this, or something like it, to replace the hard-coded query formatting in the llama_cpp backend and to budget the query by token count instead of the current hard-coded limit of 5 messages; a sketch follows.
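
A minimal sketch of how that could look, assuming a checkpoint whose tokenizer actually ships a chat template (the model name and the example messages are placeholders, not part of the current code):

from transformers import AutoTokenizer

# Placeholder model id; it must be a checkpoint that defines a chat template
# in its tokenizer config.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-chat-model")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does apply_chat_template do?"},
]

# Render the chat into the model's own prompt format; this is what would
# replace the hard-coded query formatting in the llama_cpp backend.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# The same call with tokenize=True (the default) returns token ids, so the
# chat's size in tokens can decide how much history to include instead of a
# fixed 5-message limit.
token_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(f"{len(token_ids)} tokens")
print(prompt)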