Hugging Face Transformers 4.34, which is quite new, has support for "chat templates" and can also tell you the size of a chat in tokens.
However, a lot of models don't have the required chat templates (yet?), and getting the chat templates for some models (e.g., Llama 2) requires special permission even when a derived quantized model is not behind a signup wall.
Use this, or something like it, to replace the hard-coded query formatting in the llama_cpp backend and to budget the query by token count instead of the current hard-coded limit of 5 messages; a sketch follows.
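
A minimal sketch of how that could look, assuming a checkpoint whose tokenizer actually ships a chat template (the model name and the example messages are placeholders, not part of the current code):

from transformers import AutoTokenizer

# Placeholder model id; it must be a checkpoint that defines a chat template
# in its tokenizer config.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-chat-model")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does apply_chat_template do?"},
]

# Render the chat into the model's own prompt format; this is what would
# replace the hard-coded query formatting in the llama_cpp backend.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# The same call with tokenize=True (the default) returns token ids, so the
# chat's size in tokens can decide how much history to include instead of a
# fixed 5-message limit.
token_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(f"{len(token_ids)} tokens")
print(prompt)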