Closed: lesters closed this pull request 3 months ago
Hey @lesters thanks for the pull request! This was indeed not possible, so it's a nice addition.
In general it's best to keep the C++ code as close as possible to the llama.cpp server code, since that makes it easier to maintain in the long term. There, I think, the chat template is loaded once when the model is initialized (so via ModelParameters instead of InferenceParameters). Do you think it's necessary to be able to change the template for each inference?
However, the server has a separate endpoint for chat completions, which the Java binding doesn't have. So to still be able to choose per inference whether to use the chat template, I think something like your PARAM_USE_CHAT_TEMPLATE is the way to go.
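The split being proposed (template fixed at the model level, a per-inference on/off switch) could be sketched roughly as follows. All class and method names here are illustrative assumptions for the sake of the example, not the actual java-llama.cpp API, and a simple placeholder string stands in for the Jinja2 templates that GGUF files really carry.

```java
// Hypothetical sketch: the template lives in ModelParameters (set once),
// while InferenceParameters carries a per-request flag akin to
// PARAM_USE_CHAT_TEMPLATE. Names are assumptions, not the real binding API.

class ModelParameters {
    String chatTemplate = ""; // loaded once, e.g. from GGUF metadata

    ModelParameters setChatTemplate(String template) {
        this.chatTemplate = template;
        return this;
    }
}

class InferenceParameters {
    boolean useChatTemplate = true; // per-request switch

    InferenceParameters setUseChatTemplate(boolean use) {
        this.useChatTemplate = use;
        return this;
    }
}

class TemplatedModel {
    private final ModelParameters params;

    TemplatedModel(ModelParameters params) {
        this.params = params;
    }

    // Builds the prompt that would actually be sent to the backend.
    // Real GGUF templates are Jinja2 strings; "{message}" is a toy stand-in.
    String buildPrompt(String userMessage, InferenceParameters infer) {
        if (infer.useChatTemplate && !params.chatTemplate.isEmpty()) {
            return params.chatTemplate.replace("{message}", userMessage);
        }
        return userMessage; // raw completion, template skipped
    }
}
```

With this shape, a caller could pass `new InferenceParameters().setUseChatTemplate(false)` to fall back to raw completion for a single request, while the template itself stays configured once on the model.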
I will review the PR in more detail tomorrow.
Thanks, @kherud. I think you're right, this is probably better as part of ModelParameters, since you most likely don't want to change it for each inference. I'll make the change.
Looks good to me, thanks for the work.
This adds the option of applying a chat template, such as the ones found in GGUF files, to the prompt before generating tokens. I'm suggesting this because I couldn't find a way to do it in the code today.
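For context, "applying a chat template" means turning role-tagged messages into the single prompt string the model was trained on. The ChatML-style markers below are one common convention used for illustration only; the template actually stored in a GGUF file's metadata is a Jinja2 string and varies per model, so this is a toy stand-in rather than what the PR implements.

```java
// Illustrative sketch of chat templating with ChatML-style markers.
// Real GGUF chat templates are per-model Jinja2 strings; this hard-codes
// one common format purely to show the transformation.

class ChatTemplateDemo {
    // Each entry of `messages` is a {role, content} pair.
    static String apply(String[][] messages) {
        StringBuilder sb = new StringBuilder();
        for (String[] message : messages) {
            sb.append("<|im_start|>").append(message[0]).append('\n')
              .append(message[1]).append("<|im_end|>\n");
        }
        sb.append("<|im_start|>assistant\n"); // generation continues from here
        return sb.toString();
    }
}
```

So a system message plus a user message would be flattened into one string ending in an open assistant turn, which is then tokenized and fed to the model as the prompt.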