Closed: lesters closed this pull request 3 months ago
Hey @lesters thanks for the pull request! This was indeed not possible, so it's a nice addition.
In general it's best to keep the C++ code as close as possible to the llama.cpp server code, since that makes it easier to maintain in the long term. There, I think, the chat template is loaded once when the model is initialized (so via ModelParameters instead of InferenceParameters). Do you think it's necessary to be able to change the template for each inference?
However, the server has a separate endpoint for chat completions, which the Java binding doesn't have. So to still be able to choose per inference whether to use the chat template, I think something like your PARAM_USE_CHAT_TEMPLATE is the way to go.
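The split being proposed (template fixed at the model level, a per-inference on/off switch) could be sketched roughly as follows. All class and method names here are illustrative assumptions for the sake of the example, not the actual java-llama.cpp API, and a simple placeholder string stands in for the Jinja2 templates that GGUF files really carry.

```java
// Hypothetical sketch: the template lives in ModelParameters (set once),
// while InferenceParameters carries a per-request flag akin to
// PARAM_USE_CHAT_TEMPLATE. Names are assumptions, not the real binding API.

class ModelParameters {
    String chatTemplate = ""; // loaded once, e.g. from GGUF metadata

    ModelParameters setChatTemplate(String template) {
        this.chatTemplate = template;
        return this;
    }
}

class InferenceParameters {
    boolean useChatTemplate = true; // per-request switch

    InferenceParameters setUseChatTemplate(boolean use) {
        this.useChatTemplate = use;
        return this;
    }
}

class TemplatedModel {
    private final ModelParameters params;

    TemplatedModel(ModelParameters params) {
        this.params = params;
    }

    // Builds the prompt that would actually be sent to the backend.
    // Real GGUF templates are Jinja2 strings; "{message}" is a toy stand-in.
    String buildPrompt(String userMessage, InferenceParameters infer) {
        if (infer.useChatTemplate && !params.chatTemplate.isEmpty()) {
            return params.chatTemplate.replace("{message}", userMessage);
        }
        return userMessage; // raw completion, template skipped
    }
}
```

With this shape, a caller could pass `new InferenceParameters().setUseChatTemplate(false)` to fall back to raw completion for a single request, while the template itself stays configured once on the model.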
I will review the PR in more detail tomorrow.
Thanks, @kherud. I think you're right, this is probably better as part of ModelParameters, since you most likely don't want to change it for each inference. I'll make the change.
Looks good to me, thanks for the work.
This adds the option of applying a chat template, such as the ones found in GGUF files, to the prompt before generating tokens. I'm suggesting this because I couldn't find a way to do it in the code today.
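For context, "applying a chat template" means turning role-tagged messages into the single prompt string the model was trained on. The ChatML-style markers below are one common convention used for illustration only; the template actually stored in a GGUF file's metadata is a Jinja2 string and varies per model, so this is a toy stand-in rather than what the PR implements.

```java
// Illustrative sketch of chat templating with ChatML-style markers.
// Real GGUF chat templates are per-model Jinja2 strings; this hard-codes
// one common format purely to show the transformation.

class ChatTemplateDemo {
    // Each entry of `messages` is a {role, content} pair.
    static String apply(String[][] messages) {
        StringBuilder sb = new StringBuilder();
        for (String[] message : messages) {
            sb.append("<|im_start|>").append(message[0]).append('\n')
              .append(message[1]).append("<|im_end|>\n");
        }
        sb.append("<|im_start|>assistant\n"); // generation continues from here
        return sb.toString();
    }
}
```

So a system message plus a user message would be flattened into one string ending in an open assistant turn, which is then tokenized and fed to the model as the prompt.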