Closed hvisser closed 7 months ago
Thanks for the pull request! I changed it slightly to keep the default behavior, but allow setting tokenize_special
via the InferenceParameters,
like:
InferenceParameters inferParams = new InferenceParameters().setTokenizeSpecial(true);
// ...
model.generate(prompt, inferParams);
Thanks, though I wonder why one would ever not want to tokenize these special tokens if the model has them. The whole point of adding these "special" tokens is to treat a sequence as a single token, as I understand it. Moving this to the inference parameters puts the burden on the library's user, and since I spent a few hours tracking down this issue, I suspect it won't be obvious to anyone else encountering the same problem.
I agree, by default most users will want to tokenize these tokens. My reasoning was to stick to the default behavior of llama.cpp, where the parameter is false by default. I guess it's useful if you want to talk about those tokens as if they were plain text (or get answers containing them) without triggering their special functionality. I'll change the default to true in the Java binding, though.
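To make the behavior difference concrete, here is a toy sketch (not the actual llama.cpp tokenizer, and the special token string is just an illustrative example) of what the flag controls: with special-token handling enabled, a registered marker like "<|im_start|>" maps to a single token; with it disabled, the same text falls through to ordinary tokenization (character-level here for simplicity).

```java
import java.util.ArrayList;
import java.util.List;

public class SpecialTokenDemo {
    // Hypothetical special token for illustration only.
    static final String SPECIAL = "<|im_start|>";

    static List<String> tokenize(String text, boolean tokenizeSpecial) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (tokenizeSpecial && text.startsWith(SPECIAL, i)) {
                // The whole marker becomes one token.
                tokens.add(SPECIAL);
                i += SPECIAL.length();
            } else {
                // Fallback: treat the marker as plain text (char-level here).
                tokens.add(String.valueOf(text.charAt(i)));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String prompt = "<|im_start|>hi";
        // Enabled: "<|im_start|>" + "h" + "i" = 3 tokens.
        System.out.println(tokenize(prompt, true).size());
        // Disabled: 12 characters of the marker + "h" + "i" = 14 tokens.
        System.out.println(tokenize(prompt, false).size());
    }
}
```

With the flag off, the model sees the marker as literal text it can talk about; with it on, the marker triggers its special role (e.g. delimiting a chat turn), which is what most users expect.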
Changing the default is a good solution! Thanks again 😁 The parameter may be false by default in llama.cpp, but its main example always sets it to true, so using the same default makes it easy to compare the two.
Fixes #45