knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License

Add 'maxTokens' parameter to the 'encode' and 'encodeOrdinary' methods #12

Closed radoslavdodek closed 1 year ago

radoslavdodek commented 1 year ago

Hello Philip. Thank you for all the work you have put into this library. It is very useful indeed!

May I propose to add maxTokens parameter to encode methods?

Motivation: Quite often, we must ensure that we don't exceed a certain number of tokens when encoding some text. That parameter could help in such cases.

Please let me know your thoughts.

Have a great day, Rado.

tox-p commented 1 year ago

Hey,

can you give me some more details about your use-case?

Currently, I fail to see why you could not simply do

final List<Integer> tokens = enc.encode(input);
final List<Integer> truncatedTokens = tokens.subList(0, Math.min(maxTokens, tokens.size()));

Sure, that wastes some CPU cycles, but if you are not trying to encode an enormous text, it should basically be negligible.

Currently, I am slightly opposed to this addition, for the following reasons:

radoslavdodek commented 1 year ago

Thank you for your prompt reply! :)

But all that said, I understand your reasoning.

tox-p commented 1 year ago

Hmm, interesting use-case 🙂 I guess there is little harm in also supporting this, as long as the caveat that it breaks the encode-decode round trip is well documented.

Also, I just double-checked the OpenAI documentation, and at least for the non-chat completion endpoint, an array of tokens is also a valid input. That would make this method quite handy for use-cases where lossy input is acceptable, since no decoding of the tokens would be necessary before prompting the model.

I'll take a look at your PR 🙂

radoslavdodek commented 1 year ago

Thanks, Philip! :-)

radoslavdodek commented 1 year ago

@tox-p I was thinking about how to avoid issues with the round-tripping between encoding and decoding. However, I'm not sure if you want to have special handling for such a case.

Let's say we have the scenario you described in the comment you linked above:

What we could do is the following:

So in the example above, the returned tokens will be:

[40, 3021]

Doing so would avoid the issue with multibyte characters. It will, of course, have an impact on performance, but only when the maxTokens parameter is provided; your original use case will remain the same performance-wise.

What do you think?
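The idea above can be sketched without the library: after truncating the token list and decoding it to bytes, back off any trailing bytes that form an incomplete UTF-8 sequence, so the result always decodes to valid text. This is a self-contained illustration of the multibyte-boundary check, not JTokkit's actual implementation; all names here are illustrative:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Truncation {

    /**
     * Drops any trailing bytes that form an incomplete UTF-8 sequence,
     * so the result always decodes to valid text. This mirrors the idea
     * of discarding a final token that would split a multibyte character.
     */
    static byte[] trimIncompleteUtf8(byte[] bytes) {
        int end = bytes.length;
        // Walk back over trailing continuation bytes (10xxxxxx) to the lead byte.
        int i = end - 1;
        while (i >= 0 && (bytes[i] & 0xC0) == 0x80) {
            i--;
        }
        if (i < 0) {
            return new byte[0]; // nothing but continuation bytes
        }
        int lead = bytes[i] & 0xFF;
        int expected;
        if (lead < 0x80) expected = 1;             // ASCII
        else if ((lead & 0xE0) == 0xC0) expected = 2; // 110xxxxx: 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) expected = 3; // 1110xxxx: 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) expected = 4; // 11110xxx: 4-byte sequence
        else expected = 1;                         // invalid lead byte, leave as-is
        if (end - i < expected) {
            end = i; // final sequence is incomplete, drop it entirely
        }
        byte[] out = new byte[end];
        System.arraycopy(bytes, 0, out, 0, end);
        return out;
    }

    public static void main(String[] args) {
        byte[] full = "I love 🍕".getBytes(StandardCharsets.UTF_8);
        // Simulate a truncation that cuts through the 4-byte emoji:
        byte[] cut = new byte[full.length - 2];
        System.arraycopy(full, 0, cut, 0, cut.length);
        byte[] safe = trimIncompleteUtf8(cut);
        System.out.println(new String(safe, StandardCharsets.UTF_8)); // prints "I love " (emoji dropped)
    }
}
```

The same principle applied at the token level would decode the candidate last token and drop it if its bytes leave a sequence unfinished, which is why the example above returns [40, 3021] rather than a third token that splits the emoji.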

tox-p commented 1 year ago

Hmm, sounds good to me, and it would probably produce more useful outputs. I think the performance overhead is manageable, since this case will probably not happen often, and one has to opt in to that behaviour by using the maxTokens variant. It also offers a benefit over plain encode(x).subList(0, maxTokens), since the library handles this edge case.

But maybe the signature of encode with maxTokens should change and return an EncodingResult containing the list of tokens and a truncated boolean.

With the previous algorithm, one could make an educated guess that the input was truncated when the token count matched maxTokens (at least with the change I commented on in the PR).

With this new approach, there is no longer any convenient way to check whether the input was truncated, except by manually decoding and comparing against the original input.
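The result type being proposed could look like the sketch below: a small value class pairing the tokens with a flag recording whether truncation happened. The class and method names are illustrative, not necessarily the final API:

```java
import java.util.Collections;
import java.util.List;

/**
 * Sketch of a result type pairing the encoded tokens with a flag that
 * records whether the input had to be truncated to fit maxTokens.
 * Names are illustrative, not necessarily JTokkit's final API.
 */
public final class EncodingResult {
    private final List<Integer> tokens;
    private final boolean truncated;

    public EncodingResult(List<Integer> tokens, boolean truncated) {
        this.tokens = Collections.unmodifiableList(tokens);
        this.truncated = truncated;
    }

    public List<Integer> getTokens() {
        return tokens;
    }

    /** True if tokens were dropped to respect the maxTokens limit. */
    public boolean isTruncated() {
        return truncated;
    }
}
```

A caller could then branch on `result.isTruncated()` directly, instead of re-decoding the tokens and comparing against the original input.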

radoslavdodek commented 1 year ago

@tox-p: I'm working on changing the signature of encode with maxTokens as you proposed.

radoslavdodek commented 1 year ago

@tox-p: Please review it again when you have time.

Regarding the tests: I've added one more column to the CSV files, outputMaxTokens10, which contains the list of tokens expected when encoding the input with maxTokens = 10.

tox-p commented 1 year ago

Looks great! I'm gonna merge it and prepare the next release containing your addition 🙂 Thank you for raising this feature request and contributing it yourself 😊

radoslavdodek commented 1 year ago

Thank you @tox-p ! :)