knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License

Add 'maxTokens' parameter to the 'encode' and 'encodeOrdinary' methods #12

Closed radoslavdodek closed 1 year ago

radoslavdodek commented 1 year ago

Hello Philip. Thank you for all the work you have put into this library. It is very useful indeed!

May I propose to add maxTokens parameter to encode methods?

Motivation: Quite often, we must ensure that we don't exceed a certain number of tokens when encoding some text. That parameter could help in such cases.

Please let me know your thoughts.

Have a great day, Rado.

tox-p commented 1 year ago

Hey,

can you give me some more details about your use-case?

Currently, I fail to see why you could not simply do

final List<Integer> tokens = enc.encode(input);
final List<Integer> truncatedTokens = tokens.subList(0, Math.min(maxTokens, tokens.size()));

Sure, that wastes some CPU cycles, but if you are not trying to encode an enormous text, it should basically be negligible.

Currently, I am slightly opposed to this addition, for the following reasons:

radoslavdodek commented 1 year ago

Thank you for your prompt reply! :)

But all that said, I understand your reasoning.

tox-p commented 1 year ago

Hmm, interesting use-case 🙂 I guess there is little harm in also supporting this, as long as the caveat that it breaks the encode-decode round trip is well documented.

Also, I just double-checked the OpenAI documentation, and at least for the non-chat completion endpoint, an array of tokens is also a valid input. That would make this method quite handy for use-cases where lossy input is acceptable, since no decoding of the tokens would be necessary before prompting the model.

I'll take a look at your PR 🙂

radoslavdodek commented 1 year ago

Thanks, Philip! :-)

radoslavdodek commented 1 year ago

@tox-p I was thinking about how to avoid issues with the round-tripping between encoding and decoding. However, I'm not sure if you want to have special handling for such a case.

Let's say we have the scenario you described in the comment you linked above:

What we could do is the following:

So in the example above, the returned tokens will be:

[40, 3021]

Doing so would avoid the issue with multibyte characters. It will, of course, have an impact on performance, but only when the maxTokens parameter is provided; your original use case will remain the same performance-wise.

What do you think?
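The idea above can be sketched without the library: after truncating the token list and decoding it to bytes, back off any trailing bytes that form an incomplete UTF-8 sequence, so the result always decodes to valid text. This is a self-contained illustration of the multibyte-boundary check, not JTokkit's actual implementation; all names here are illustrative:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Truncation {

    /**
     * Drops any trailing bytes that form an incomplete UTF-8 sequence,
     * so the result always decodes to valid text. This mirrors the idea
     * of discarding a final token that would split a multibyte character.
     */
    static byte[] trimIncompleteUtf8(byte[] bytes) {
        int end = bytes.length;
        // Walk back over trailing continuation bytes (10xxxxxx) to the lead byte.
        int i = end - 1;
        while (i >= 0 && (bytes[i] & 0xC0) == 0x80) {
            i--;
        }
        if (i < 0) {
            return new byte[0]; // nothing but continuation bytes
        }
        int lead = bytes[i] & 0xFF;
        int expected;
        if (lead < 0x80) expected = 1;             // ASCII
        else if ((lead & 0xE0) == 0xC0) expected = 2; // 110xxxxx: 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) expected = 3; // 1110xxxx: 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) expected = 4; // 11110xxx: 4-byte sequence
        else expected = 1;                         // invalid lead byte, leave as-is
        if (end - i < expected) {
            end = i; // final sequence is incomplete, drop it entirely
        }
        byte[] out = new byte[end];
        System.arraycopy(bytes, 0, out, 0, end);
        return out;
    }

    public static void main(String[] args) {
        byte[] full = "I love 🍕".getBytes(StandardCharsets.UTF_8);
        // Simulate a truncation that cuts through the 4-byte emoji:
        byte[] cut = new byte[full.length - 2];
        System.arraycopy(full, 0, cut, 0, cut.length);
        byte[] safe = trimIncompleteUtf8(cut);
        System.out.println(new String(safe, StandardCharsets.UTF_8)); // prints "I love " (emoji dropped)
    }
}
```

The same principle applied at the token level would decode the candidate last token and drop it if its bytes leave a sequence unfinished, which is why the example above returns [40, 3021] rather than a third token that splits the emoji.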

tox-p commented 1 year ago

Hmm, sounds good to me, and it would probably produce more useful outputs. I think the performance overhead is manageable, since this case will probably not happen often, and one has to opt in to that behaviour by using the maxTokens variant. It also offers a benefit over plain encode(x).subList(0, maxTokens), since the library handles this edge case.

But maybe the signature of encode with maxTokens should change and return an EncodingResult containing the list of tokens and a truncated boolean.

With the previous algorithm, one could make an educated guess that the input was truncated when the token count matched maxTokens (at least with the change I commented on in the PR).

With this new approach, there is no longer any convenient way to check whether the input was truncated, except by manually decoding and comparing against the original input.
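The result type being proposed could look like the sketch below: a small value class pairing the tokens with a flag recording whether truncation happened. The class and method names are illustrative, not necessarily the final API:

```java
import java.util.Collections;
import java.util.List;

/**
 * Sketch of a result type pairing the encoded tokens with a flag that
 * records whether the input had to be truncated to fit maxTokens.
 * Names are illustrative, not necessarily JTokkit's final API.
 */
public final class EncodingResult {
    private final List<Integer> tokens;
    private final boolean truncated;

    public EncodingResult(List<Integer> tokens, boolean truncated) {
        this.tokens = Collections.unmodifiableList(tokens);
        this.truncated = truncated;
    }

    public List<Integer> getTokens() {
        return tokens;
    }

    /** True if tokens were dropped to respect the maxTokens limit. */
    public boolean isTruncated() {
        return truncated;
    }
}
```

A caller could then branch on `result.isTruncated()` directly, instead of re-decoding the tokens and comparing against the original input.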

radoslavdodek commented 1 year ago

@tox-p: I'm working on changing the signature of encode with maxTokens as you proposed.

radoslavdodek commented 1 year ago

@tox-p: Please review it again when you have time.

Regarding the tests: I've added one more column to the CSV files, outputMaxTokens10, which contains the list of tokens expected when encoding the input with maxTokens = 10.

tox-p commented 1 year ago

Looks great! I'm gonna merge it and prepare the next release containing your addition 🙂 Thank you for raising this feature request and contributing it yourself 😊

radoslavdodek commented 1 year ago

Thank you @tox-p ! :)