Closed radoslavdodek closed 1 year ago
Hey,
can you give me some more details about your use-case?
Currently, I fail to see why you could not simply do
final List<Integer> truncatedTokens = enc.encode(input).subList(0, maxTokens);
Sure, that wastes some CPU cycles, but if you are not trying to encode an enormous text, it should basically be negligible.
Currently, I am slightly opposed to this addition, because of the following reasons:
Thank you for your prompt reply! :)
I thought about solving it by simply truncating the tokens list outside of the library, as you suggested. But I would like to avoid encoding tokens which will be thrown away anyway.
"breaks round-tripping between encoding and decoding"
This is definitely a valid argument.
Chunking doesn't play well for our use case. We ask ChatGPT to generate a valid JSON for our input, and with chunking we noticed that it is more likely that the resulting JSON won't be valid. We would much rather truncate the input text instead.
But all that said, I understand your reasoning.
Hmm, interesting use-case 🙂 I guess there is little harm in also supporting this, as long as the caveat that the encode-decode round trip can break is well documented
Also, I just double-checked the OpenAI documentation and at least for the non-chat completion endpoint, an array of tokens is also a valid input. Which would leave this method quite handy for use-cases where lossy input is acceptable, since no decoding of the tokens would be necessary to prompt the model
I'll take a look at your PR 🙂
Thanks, Philip! :-)
@tox-p I was thinking about how to avoid issues with the round-tripping between encoding and decoding. However, I'm not sure if you want to have special handling for such a case.
Let's say we have the scenario you described in the comment linked above:
User calls:
List<Integer> encoded = enc.encode("I love 🍕"); // "I love \uD83C\uDF55"
encoded = [40, 3021, 11410, 235, 243]
If we try to decode it back, we would get the original text, which is fine
But if he calls:
List<Integer> encoded = enc.encode("I love 🍕", 4); // "I love \uD83C\uDF55"
encoded = [40, 3021, 11410, 235]
When he decodes it back, the last character would be corrupted:
"I love �"
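The corruption can be reproduced without the tokenizer at all, since BPE tokens are ultimately byte sequences: cutting a UTF-8 byte stream in the middle of a multibyte character yields the replacement character on decode. A minimal stand-alone sketch in plain Java (no JTokkit dependency, just to illustrate the failure mode):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TruncationDemo {
    public static void main(String[] args) {
        // "I love 🍕" – the pizza emoji takes 4 bytes in UTF-8
        String input = "I love \uD83C\uDF55";
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);

        // Cut off the last byte, mimicking a token-level truncation
        // that splits the emoji's byte sequence
        byte[] truncated = Arrays.copyOf(bytes, bytes.length - 1);

        // The incomplete sequence decodes to the U+FFFD replacement character
        String decoded = new String(truncated, StandardCharsets.UTF_8);
        System.out.println(decoded);
    }
}
```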
What we could do is the following:
- If the maxTokens parameter is not supplied, the implementation can remain as is
- If maxTokens is provided, we will:
  - encode the input and keep at most maxTokens tokens
  - decode the kept tokens; if the decoded string does not match the head of the input string, decrease maxTokens and repeat the procedure until our decoded string matches the head of the input string. We would return the tokens list.

So in the example above, the returned tokens will be:
[40, 3021]
Doing so would avoid the issue with multibyte characters. Of course, it will have an impact on performance, but only for the use case where the maxTokens parameter is provided. Your original use case will remain the same (performance-wise).
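The proposed drop-and-retry loop can be sketched as follows. This is an illustrative stand-alone version, not the library's implementation: it uses single-byte chunks as stand-in "tokens" (real BPE tokens are also byte sequences, just longer), and the names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TruncateToValidPrefix {

    // Drop trailing "tokens" until the decoded string is a prefix of the
    // original input, so no multibyte character is cut in half.
    static List<byte[]> truncate(String input, List<byte[]> tokens, int maxTokens) {
        int end = Math.min(maxTokens, tokens.size());
        while (end > 0) {
            String decoded = new String(join(tokens.subList(0, end)), StandardCharsets.UTF_8);
            if (input.startsWith(decoded)) {
                return tokens.subList(0, end);
            }
            end--; // decoded text is corrupted – drop one more token and retry
        }
        return List.of();
    }

    // Concatenate the byte chunks into one array
    static byte[] join(List<byte[]> chunks) {
        int len = chunks.stream().mapToInt(c -> c.length).sum();
        byte[] out = new byte[len];
        int pos = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, out, pos, c.length);
            pos += c.length;
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "I love \uD83C\uDF55"; // "I love 🍕", 11 bytes in UTF-8
        List<byte[]> tokens = new ArrayList<>();
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            tokens.add(new byte[] { b }); // pretend each byte is one token
        }
        // Cutting at 10 "tokens" would split the emoji; the loop backs off
        // until only the intact "I love " prefix remains
        List<byte[]> kept = truncate(input, tokens, 10);
        System.out.println(new String(join(kept), StandardCharsets.UTF_8));
    }
}
```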
What do you think?
Hmm, sounds good to me and would probably produce outputs that are more useful. I think the performance overhead is manageable, since this case will probably not happen that often, and one has to opt in to that behaviour by using the maxTokens variant. This would also offer an additional benefit over just using encode(x).subList(0, maxTokens), since the library handles this edge case
But maybe the signature of encode with maxTokens should change and return an EncodingResult containing the list of tokens and a truncated boolean.
With the algorithm before, one could make an educated guess that the input was truncated when the token count matched maxTokens (at least with the change I commented on in the PR)
With this new approach, there is no longer any convenient way to check whether it was truncated, except by manually decoding and comparing against the original input
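A sketch of such a result type is below. The shape (a token list plus a truncated flag) follows the suggestion above, but the names and structure here are illustrative, not the library's actual API:

```java
import java.util.List;

public class EncodingResultSketch {

    // Carries the (possibly shortened) token list together with a flag
    // telling the caller whether truncation actually happened, so no
    // decode-and-compare round trip is needed to find out.
    record EncodingResult(List<Integer> tokens, boolean truncated) {}

    public static void main(String[] args) {
        // Token values taken from the "I love 🍕" example above
        EncodingResult full = new EncodingResult(List.of(40, 3021, 11410, 235, 243), false);
        EncodingResult cut  = new EncodingResult(List.of(40, 3021), true);

        System.out.println(full.truncated()); // false
        System.out.println(cut.truncated());  // true
    }
}
```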
@tox-p: I'm working on changing the signature of encode with maxTokens as you proposed.
@tox-p: Please review it again when you have time.
Regarding the tests: I've added one more column to the CSV files, outputMaxTokens10, which contains the list of tokens expected when encoding the input with maxTokens = 10.
Looks great! I'm gonna merge it and prepare the next release containing your addition 🙂 thank you for raising this feature request and contributing it yourself 😊
Thank you @tox-p ! :)
Hello Philip. Thank you for all the work you have put into this library. It is very useful indeed!
May I propose to add a maxTokens parameter to the encode methods?
Motivation: Quite often, we must ensure that we don't exceed a certain number of tokens when encoding some text. That parameter could help in such cases.
Please let me know your thoughts.
Have a great day, Rado.