knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License
518 stars, 38 forks

Is it possible to get a sliced string within a token limit? #3

Closed jiangying000 closed 1 year ago

jiangying000 commented 1 year ago

Say I have a very long string s and a limited number of tokens n.

I want to get the longest possible substring of the original string, starting from index 0, whose token count does not exceed the specified limit.

Let's say s = "hello world, great to see you" and n = 2.

Then I would probably get "hello world".

tox-p commented 1 year ago

Sure:

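import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import java.util.List;

final EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
final Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);

final int n = 2;
final String s = "hello world, great to see you!";

// Encode the full text, keep only the first n tokens, then decode them again.
final List<Integer> encoded = enc.encode(s);
final List<Integer> truncated = encoded.subList(0, n);
final String decoded = enc.decode(truncated);
System.out.println(decoded);
// prints: hello world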

Note that, depending on your input text, the decoded text can contain non-printable or invalid characters. This can happen when a multi-byte Unicode character (e.g. an emoji) maps to multiple tokens and the truncation falls between those tokens. For example, for s = "I love 🍕" and n = 3, the tokens [40, 3021, 11410, 235, 243] are truncated to [40, 3021, 11410], where 40 corresponds to "I", 3021 corresponds to " love", and 11410 corresponds to a space followed by the first two bytes of the 4-byte UTF-8 encoding of 🍕.

Edit: Here is a visual explanation with a different encoding, which would hit the same edge case if truncated after 3 tokens: [image]
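One possible way to work around the truncation edge case is to drop trailing tokens until the decoded text no longer ends in the Unicode replacement character (U+FFFD). The following is only a sketch under stated assumptions: truncateCleanly, toyEncode and toyDecode are hypothetical names and not part of JTokkit, and the toy tokenizer (one token per UTF-8 byte) merely stands in for enc.encode and enc.decode so that the example runs with the standard library alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class TruncationSketch {

    // Hypothetical helper, not part of JTokkit: keep at most n tokens, then drop
    // trailing tokens while the decoded text ends in U+FFFD, the replacement
    // character produced when an incomplete UTF-8 sequence is decoded.
    // Caveat: a text that legitimately ends in U+FFFD would be shortened too.
    static <T> List<T> truncateCleanly(List<T> tokens, int n,
                                       Function<List<T>, String> decode) {
        int end = Math.max(0, Math.min(n, tokens.size()));
        while (end > 0 && decode.apply(tokens.subList(0, end)).endsWith("\uFFFD")) {
            end--;
        }
        return new ArrayList<>(tokens.subList(0, end));
    }

    // Toy stand-in for an encoder: one token per UTF-8 byte.
    static List<Integer> toyEncode(String text) {
        List<Integer> tokens = new ArrayList<>();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            tokens.add(b & 0xFF);
        }
        return tokens;
    }

    // Toy stand-in for a decoder: turn the byte-valued tokens back into a string.
    static String toyDecode(List<Integer> tokens) {
        byte[] bytes = new byte[tokens.size()];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = tokens.get(i).byteValue();
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\uD83C\uDF55" is the pizza emoji: 4 bytes in UTF-8, after 7 ASCII bytes.
        List<Integer> tokens = toyEncode("I love \uD83C\uDF55");

        // Cutting after 9 tokens splits the emoji in the middle.
        String naive = toyDecode(tokens.subList(0, 9));
        String clean = toyDecode(truncateCleanly(tokens, 9, TruncationSketch::toyDecode));

        System.out.println(naive.endsWith("\uFFFD")); // true
        System.out.println("[" + clean + "]");        // [I love ]
    }
}
```

With the List<Integer>-based API used in the snippet above, the same helper could in principle be called as truncateCleanly(encoded, n, enc::decode). The trade-off is that the result may contain fewer than n tokens.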

jiangying000 commented 1 year ago

This is excellent. Thank you!