Closed jiangying000 closed 1 year ago
Sure:
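import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;
import java.util.List;

final EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
final Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
final int n = 2; // token budget
final String s = "hello world, great to see you!";
final List<Integer> encoded = enc.encode(s);
// Clamp to the list size so a budget larger than the input does not throw.
final List<Integer> truncated = encoded.subList(0, Math.min(n, encoded.size()));
final String decoded = enc.decode(truncated);
System.out.println(decoded);
// prints: hello world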
Note that, depending on your input text, the decoded text can contain non-printable characters. This can happen when a multi-byte Unicode character (for example, an emoji) that maps to multiple tokens is cut in the middle by the truncation. For example, for s = "I love 🍕" and n = 3, the tokens [40, 3021, 11410, 235, 243] are truncated to [40, 3021, 11410], where 40 corresponds to "I", 3021 corresponds to " love", and 11410 corresponds to a space followed by the first two bytes of the 4-byte UTF-8 encoding of 🍕. The remaining two bytes live in tokens 235 and 243, so the decoded string ends in an incomplete character.
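The effect can be reproduced at the byte level without any tokenizer. The sketch below is only an illustration (the class and method names are made up here and are not part of jtokkit): it cuts the UTF-8 bytes of "I love 🍕" in the middle of the emoji and decodes the prefix. Java's `new String(bytes, offset, length, charset)` replaces the incomplete sequence with the replacement character U+FFFD.

```java
import java.nio.charset.StandardCharsets;

class Utf8TruncationDemo {
    // Decodes only the first byteCount bytes of the UTF-8 encoding of text.
    // Malformed (incomplete) sequences are replaced with U+FFFD by the JDK.
    static String decodePrefix(String text, int byteCount) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, 0, byteCount, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "I love \uD83C\uDF55"; // "I love 🍕"
        int total = s.getBytes(StandardCharsets.UTF_8).length; // 7 ASCII bytes + 4 emoji bytes
        String cut = decodePrefix(s, total - 2); // keep only half of the emoji's bytes
        System.out.println(cut.contains("\uFFFD")); // true when the cut splits a character
    }
}
```

Checking the decoded text for U+FFFD is therefore one way to detect that a truncation landed inside a character.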
Edit: Here is a visual explanation with a different encoding; it would result in the same edge case if truncated after 3 tokens:
This is excellent. Thank you!
Say I have a very long string s and a limited number of tokens n.
I want to get the longest possible substring of the original string, starting at index 0, whose token count does not exceed the specified limit.
Let's say s = "hello world, great to see you" and n = 2; then I would probably get "hello world".
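A minimal sketch of one way to do this while also avoiding a cut in the middle of a multi-byte character. The helper name `longestCleanPrefix` and the `decode` parameter are hypothetical; `decode` stands in for the tokenizer's decode function and is assumed to turn malformed bytes into U+FFFD, as Java's `new String(bytes, StandardCharsets.UTF_8)` does. The helper shrinks the token prefix until the decoded text no longer ends in U+FFFD.

```java
import java.util.List;
import java.util.function.Function;

class TokenTruncation {
    // Hypothetical helper: returns the decoded text of the longest token prefix
    // with at most maxTokens tokens that does not end in a broken character.
    static String longestCleanPrefix(List<Integer> tokens, int maxTokens,
                                     Function<List<Integer>, String> decode) {
        int end = Math.min(maxTokens, tokens.size());
        while (end > 0) {
            String text = decode.apply(tokens.subList(0, end));
            // U+FFFD at the end suggests the cut split a multi-byte character.
            if (!text.endsWith("\uFFFD")) {
                return text;
            }
            end--; // drop one more token and try again
        }
        return "";
    }
}
```

Assuming `encode` returns a `List<Integer>` and `decode` accepts one, this could be called as `TokenTruncation.longestCleanPrefix(enc.encode(s), n, enc::decode)`. One limitation: input that legitimately ends in U+FFFD would be shortened more than necessary.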