knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License

Return position to EncodingResult #80

Closed dimafa closed 2 months ago

dimafa commented 8 months ago

To help with string chunking, it would be very helpful to include the string position of the last token in the EncodingResult when encoding is requested with a given maxTokens. That way one could efficiently use the library to split a string into chunks that each contain a given number of tokens.
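
To illustrate the request, here is a minimal sketch of how a chunker might use such a position. It does not call JTokkit: a toy whitespace tokenizer stands in for a real encoding, and the names `ChunkByTokens`, `LimitedResult`, `encodeWithLimit`, and `chunk` are hypothetical, not part of the library's API.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkByTokens {
    /** Hypothetical result: tokens consumed and the index just past the last token. */
    static final class LimitedResult {
        final int tokenCount;
        final int endPosition;

        LimitedResult(int tokenCount, int endPosition) {
            this.tokenCount = tokenCount;
            this.endPosition = endPosition;
        }
    }

    /** Toy stand-in for an encoder: one token is a word plus the spaces before it. */
    static LimitedResult encodeWithLimit(String text, int start, int maxTokens) {
        int pos = start;
        int count = 0;
        while (pos < text.length() && count < maxTokens) {
            // Leading spaces belong to the upcoming token
            while (pos < text.length() && text.charAt(pos) == ' ') pos++;
            // Consume the word itself
            while (pos < text.length() && text.charAt(pos) != ' ') pos++;
            count++;
        }
        return new LimitedResult(count, pos);
    }

    /** Splits text into chunks of at most maxTokens tokens using the returned position. */
    static List<String> chunk(String text, int maxTokens) {
        if (maxTokens < 1) {
            throw new IllegalArgumentException("maxTokens must be at least 1");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            LimitedResult result = encodeWithLimit(text, start, maxTokens);
            chunks.add(text.substring(start, result.endPosition));
            start = result.endPosition; // continue right after the last encoded token
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("the quick brown fox jumps", 2));
    }
}
```

The point of the sketch is that each chunk boundary comes straight from the encoder's result, so the text never has to be decoded again to find where a chunk ends.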

wertycn commented 6 months ago

I also need this feature, as I need to return the sequence of each token for some business scenarios, such as token splitting.

tox-p commented 6 months ago

Hmm, I forgot that I wanted to get back to this issue after the performance optimizations are merged, sorry about that!

I'm not opposed to adding this functionality. If you, @dimafa (or anyone who wants to pick this issue up), would kindly adapt the PR to the new code structure, I would gladly merge it.

wertycn commented 6 months ago

After obtaining the encoded result, I add the tokens one at a time to a buffer, decode the buffer, and check whether the decoded text matches the input at the current position. This gives me the token groups I want, along with the position of each group in the input. My implementation is below for reference.

// Token is assumed to be a user-defined holder for the decoded text and its token ids
List<Integer> encoded = encoding.encode(input).boxed();
List<Token> result = new ArrayList<>();

// Start index in input of the token group currently being collected
int bufferPoint = 0;
IntArrayList tokenCollect = new IntArrayList();

for (int i = 0; i < encoded.size(); i++) {
    // Add the next token and decode everything collected so far
    tokenCollect.add(encoded.get(i));
    String decodeResult = encoding.decode(tokenCollect);
    // If the decoded text does not match the input at bufferPoint, the collected tokens
    // do not yet decode to complete text, so more tokens are needed.
    // startsWith returns false instead of throwing when the range runs past the end.
    if (!input.startsWith(decodeResult, bufferPoint)) {
        continue;
    }
    // Match: the group covers input positions [bufferPoint, bufferPoint + decodeResult.length())
    result.add(new Token(decodeResult, tokenCollect.boxed()));
    // Move the pointer past the matched text and start a new group
    bufferPoint += decodeResult.length();
    tokenCollect.clear();
}
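
For anyone who wants to try the idea without the library on the classpath, the same decode-and-compare loop can be sketched in a self-contained form. This is an illustration under assumptions, not JTokkit code: tokens are modelled as raw UTF-8 byte arrays, and `TokenSpans`, `Span`, `decode`, and `spans` are hypothetical names. The second token below deliberately starts in the middle of a two-byte character, so the first decode attempt does not match the input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TokenSpans {
    /** Hypothetical holder: decoded text and its [start, end) position in the input. */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Toy decoder: concatenates the tokens' bytes and decodes them as UTF-8. */
    static String decode(List<byte[]> tokens) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] token : tokens) {
            out.write(token, 0, token.length);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Groups tokens until their decoded text matches the input, recording positions. */
    static List<Span> spans(String input, List<byte[]> tokens) {
        List<Span> result = new ArrayList<>();
        List<byte[]> pending = new ArrayList<>();
        int position = 0;
        for (byte[] token : tokens) {
            pending.add(token);
            String decoded = decode(pending);
            // An incomplete UTF-8 sequence decodes to U+FFFD, which does not match the input
            if (!input.startsWith(decoded, position)) {
                continue;
            }
            result.add(new Span(decoded, position, position + decoded.length()));
            position += decoded.length();
            pending.clear();
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "h\u00e9llo";
        List<byte[]> tokens = Arrays.asList(
                new byte[] {0x68, (byte) 0xC3},   // "h" plus the first byte of U+00E9
                new byte[] {(byte) 0xA9, 0x6C},   // second byte of U+00E9 plus "l"
                new byte[] {0x6C, 0x6F});         // "lo"
        for (Span span : spans(input, tokens)) {
            System.out.println(span.text + " [" + span.start + ", " + span.end + ")");
        }
    }
}
```

Positions here are counted in Java `char` units (UTF-16), the same unit `String.length()` uses. The loop also re-decodes the pending group on every step, which is fine for a sketch but would repeat work if long runs of tokens failed to match.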

imsosleepy commented 4 months ago

I've created a PR for this issue, please check it out: #97

Plexcalibur commented 2 months ago

This has been released with version 1.1.0.