Closed dimafa closed 2 months ago
Hmm, I forgot that I wanted to get back to this issue after the performance optimizations were merged, sorry about that!
I'm not opposed to adding this functionality. If you, @dimafa (or anyone else who wants to pick this issue up), would kindly adapt the PR to the new code structure, I would gladly merge it.
I also need this feature, as I need to return the sequence of each token for some business scenarios, such as token splitting.
After obtaining the encoded result, I add tokens one at a time, decode the tokens collected so far, and check whether the decoded text matches the input at the current position. This gives me the token sequence for each piece of text, together with the position of that text in the input. My implementation is below for reference.
// Token is assumed to be a user-defined holder class, not a library type
List<Integer> encoded = encoding.encode(input).boxed();
List<Token> result = new ArrayList<>();
// Start offset in input of the text that has not been matched yet
int bufferPoint = 0;
IntArrayList tokenCollect = new IntArrayList();
for (int i = 0; i < encoded.size(); i++) {
    // Add the next token and decode everything collected so far
    tokenCollect.add(encoded.get(i));
    String decodeResult = encoding.decode(tokenCollect);
    // If the decoded text does not match the input at the pointer, the collected tokens
    // end in the middle of a character, so more tokens are needed.
    // startsWith avoids the out-of-bounds error that substring can throw near the end of the input.
    if (!input.startsWith(decodeResult, bufferPoint)) {
        continue;
    }
    // Match successful: the tokens cover [bufferPoint, bufferPoint + decodeResult.length())
    result.add(new Token(decodeResult, tokenCollect.boxed()));
    // Move the pointer past the matched text and start a new group
    bufferPoint += decodeResult.length();
    tokenCollect.clear();
}
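For anyone who wants to try the matching idea without the library, here is a minimal, self-contained sketch. It is not the library's API: `TokenSpanDemo`, `Span`, `decode`, and `align` are hypothetical names, and the "tokens" are plain byte fragments of the UTF-8 input that stand in for real token ids. It shows how a token that ends in the middle of a multi-byte character fails the match and is grouped with the following token.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class TokenSpanDemo {
    /** Hypothetical holder: tokens firstToken..lastToken cover input[start, end). */
    static final class Span {
        final int firstToken, lastToken, start, end;
        Span(int firstToken, int lastToken, int start, int end) {
            this.firstToken = firstToken;
            this.lastToken = lastToken;
            this.start = start;
            this.end = end;
        }
    }

    /** Stand-in for a tokenizer's decode: concatenates the token bytes and decodes them as UTF-8. */
    static String decode(List<byte[]> tokens) {
        int size = 0;
        for (byte[] t : tokens) size += t.length;
        byte[] all = new byte[size];
        int pos = 0;
        for (byte[] t : tokens) {
            System.arraycopy(t, 0, all, pos, t.length);
            pos += t.length;
        }
        // Malformed or incomplete byte sequences are replaced with U+FFFD
        return new String(all, StandardCharsets.UTF_8);
    }

    /** Groups tokens until their decoded text matches the input at the current offset. */
    static List<Span> align(String input, List<byte[]> tokens) {
        List<Span> spans = new ArrayList<>();
        List<byte[]> pending = new ArrayList<>();
        int offset = 0;
        int first = 0;
        for (int i = 0; i < tokens.size(); i++) {
            pending.add(tokens.get(i));
            String text = decode(pending);
            if (!input.startsWith(text, offset)) {
                continue; // incomplete character, keep collecting
            }
            spans.add(new Span(first, i, offset, offset + text.length()));
            offset += text.length();
            pending.clear();
            first = i + 1;
        }
        return spans;
    }

    public static void main(String[] args) {
        String input = "hi \u00e9!";
        byte[] e = "\u00e9".getBytes(StandardCharsets.UTF_8); // two bytes
        List<byte[]> tokens = List.of(
                "hi".getBytes(StandardCharsets.UTF_8),
                " ".getBytes(StandardCharsets.UTF_8),
                new byte[] {e[0]},
                new byte[] {e[1]},
                "!".getBytes(StandardCharsets.UTF_8));
        for (Span s : align(input, tokens)) {
            System.out.println("tokens " + s.firstToken + ".." + s.lastToken
                    + " -> [" + s.start + ", " + s.end + ")");
        }
    }
}
```

One limitation of this approach: if the input itself contains U+FFFD at a position where an incomplete token also decodes to U+FFFD, the match can succeed too early.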
I've created a PR for this issue, please check it out: #97
This has been released in version 1.1.0.
To help with string chunking, it would be very helpful to include the string position of the last token in the EncodingResult when encoding is requested with a given maxTokens. This way, one could efficiently use the library to chunk a string based on a given number of tokens per chunk.
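To illustrate the chunking use case, here is a rough sketch of the loop this would enable. It does not use the library: `ChunkByTokensDemo`, `lastTokenEnd`, and `chunk` are hypothetical names, and a toy whitespace-based tokenizer stands in for a real encoding. The only thing it assumes is what the request asks for, namely that encoding with a token limit reports where in the string the last token ends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkByTokensDemo {
    // Toy tokenizer: a token is optional whitespace plus a run of non-whitespace,
    // or a run of whitespace on its own. Every character belongs to exactly one token.
    private static final Pattern TOKEN = Pattern.compile("\\s*\\S+|\\s+");

    /**
     * Hypothetical stand-in for "encode with maxTokens and report the last token position":
     * returns the end offset of the last token when at most maxTokens tokens are read from start.
     */
    static int lastTokenEnd(String text, int start, int maxTokens) {
        Matcher m = TOKEN.matcher(text);
        m.region(start, text.length());
        int end = start;
        int count = 0;
        while (count < maxTokens && m.find()) {
            end = m.end();
            count++;
        }
        return end;
    }

    /** Splits text into consecutive chunks of at most maxTokens toy tokens each. */
    static List<String> chunk(String text, int maxTokens) {
        if (maxTokens <= 0) {
            throw new IllegalArgumentException("maxTokens must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int offset = 0;
        while (offset < text.length()) {
            // Each call only looks at the text from offset onward, so nothing is re-encoded
            int end = lastTokenEnd(text, offset, maxTokens);
            chunks.add(text.substring(offset, end));
            offset = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (String c : chunk("one two three four five", 2)) {
            System.out.println("[" + c + "]");
        }
    }
}
```

Because each chunk starts exactly where the previous one ended, concatenating the chunks gives back the original string.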