knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License

Implement o200k_base encoding #96

Closed nayavu closed 2 months ago

nayavu commented 4 months ago

It was just added to tiktoken for the new gpt-4o model: https://github.com/openai/tiktoken/commit/9d01e5670ff50eb74cdb96406c7f3d9add0ae2f8

8796123 commented 3 months ago

I got the GPT-4o encoding working by registering it as a custom encoding. Here's how:

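// Load the mergeable ranks (token bytes -> rank) from the downloaded file.
Map<byte[], Integer> encoder = loadFromFile("/jtokkit/o200k_base.tiktoken");
// Special tokens and their ids for o200k_base.
Map<String, Integer> specialTokensEncoder = new HashMap<>();
specialTokensEncoder.put("<|endoftext|>", 199999);
specialTokensEncoder.put("<|endofprompt|>", 200018);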

GptBytePairEncodingParams params = new GptBytePairEncodingParams(
        "o200k_base",
        Pattern.compile("[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"),
        encoder,
        specialTokensEncoder
);
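// Any EncodingRegistry works here; a default one is assumed.
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
// Register the custom encoding, then look it up by name.
registry.registerGptBytePairEncoding(params);
Encoding gpt4oEncoding = registry.getEncoding("o200k_base").get();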

Download o200k_base.tiktoken from: https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken

For loadFromFile(), use EncodingFactory.loadMergeableRanks() as a reference.
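
For anyone who wants a starting point, here is a minimal sketch of what such a loadFromFile() helper could look like. It is not the library's own code: the class name TiktokenLoader and the parse() helper are made up for illustration. It assumes the usual .tiktoken layout of one "<base64 token> <rank>" pair per line and that the file is on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of jtokkit.
final class TiktokenLoader {
    private TiktokenLoader() {}

    // Reads a .tiktoken file from the classpath, e.g. "/jtokkit/o200k_base.tiktoken".
    static Map<byte[], Integer> loadFromFile(String resourcePath) throws IOException {
        InputStream in = TiktokenLoader.class.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IOException("Resource not found: " + resourcePath);
        }
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return parse(reader);
        }
    }

    // Parses lines of the assumed form "<base64 token> <rank>"; blank lines are skipped.
    // byte[] keys compare by identity, so treat the result as a list of pairs
    // to hand over to the encoding params, not as a map for lookups by content.
    static Map<byte[], Integer> parse(BufferedReader reader) throws IOException {
        Map<byte[], Integer> ranks = new LinkedHashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            int sep = line.indexOf(' ');
            if (sep < 0) {
                throw new IOException("Malformed line: " + line);
            }
            byte[] token = Base64.getDecoder().decode(line.substring(0, sep));
            int rank = Integer.parseInt(line.substring(sep + 1).trim());
            ranks.put(token, rank);
        }
        return ranks;
    }
}
```

With that in place, the loadFromFile(...) call in the snippet above would become TiktokenLoader.loadFromFile("/jtokkit/o200k_base.tiktoken").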

stoerr commented 3 months ago

Yes, please please please! :-)

stefanos-kalantzis commented 2 months ago

When can this be released on Maven Central? :)

Plexcalibur commented 2 months ago

It is now released in version 1.1.0.