Closed: nayavu closed this 2 months ago
I added GPT-4o support (the o200k_base encoding) through custom registration. Here's how to do it:
Map<byte[], Integer> encoder = loadFromFile("/jtokkit/o200k_base.tiktoken");
Map<String, Integer> specialTokensEncoder = new HashMap<>();
specialTokensEncoder.put("<|endoftext|>", 199999);
specialTokensEncoder.put("<|endofprompt|>", 200018);
GptBytePairEncodingParams params = new GptBytePairEncodingParams(
"o200k_base",
Pattern.compile("[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"),
encoder,
specialTokensEncoder
);
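// registry is an existing EncodingRegistry, e.g. one obtained from Encodings.newDefaultEncodingRegistry()
registry.registerGptBytePairEncoding(params);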
Encoding gpt4oEncoding = registry.getEncoding("o200k_base").get();
Download o200k_base.tiktoken from this address: https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
For loadFromFile(), use EncodingFactory.loadMergeableRanks() as a reference.
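In case it helps, here is a minimal sketch of what loadFromFile() could look like. It assumes the usual .tiktoken layout, where each non-empty line holds a base64-encoded token, a space, and the token's integer rank. The class name TiktokenLoader and the parseRanks() helper are made up for this example and are not part of JTokkit; the library's own EncodingFactory.loadMergeableRanks() may differ in details.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of JTokkit.
final class TiktokenLoader {
    private TiktokenLoader() {}

    // Loads a .tiktoken file from the classpath, e.g. "/jtokkit/o200k_base.tiktoken".
    static Map<byte[], Integer> loadFromFile(String resourceName) {
        try (InputStream in = TiktokenLoader.class.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IllegalArgumentException("Resource not found: " + resourceName);
            }
            return parseRanks(new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses lines of the form "<base64 token bytes> <rank>"; empty lines are skipped.
    static Map<byte[], Integer> parseRanks(BufferedReader reader) {
        Map<byte[], Integer> ranks = new HashMap<>();
        reader.lines().forEach(line -> {
            if (line.isEmpty()) {
                return;
            }
            int sep = line.indexOf(' ');
            if (sep < 0) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            byte[] token = Base64.getDecoder().decode(line.substring(0, sep));
            int rank = Integer.parseInt(line.substring(sep + 1).trim());
            ranks.put(token, rank);
        });
        return ranks;
    }
}
```

Note that byte[] keys in a HashMap are compared by identity, so this map is only meant to be handed over to GptBytePairEncodingParams, not used for lookups by token bytes.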
Yes, please please please! :-)
When can this be released on Maven Central? :)
It is released now, in version 1.1.0.
It has just been added for the new gpt-4o model: https://github.com/openai/tiktoken/commit/9d01e5670ff50eb74cdb96406c7f3d9add0ae2f8