knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License

different token values between java and python #33

Closed VoidIsVoid closed 1 year ago

VoidIsVoid commented 1 year ago

I found a difference between the token values produced by the official Python library (tiktoken) and those produced by JTokkit.

In Java

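import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.ModelType;

final EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
final Encoding encodingForModel = registry.getEncodingForModel(ModelType.GPT_3_5_TURBO);
final String s1 = "\u3000\u3000"; // two ideographic spaces (U+3000)
System.out.println(encodingForModel.encode(s1));
// [44529]
final String s2 = "\u3000\u3000a"; // same, followed by a letter
System.out.println(encodingForModel.encode(s2));
// [44529, 64]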

But in Python

# coding=utf-8
import tiktoken
encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')
print(encoding.encode('\u3000\u3000'))
# [44529]
print(encoding.encode('\u3000\u3000a'))
# [23249, 23249, 64]

Please fix it.
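
The thread does not state the root cause, so the following is only one plausible explanation, not a confirmed diagnosis. Tokenizers of this kind first split the text with a regular expression and then run BPE on each piece. In Java, `\s` matches only ASCII whitespace unless `Pattern.UNICODE_CHARACTER_CLASS` is set, so U+3000 (ideographic space) is not treated as whitespace by default, whereas (as I understand it) the regex engine used on the Python side is Unicode-aware. The sketch below uses a simplified stand-in pattern (an assumption, not JTokkit's actual pattern) and a hypothetical class `UnicodeWhitespaceSplit` to show how the pre-split changes with that flag:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWhitespaceSplit {
    // Simplified stand-in for the letter/punctuation/whitespace alternatives
    // of a tiktoken-style split pattern. Assumption: NOT the real JTokkit pattern.
    static final String SPLIT =
            "[^\\r\\n\\p{L}\\p{N}]?\\p{L}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+";

    // Returns the pieces the pattern cuts the text into, in order.
    static List<String> split(String text, int flags) {
        Matcher m = Pattern.compile(SPLIT, flags).matcher(text);
        List<String> pieces = new ArrayList<>();
        while (m.find()) {
            pieces.add(m.group());
        }
        return pieces;
    }

    public static void main(String[] args) {
        String s = "\u3000\u3000a";

        // Default flags: U+3000 is not \s, so both spaces form one "punctuation" piece.
        System.out.println(split(s, 0).get(0).length()); // prints 2

        // Unicode-aware \s: the first space is split off and the second attaches to "a".
        System.out.println(split(s, Pattern.UNICODE_CHARACTER_CLASS).get(0).length()); // prints 1
    }
}
```

With default flags the pieces are "\u3000\u3000" and "a", which is consistent with the Java output [44529, 64]. With the Unicode flag they are "\u3000" and "\u3000a", which would be consistent with the Python output [23249, 23249, 64] if the second piece has no merged token of its own.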

tox-p commented 1 year ago

Thanks for the fix! I published a new release, 0.5.1, containing your fix.