knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License

Sync with tiktoken #87

Open paplorinc opened 4 months ago

paplorinc commented 4 months ago

After porting the optimizations that culminated in https://github.com/knuddelsgmbh/jtokkit/pull/77 back to tiktoken, this PR synchronizes jtokkit with the changes that came out of the reviews there.

The most important one is https://github.com/openai/tiktoken/issues/245, which revealed that the legacy encodings have a severe backtracking problem. Java's regex engine seems to handle it properly already, since adding the possessive quantifiers made it a tiny bit slower rather than faster, but at least the regex is now in sync with tiktoken. Note that cl100k is unaffected by the regex changes; only the tokenCount > 2 change applies to it.
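For readers unfamiliar with possessive quantifiers, here is a minimal sketch of the difference in Java. The patterns below are simplified, hypothetical stand-ins chosen for illustration and are not the actual jtokkit or tiktoken regexes; the class and method names are also made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PossessiveDemo {
    // Hypothetical, simplified patterns (NOT the real encoding regexes).
    // GREEDY may give characters back while backtracking; POSSESSIVE never does.
    public static final Pattern GREEDY = Pattern.compile(" ?\\p{L}+| ?\\p{N}+|\\s+");
    public static final Pattern POSSESSIVE = Pattern.compile(" ?\\p{L}++| ?\\p{N}++|\\s++");

    // Collects every successive match of the pattern in the text.
    public static List<String> split(Pattern pattern, String text) {
        List<String> pieces = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            pieces.add(matcher.group());
        }
        return pieces;
    }

    public static void main(String[] args) {
        String text = "hello world 123  end";
        // For this input both variants produce the same pieces.
        System.out.println(split(GREEDY, text));
        System.out.println(split(POSSESSIVE, text));

        // Greedy: a+ gives one 'a' back so the trailing 'a' can match.
        System.out.println(Pattern.matches("a+a", "aaa"));
        // Possessive: a++ keeps all three, so the trailing 'a' cannot match.
        System.out.println(Pattern.matches("a++a", "aaa"));
    }
}
```

The point of the possessive form is that the engine gives up on a failed alternative immediately instead of retrying shorter matches. That is what removes pathological backtracking in engines that suffer from it, provided the surrounding pattern never relies on characters being given back.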

Before:

Benchmark                                                      (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                                data    ss   10  2.268 ± 0.050   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount                      data    ss   10  2.075 ± 0.025   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCountOrdinary              data    ss   10  2.072 ± 0.028   s/op
SingleThreadedBenchmark.benchmarkP50kBase                                  data    ss   10  4.087 ± 0.023   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                                  data    ss   10  4.131 ± 0.093   s/op
SingleThreadedBenchmark.benchmarkR50kBase                                  data    ss   10  3.802 ± 0.025   s/op

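After - the legacy encodings (P50k, R50k) are roughly 7-9% slower for some reason, while cl100k is unchanged within the error margins, but at least the regex is synchronized with tiktoken: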

Benchmark                                                      (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                                data    ss   10  2.231 ± 0.015   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount                      data    ss   10  2.101 ± 0.029   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCountOrdinary              data    ss   10  2.055 ± 0.041   s/op
SingleThreadedBenchmark.benchmarkP50kBase                                  data    ss   10  4.440 ± 0.082   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                                  data    ss   10  4.451 ± 0.085   s/op
SingleThreadedBenchmark.benchmarkR50kBase                                  data    ss   10  4.086 ± 0.019   s/op

Made this a draft since the tiktoken PRs aren't finished yet, so I expect a few more changes here as well.