Of particular note is https://github.com/openai/tiktoken/issues/245, which revealed that the legacy encodings have a severe backtracking problem. Java's regex engine seems to handle it properly already, given that adding the possessive quantifiers made things slightly slower rather than faster, but at least the regex is now in sync with tiktoken. Note that cl100k is unaffected by the regex changes; only the `tokenCount > 2` change applies to it:
Before:

| Benchmark | (dataFolderPath) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| SingleThreadedBenchmark.benchmarkCl100kBase | data | ss | 10 | 2.268 | ± 0.050 | s/op |
| SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount | data | ss | 10 | 2.075 | ± 0.025 | s/op |
| SingleThreadedBenchmark.benchmarkCl100kBaseTokenCountOrdinary | data | ss | 10 | 2.072 | ± 0.028 | s/op |
| SingleThreadedBenchmark.benchmarkP50kBase | data | ss | 10 | 4.087 | ± 0.023 | s/op |
| SingleThreadedBenchmark.benchmarkP50kEdit | data | ss | 10 | 4.131 | ± 0.093 | s/op |
| SingleThreadedBenchmark.benchmarkR50kBase | data | ss | 10 | 3.802 | ± 0.025 | s/op |

After (the legacy encodings, i.e. the P50kBase, P50kEdit and R50kBase benchmarks, are roughly 7-9% slower for some reason, while the cl100k results stay within the error margins; at least the regex is synchronized with tiktoken):

| Benchmark | (dataFolderPath) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| SingleThreadedBenchmark.benchmarkCl100kBase | data | ss | 10 | 2.231 | ± 0.015 | s/op |
| SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount | data | ss | 10 | 2.101 | ± 0.029 | s/op |
| SingleThreadedBenchmark.benchmarkCl100kBaseTokenCountOrdinary | data | ss | 10 | 2.055 | ± 0.041 | s/op |
| SingleThreadedBenchmark.benchmarkP50kBase | data | ss | 10 | 4.440 | ± 0.082 | s/op |
| SingleThreadedBenchmark.benchmarkP50kEdit | data | ss | 10 | 4.451 | ± 0.085 | s/op |
| SingleThreadedBenchmark.benchmarkR50kBase | data | ss | 10 | 4.086 | ± 0.019 | s/op |

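
For anyone unfamiliar with possessive quantifiers, here is a minimal sketch of what they change in `java.util.regex`. The two patterns are simplified, hypothetical stand-ins and not the actual jtokkit or tiktoken regexes, and the class and method names are made up for illustration. The point is only that a possessive quantifier (`++`) never gives back characters it has consumed, which removes backtracking but can also change what matches if used carelessly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: names and patterns are assumptions, not taken from jtokkit.
class PossessiveDemo {
    // Simplified splitter with ordinary greedy quantifiers.
    static final Pattern GREEDY = Pattern.compile(" ?\\p{L}+| ?\\p{N}+|\\s+");
    // Same splitter with possessive quantifiers (no backtracking into the runs).
    static final Pattern POSSESSIVE = Pattern.compile(" ?\\p{L}++| ?\\p{N}++|\\s++");

    // Collects every successive match of the pattern in the text.
    static List<String> split(Pattern pattern, String text) {
        List<String> pieces = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            pieces.add(matcher.group());
        }
        return pieces;
    }

    // True if the regex matches the entire input.
    static boolean matchesFully(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String text = "hello world 123";
        // For these patterns both variants produce the same pieces.
        System.out.println(split(GREEDY, text));
        System.out.println(split(POSSESSIVE, text));

        // Greedy a+ gives back one 'a' so the trailing 'a' can match.
        System.out.println(matchesFully("a+a", "aaa"));  // true
        // Possessive a++ keeps all three, so the trailing 'a' cannot match.
        System.out.println(matchesFully("a++a", "aaa")); // false
    }
}
```

In this sketch the possessive variant is safe because each alternative is followed by nothing that could need the consumed characters back, so the split results are identical.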
Made this a draft since the tiktoken PRs aren't fully finished yet, so I expect to have a few more changes here as well.
After porting the optimizations ending in https://github.com/knuddelsgmbh/jtokkit/pull/77 back to tiktoken, this is a synchronization PR based on the reviews there.