knuddelsgmbh / jtokkit

JTokkit is a Java tokenizer library designed for use with OpenAI models.
https://jtokkit.knuddels.de/
MIT License

1) Optimize 50k & 100k splitter regular expressions - 10.5s to 8.9s #75

Closed l0rinc closed 9 months ago

l0rinc commented 9 months ago

As the first step in optimizing (mostly) the cl100k parser, which is used for GPT-3.5 and GPT-4, here's a regex optimization that applies to all of the 50k parsers and to the 100k parser.
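For readers unfamiliar with what the "splitter" regex does, here is a minimal sketch of the pre-tokenization step, assuming the widely published GPT-2 style split pattern. The class and method names are hypothetical, and the pattern shown is neither JTokkit's exact source nor the optimized expression from this PR; it only illustrates the kind of regex whose matching cost the benchmarks below measure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of JTokkit.
public class SplitterSketch {

    // Assumption: the GPT-2 style split pattern, compiled once and reused.
    // Alternatives are tried left to right: contractions, letter runs,
    // digit runs, other symbol runs, then whitespace.
    private static final Pattern SPLIT = Pattern.compile(
            "'s|'t|'re|'ve|'m|'ll|'d"
            + "| ?\\p{L}+"
            + "| ?\\p{N}+"
            + "| ?[^\\s\\p{L}\\p{N}]+"
            + "|\\s+(?!\\S)"
            + "|\\s+");

    // Cuts the input into the pieces that would each be byte-pair encoded separately.
    static List<String> split(String text) {
        List<String> pieces = new ArrayList<>();
        Matcher matcher = SPLIT.matcher(text);
        while (matcher.find()) {
            pieces.add(matcher.group());
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Prints the pieces found in a short sample string.
        System.out.println(split("Hello world's 42 tokens"));
    }
}
```

Because every input character passes through this matcher before any encoding happens, small changes to the alternation order or quantifiers can show up as a measurable difference in end-to-end benchmark time.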

The difference is not huge, but it is measurable (for example, cl100k_base drops from about 10.5 s to 8.9 s per run):

Before:

Benchmark                                    (dataFolderPath)  Mode  Cnt   Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase              data    ss   10  10.548 ± 0.885   s/op
SingleThreadedBenchmark.benchmarkP50kBase                data    ss   10   9.999 ± 0.097   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                data    ss   10  10.184 ± 0.131   s/op
SingleThreadedBenchmark.benchmarkR50kBase                data    ss   10   9.938 ± 0.076   s/op

After:

Benchmark                                    (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase              data    ss   10  8.947 ± 0.109   s/op
SingleThreadedBenchmark.benchmarkP50kBase                data    ss   10  9.419 ± 0.082   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                data    ss   10  9.365 ± 0.073   s/op
SingleThreadedBenchmark.benchmarkR50kBase                data    ss   10  8.403 ± 0.080   s/op
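To compare the two tables directly, the relative reduction per benchmark can be computed from the scores above. This is a small sketch with a hypothetical class name; it only uses the mean scores and ignores the reported error margins.

```java
import java.util.Locale;

// Hypothetical helper, not part of JTokkit or its benchmark suite.
public class SpeedupSketch {

    // Percentage by which the time per operation dropped.
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Mean scores in s/op, copied from the Before and After tables.
        String[] names = {"Cl100kBase", "P50kBase", "P50kEdit", "R50kBase"};
        double[] before = {10.548, 9.999, 10.184, 9.938};
        double[] after = {8.947, 9.419, 9.365, 8.403};

        for (int i = 0; i < names.length; i++) {
            System.out.printf(Locale.ROOT, "%-10s %.3f -> %.3f s/op (%.1f%% less)%n",
                    names[i], before[i], after[i],
                    reductionPercent(before[i], after[i]));
        }
    }
}
```

By this measure the reductions are roughly 15% for Cl100kBase, 6% for P50kBase, 8% for P50kEdit, and 15% for R50kBase.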

Please review commit-by-commit for the changes to make sense.

Feel free to either comment or, if it's simpler, add commits on top of these.