As a first step toward optimizing the parsers, mainly cl100k (used for GPT-3.5 and GPT-4), here is a regex optimization that applies to all of the 50k parsers and the 100k parser.
The difference is not huge but measurable, roughly 6% to 15% less time per run depending on the encoding:
Before:
Benchmark                                    (dataFolderPath)  Mode  Cnt   Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase              data    ss   10  10.548 ± 0.885   s/op
SingleThreadedBenchmark.benchmarkP50kBase                data    ss   10   9.999 ± 0.097   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                data    ss   10  10.184 ± 0.131   s/op
SingleThreadedBenchmark.benchmarkR50kBase                data    ss   10   9.938 ± 0.076   s/op
After:
Benchmark                                    (dataFolderPath)  Mode  Cnt   Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase              data    ss   10   8.947 ± 0.109   s/op
SingleThreadedBenchmark.benchmarkP50kBase                data    ss   10   9.419 ± 0.082   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                data    ss   10   9.365 ± 0.073   s/op
SingleThreadedBenchmark.benchmarkR50kBase                data    ss   10   8.403 ± 0.080   s/op
Please review commit-by-commit for the changes to make sense:
Feel free to comment or, if it's simpler, add commits on top of these.