mvitlov / tiktoken

tiktoken is a BPE tokeniser for use with OpenAI's models

Horrible regex performance (30x slower than original tiktoken) #7

Open l0rinc opened 9 months ago

l0rinc commented 9 months ago

Benchmarking against the original tiktoken (written in Rust) and against the Java port shows that the Dart regex fragment parser is insanely slow (roughly 30x slower). [image]

Even simple regex optimizations don't help, since Dart doesn't seem to support possessive quantifiers, and the slowness is most likely caused by backtracking.
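For reference, engines without possessive quantifiers can often emulate them with a lookahead plus a backreference, because a lookahead is not re-entered once it has matched. Dart's `RegExp` follows JavaScript regex syntax, which has both features, so the same pattern shape should be expressible there. Whether it actually reduces the slowdown in Dart is an untested assumption. The sketch below uses Python only to show the matching behaviour, and the helper name `possessive` is made up for illustration:

```python
import re


def possessive(inner: str) -> str:
    """Wrap `inner` so it behaves like a possessive/atomic match.

    Hypothetical helper: the lookahead captures the longest run once, and the
    backreference consumes exactly that text. Assumes this is the first
    capturing group in the final pattern (hence the hard-coded \\1).
    """
    return r"(?=(" + inner + r"))\1"


# Ordinary greedy quantifier: \d+ gives back one digit so the final \d can match.
GREEDY = re.compile(r"\d+\d")

# Emulated possessive: the digits captured by the lookahead are never given
# back, so the trailing \d has nothing left to match.
EMULATED = re.compile(possessive(r"\d+") + r"\d")

# The emulated form still matches when no backtracking is required.
EMULATED_OK = re.compile(possessive(r"\d+") + r"x")

greedy_match = GREEDY.search("123")
emulated_match = EMULATED.search("123")
emulated_ok_match = EMULATED_OK.search("123x")
```

Here `greedy_match` covers the whole string by backtracking, `emulated_match` is `None`, and `emulated_ok_match` matches `123x`.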

A dedicated cl100k parser could help, as done in https://github.com/knuddelsgmbh/jtokkit/pull/77.
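To illustrate what "dedicated parser" could mean, here is a minimal sketch of a hand-written scanner that replaces regex matching with a single forward pass and no backtracking. It is not taken from the linked PR and is not a full cl100k splitter: it covers only two branches of the pattern (`[^\r\n\p{L}\p{N}]?\p{L}+` and `\p{N}{1,3}`) and emits every other character as its own piece. The function names are hypothetical, and it is written in Python for brevity rather than Dart:

```python
import unicodedata


def is_letter(ch: str) -> bool:
    # Unicode general category L* (the \p{L} class).
    return unicodedata.category(ch).startswith("L")


def is_number(ch: str) -> bool:
    # Unicode general category N* (the \p{N} class).
    return unicodedata.category(ch).startswith("N")


def split_simplified(text: str) -> list[str]:
    """Split text with two hand-coded branches; anything else is one char."""
    pieces = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        j = i
        # Optional single leading char that is not a letter, number, CR or LF,
        # taken only when a letter follows it.
        if (
            not is_letter(ch)
            and not is_number(ch)
            and ch not in "\r\n"
            and i + 1 < n
            and is_letter(text[i + 1])
        ):
            j += 1
        if is_letter(text[j]):
            # Run of letters: \p{L}+
            while j < n and is_letter(text[j]):
                j += 1
            pieces.append(text[i:j])
            i = j
            continue
        if is_number(ch):
            # Up to three numeric characters: \p{N}{1,3}
            while j < n and j - i < 3 and is_number(text[j]):
                j += 1
            pieces.append(text[i:j])
            i = j
            continue
        # Simplification: the real pattern has more branches (contractions,
        # punctuation runs, whitespace handling) that are not modelled here.
        pieces.append(ch)
        i += 1
    return pieces


sample = "Hello world 12345"
sample_pieces = split_simplified(sample)
```

Each character is inspected a bounded number of times, so the cost grows linearly with input length regardless of the input's shape.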