openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Optimize regular expressions used for splitting by ~20% #234

Closed paplorinc closed 5 months ago

paplorinc commented 6 months ago

By combining the contractions into a single non-capturing group prefixed by ', we can speed up matching by roughly 20%.
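
As a rough sketch (the pattern strings below are my reconstruction for illustration, not necessarily the committed diff), the contraction part of the change looks like this:

import re

# Original contraction alternation: the case-insensitive group repeats the
# leading apostrophe in every branch.
old = re.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)")

# Factored form: the apostrophe is written once, and the single-letter
# suffixes collapse into a character class, so non-contraction input fails
# as soon as the apostrophe test fails.
new = re.compile(r"'(?i:[sdmt]|ll|ve|re)")

for text in ["don't", "we're", "I'LL", "rock'n'roll", "plain"]:
    assert old.findall(text) == new.findall(text)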

By using possessive quantifiers in the word and punctuation groups of the cl100k_base pattern, we avoid some backtracking.
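
For illustration, here is what the possessive quantifier does on the word group, using the third-party regex package (which supports \p{...} classes alongside possessive quantifiers, unlike the stdlib re module); this is a simplified excerpt, not the full committed pattern:

import regex  # third-party package: pip install regex

# Word group of cl100k_base: an optional non-letter/non-digit prefix
# followed by one or more letters.
backtracking = regex.compile(r"[^\r\n\p{L}\p{N}]?\p{L}+")

# Possessive `?+`: once the prefix character is consumed, the engine never
# backtracks to retry the letters without it.  Giving the prefix back could
# not have helped anyway (the next character is still not a letter), so the
# matches are identical and the retry work is saved.
possessive = regex.compile(r"[^\r\n\p{L}\p{N}]?+\p{L}+")

text = "hello, world! ¡hola!"
assert backtracking.findall(text) == possessive.findall(text)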

The last newline whitespace group can also be simplified to match a single newline explicitly, since the preceding \s* already consumes any earlier newlines in the run.

Overall, the regex matches exactly the same sequences of characters as before, for any casing and for Unicode input.
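
As a sanity check, the equivalence can be spot-checked on sample text with the regex package (the two pattern strings below are my reconstruction of the cl100k_base splitting pattern before and after this change, so the committed strings may differ in detail):

import regex  # third-party; supports \p{...} classes and possessive quantifiers

OLD = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
NEW = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

samples = [
    "Hello world, it's 2024!\r\n\r\n  Ümläut words   and\ttabs 12345",
    "line one\n\n\nline two  \n",
]
for sample in samples:
    # Both patterns should split the text into identical pre-token pieces.
    assert regex.findall(OLD, sample) == regex.findall(NEW, sample)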

This is the first part of the optimizations I did for jtokkit, which reduced tokenization time from ~10.5 seconds to ~1.6 seconds over several big steps. If this change is accepted, I'll continue migrating the other changes I've made.

I've modified benchmark.py locally to measure the improvement:

import os
import time

import tiktoken


def benchmark_batch(documents: list[str]) -> None:
    num_threads = int(os.environ.get("RAYON_NUM_THREADS", "1"))
    num_bytes = sum(map(len, map(str.encode, documents)))
    print(f"num_threads: {num_threads}, num_bytes: {num_bytes}")

    enc = tiktoken.get_encoding("cl100k_base")
    enc.encode("warmup")

    for _ in range(5):
        start = time.perf_counter_ns()
        enc.encode_ordinary_batch(documents, num_threads=num_threads)
        end = time.perf_counter_ns()
        bytes_per_second = num_bytes / (end - start) * 1e9
        print(f"tiktoken \t{bytes_per_second:,.0f} bytes / s")

Here the speedup is as follows:

Before:

num_threads: 1, num_bytes: 98359164
tiktoken    8,040,959 bytes / s
tiktoken    8,047,612 bytes / s
tiktoken    8,059,961 bytes / s
tiktoken    8,097,749 bytes / s
tiktoken    8,125,161 bytes / s

After regex optimization:

num_threads: 1, num_bytes: 98359164
tiktoken    9,861,159 bytes / s
tiktoken    9,888,486 bytes / s
tiktoken    9,918,514 bytes / s
tiktoken    9,902,705 bytes / s
tiktoken    9,917,494 bytes / s

The other 50k tokenizers are also sped up slightly, not just cl100k_base.