christopher-hesse opened 1 year ago
I had a version using a trie to do single-pass-ish encoding of an input, but it wasn't correct. I'm not certain how fast a correct version of that trie would be.
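For what it's worth, here is a minimal sketch of the greedy longest-match version of that idea (hypothetical code, not the version mentioned above). The catch, and presumably why correctness is hard, is that BPE applies merges by rank, so the true encoding of a word is not always the greedy longest-prefix segmentation:

```python
def build_trie(vocab):
    """Build a byte trie from a vocab dict of token bytes -> token id."""
    root = {}
    for token, token_id in vocab.items():
        node = root
        for b in token:
            node = node.setdefault(b, {})
        node[None] = token_id  # a None key marks "a token ends here"
    return root

def greedy_trie_encode(data: bytes, trie):
    """Single pass, always taking the longest vocab match at each position.
    NOT equivalent to BPE, which applies merges by rank and can therefore
    segment differently than longest-prefix matching."""
    out, pos = [], 0
    while pos < len(data):
        node, best_id, best_end = trie, None, pos
        for i in range(pos, len(data)):
            node = node.get(data[i])
            if node is None:
                break
            if None in node:
                best_id, best_end = node[None], i + 1
        # Safe for byte-level BPE vocabs, where every single byte is a token.
        assert best_id is not None
        out.append(best_id)
        pos = best_end
    return out
```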
Thanks, those are really nice results!
```
Previous script, full encode:                                 csh_bpe 16592057.7610188 bytes / s   => 60.3 ns/byte
Previous script, splitting only (bigram part commented out):  csh_bpe 104345021.38634916 bytes / s => 9.6 ns/byte
```
The output array can be allocated up front as `out_tokens = np.empty(input.shape, dtype=np.int32)`, since the number of output tokens can never exceed the number of input bytes. This does cost more memory during encoding, though the caller can copy the used part of the array afterward if they want. It's also unclear to me whether this has any measurable performance advantage. As a hypothetical usage sketch (the `csh_bpe.encode` signature here is illustrative, not the extension's actual API):
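```python
import numpy as np
import csh_bpe  # the toy extension benchmarked above (module name assumed)

data = np.frombuffer(open("enwik8", "rb").read(), dtype=np.uint8)
# One output slot per input byte is always enough, since every token
# covers at least one input byte.
out_tokens = np.empty(data.shape, dtype=np.int32)
n = csh_bpe.encode(data, out_tokens)  # hypothetical: returns the number of tokens written
tokens = out_tokens[:n].copy()        # copy just the used prefix if needed
```

Feel free to close this if the ideas have been ideated.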
Hello,
It seems that the slow performance is due to an inefficient implementation of the negative lookahead clause (`\s+(?!\S)`) in the fancy_regex library.
A possible way to mimic the negative lookahead is to remove it from the regex and manually re-attach the trailing whitespace to the following match (e.g. a word or a number). This approach reaches the same performance as pcre2, though it may not be the most elegant solution.
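To make the trick concrete, here is a rough Python sketch (the simplified ASCII pattern and names are mine, not code from fancy_regex, pcre2, or any of the implementations mentioned here). The final alternative is a plain `\s+` instead of `\s+(?!\S)`, and the lookahead is emulated by handing the last whitespace character back to the next match:

```python
import re

# Simplified ASCII stand-in for the GPT-2-style split pattern: the final
# alternative is `\s+` rather than `\s+(?!\S)` (Python's `re` also lacks
# the \p{L}/\p{N} classes the real pattern uses).
PAT = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def split_without_lookahead(text: str):
    parts, pos = [], 0
    while pos < len(text):
        m = PAT.match(text, pos)
        piece, end = m.group(), m.end()
        # Emulate `\s+(?!\S)`: when a whitespace run of length > 1 is followed
        # by non-whitespace, hand the last whitespace char back so it becomes
        # the leading space of the next match (e.g. of ` ?[A-Za-z]+`).
        if len(piece) > 1 and piece.isspace() and end < len(text) and not text[end].isspace():
            piece, end = piece[:-1], end - 1
        parts.append(piece)
        pos = end
    return parts
```

A single whitespace character followed by non-whitespace is kept as-is, which matches what the real pattern's trailing `\s+` fallback does in that case.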
I'm currently working on optimizing the tokenizer and the token counter (in the Java implementation at https://github.com/knuddelsgmbh/jtokkit, but most of the tricks should be applicable to other implementations as well).
```
Benchmark                                                      (dataFolderPath)  Mode  Cnt  Score    Error  Units
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCountOriginal  data              ss    10  6.503  ± 0.053   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount          data              ss    10  2.094  ± 0.042   s/op
```
So far it's about 3x faster, but I still have a few ideas left. Afterwards I'll check whether the recommendations here are applicable as well.
I made a toy GPT-2 tokenizer as a Python extension written in Rust. It seems to be slightly faster than tiktoken in my tests. It looks like https://github.com/openai/tiktoken/pull/31 may get most or all of the way there, but I thought I'd post the results from this script:
The text is 64 MiB of Wikipedia wikitext, probably enwik8, but I just found it on my hard drive.
There are no fancy optimizations here (like SIMD); the library does a few things that might differ from tiktoken:
1) The word-splitting regular expression is implemented in hand-written Rust code instead of a regex library. It uses Go's Unicode tables (https://github.com/golang/go/blob/19309779ac5e2f5a2fd3cbb34421dafb2855ac21/src/unicode/tables.go), and this seems to produce the same output, at least for this 64MB file. The splitting is done with a function that takes a `u8` numpy array plus a start offset and returns the end offset.
2) The bigram encoder takes a `u8` slice for the word, a `HashMap<(i32, i32), i32>` merge list, an `i32` slice mapping bytes to tokens (used to populate the initial output), and a mutable `i32` slice of output tokens. It keeps a skip length for each index of the output tokens (initially all 1s), updates it whenever two tokens are merged, and compacts the output tokens when it is done (see the sketch after this list).
3) (I think tiktoken does this) After splitting, before encoding a word, it checks the vocab hashmap to see whether the word is already a single token.
4) The interface uses numpy arrays instead of `bytes`, and the output array is provided as one of the inputs so the caller can manage the memory allocation (not sure whether this has any performance impact).
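For concreteness, here is a rough Python sketch of the skip-length merge from item 2 (illustrative names; the real version works on Rust slices and numpy arrays, and I'm assuming a GPT-2-style vocab where a smaller merged token id means an earlier, higher-priority merge):

```python
def bpe_encode_word(word: bytes, byte_to_token, merges):
    """Skip-length bigram encoder. `byte_to_token` maps each byte value to
    its initial token id; `merges` maps a pair of token ids to the merged
    token id, which doubles as the merge priority (lower = earlier)."""
    if not word:
        return []
    tokens = [byte_to_token[b] for b in word]
    skips = [1] * len(tokens)  # skips[i]: distance from i to the next live slot
    while True:
        # Find the highest-priority (lowest merged id) adjacent live pair.
        best, i = None, 0
        while i + skips[i] < len(tokens):
            j = i + skips[i]
            merged = merges.get((tokens[i], tokens[j]))
            if merged is not None and (best is None or merged < best[0]):
                best = (merged, i)
            i = j
        if best is None:
            break  # no mergeable pair left
        merged, i = best
        j = i + skips[i]
        tokens[i] = merged
        skips[i] += skips[j]  # slot j is now dead; jump over it from i
    # Compact: collect only the live slots.
    out, i = [], 0
    while i < len(tokens):
        out.append(tokens[i])
        i += skips[i]
    return out
```

The appeal of the skip array is that a merge never shifts the tail of the token list; the dead slots are swept once at the end instead of on every merge.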
I didn't implement the splitting with a Rust regex library, so I don't know how much the hand-written word splitting matters, though I could benchmark just the splitting part.