pkoukk / tiktoken-go

go version of tiktoken
MIT License
601 stars 67 forks source link

Optimize encodeNative() #20

Closed bakks closed 1 year ago

bakks commented 1 year ago

I was using this encoder for https://github.com/bakks/butterfish and found that the encoder scales very poorly as the string gets longer. I did a CPU profile (see image below) and found that the most expensive operation was this specific line: https://github.com/pkoukk/tiktoken-go/blob/03647ca046a15c80510e2a135a2dacdf31320ca1/core_bpe.go#L168

Screen Shot 2023-06-01 at 6 58 18 PM

The encodeNative() function calls cutTextInRune() a bunch of times, which casts a string to []rune. That cast probably copies everything in memory, which for a small string is pretty negligible but if the string gets longer then it becomes more expensive. And this cast is being duplicated on the same string unnecessarily, so it currently scales superlinearly with string size, i.e. as the strings get longer it gets very very slow.

I made some changes and benchmarked the results. For a string of ~589000 bytes the old implementation basically never finishes, the new implementation takes about 0.2 seconds.

Please double check me and merge if useful.

pkoukk commented 1 year ago

I appreciate your contribution. This is indeed a significant improvement.