Thanks for your feedback. I've also noticed this issue; it especially happens when (many) users use non-ASCII characters (e.g. Chinese characters), where the number of tokens grows very fast. Counting the number of characters is not a good method at all.
The reason I used a 3k-char limit is not only that it is a quick and dirty way to cap context length, but also that I had an extremely rough conversion in mind: 1 word ≈ 5 chars, and 0.75 words ≈ 1 token, so 3000 chars should be enough for most cases.
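For reference, a minimal sketch of that back-of-the-envelope conversion (the ratios are the rough approximations above, not measured values, and the helper name is just illustrative):

```python
# Rough conversion behind the 3000-char limit.
# Approximations: ~5 chars per English word, ~0.75 words per token.
CHARS_PER_WORD = 5
WORDS_PER_TOKEN = 0.75

def estimate_tokens_from_chars(n_chars: int) -> int:
    """Very rough token estimate for English text; not valid for CJK text."""
    n_words = n_chars / CHARS_PER_WORD
    return round(n_words / WORDS_PER_TOKEN)

print(estimate_tokens_from_chars(3000))  # ~800 tokens for 3000 English chars
```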
I will implement this later (or please feel free to submit a PR). Thanks.
It is done (#78). Short analysis: the compression of English words into tokens is very good. In an English example, with the <3000 char limit we only used ~634 tokens for the input; now we can use all of the available tokens.
For non-ASCII characters, like Chinese, the case is different. For example, the previous max_char=3000 logic could throw an error because the same 3000 characters would eventually use 4426 tokens. Now it is safe.
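For illustration, a quick way to see this difference with tiktoken (the sample strings and exact counts are only illustrative; real pages will vary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo

english = "This is a short English sentence about search results."
chinese = "这是一段关于搜索结果的简短中文句子。"  # "A short Chinese sentence about search results."

# English text is close to one token per word;
# Chinese characters are often split into one or more tokens each.
print(len(english), "chars ->", len(enc.encode(english)), "tokens")
print(len(chinese), "chars ->", len(enc.encode(chinese)), "tokens")
```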
We could potentially tune the config for cheaper cost, though.
Hi! Thanks for the project, it works surprisingly nicely. I checked the code and it seems that for now you check how much context can fit by character length, which really limits the capabilities. With OpenAI, especially with `gpt-3.5-turbo`, most English words are really 1 token, so with a 3000-token context you can fit ~3k English words (in reality a bit less, but still close), not characters. This means searchGPT will be able to pass 2-3x more web page content into the request, so the results will improve, albeit it'll cost more. The best library to use for OpenAI tokenization right now is https://github.com/openai/tiktoken. For older models (everything except `gpt-3.5-turbo`) you can use the `gpt2` encoding; for Turbo you have to use `cl100k_base`.
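A minimal sketch of counting and truncating by tokens with tiktoken could look like this (the function name, default limit, and fallback logic are just an illustration, not the project's actual code):

```python
import tiktoken

def truncate_by_tokens(text: str, model: str = "gpt-3.5-turbo", max_tokens: int = 3000) -> str:
    """Truncate text to at most max_tokens tokens for the given model."""
    try:
        # Picks cl100k_base for gpt-3.5-turbo automatically.
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fallback for older/unknown models.
        enc = tiktoken.get_encoding("gpt2")
    tokens = enc.encode(text)
    return enc.decode(tokens[:max_tokens])

snippet = truncate_by_tokens("some long web page content ..." * 1000)
```

Truncating on token boundaries like this also avoids the guesswork of char-to-token ratios entirely, since the count comes from the same tokenizer the model uses.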