Thanks for your feedback. I've also noticed this issue; it especially happens when (many) users use non-ASCII characters (e.g. Chinese characters), where the number of tokens grows very fast. Counting the number of characters is not a good method at all.
The reason I used a 3k-char limit is not only that it is a quick and dirty way to cap context length, but also that I had an extremely rough conversion in mind: 1 word ≈ 5 chars, and 0.75 words ≈ 1 token, so 3000 chars should be enough for most cases.
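For reference, a minimal sketch of that back-of-the-envelope conversion (the ratios are the rough approximations above, not measured values, and the helper name is just illustrative):

```python
# Rough conversion behind the 3000-char limit.
# Approximations: ~5 chars per English word, ~0.75 words per token.
CHARS_PER_WORD = 5
WORDS_PER_TOKEN = 0.75

def estimate_tokens_from_chars(n_chars: int) -> int:
    """Very rough token estimate for English text; not valid for CJK text."""
    n_words = n_chars / CHARS_PER_WORD
    return round(n_words / WORDS_PER_TOKEN)

print(estimate_tokens_from_chars(3000))  # ~800 tokens for 3000 English chars
```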
I will implement this later (or please feel free to submit a PR). Thanks.
It is done (#78). Short analysis: the compression of English words into tokens is very good. In an English example, with the <3000 char limit we only used ~634 tokens for the input; now we can use all of the available tokens.
For non-ASCII characters, like Chinese, the case is different. For example, the previous max_char=3000 logic could throw an error because the same 3000 characters would eventually use 4426 tokens. Now it is safe.
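For illustration, a quick way to see this difference with tiktoken (the sample strings and exact counts are only illustrative; real pages will vary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo

english = "This is a short English sentence about search results."
chinese = "这是一段关于搜索结果的简短中文句子。"  # "A short Chinese sentence about search results."

# English text is close to one token per word;
# Chinese characters are often split into one or more tokens each.
print(len(english), "chars ->", len(enc.encode(english)), "tokens")
print(len(chinese), "chars ->", len(enc.encode(chinese)), "tokens")
```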
We could potentially tune the config for cheaper cost, though.
Hi! Thanks for the project, it works surprisingly nicely. I checked the code and it seems that for now you check how much context can fit by character length, which really limits the capabilities. With OpenAI, especially with `gpt-3.5-turbo`, most English words are really 1 token, so with a 3000-token context you can fit ~3k English words (in reality a bit less, but still close), not characters. This means searchGPT will be able to pass 2-3x more web page content into the request, so the results will improve, albeit it'll cost more. The best library to use for OpenAI tokenization right now is https://github.com/openai/tiktoken. For older models (everything except `gpt-3.5-turbo`) you can use the `gpt2` encoding; for Turbo you have to use `cl100k_base`.
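A minimal sketch of counting and truncating by tokens with tiktoken could look like this (the function name, default limit, and fallback logic are just an illustration, not the project's actual code):

```python
import tiktoken

def truncate_by_tokens(text: str, model: str = "gpt-3.5-turbo", max_tokens: int = 3000) -> str:
    """Truncate text to at most max_tokens tokens for the given model."""
    try:
        # Picks cl100k_base for gpt-3.5-turbo automatically.
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fallback for older/unknown models.
        enc = tiktoken.get_encoding("gpt2")
    tokens = enc.encode(text)
    return enc.decode(tokens[:max_tokens])

snippet = truncate_by_tokens("some long web page content ..." * 1000)
```

Truncating on token boundaries like this also avoids the guesswork of char-to-token ratios entirely, since the count comes from the same tokenizer the model uses.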