Open nullreferencez opened 2 years ago
Hmm, maybe someone else fixed that, but it seems to work fine in the latest version. See the added test; we can probably close this.
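For context, this is a rough sketch of the kind of round-trip regression test that would cover the report (hypothetical, not necessarily the actual test added to the repo; assumes this package's `encode`/`decode` exports):

```js
// Hypothetical regression-test sketch for curly-quote handling.
// Assumes the package's documented encode/decode exports.
const assert = require('assert');
const { encode, decode } = require('gpt-3-encoder');

const text = '“wrote jack a letter”';
const tokens = encode(text);

// No token id should be null/undefined, and decoding should round-trip.
assert.ok(tokens.every((t) => Number.isInteger(t)));
assert.strictEqual(decode(tokens), text);
```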
That's because this encoder is actually for the older models. Doesn't match up with gpt-3.5-turbo or gpt-4.
If you want better accuracy for these newer models, see my package that started off as a fork of this one: gpt-tokenizer.
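As a rough illustration of the difference, here's a sketch using gpt-tokenizer's per-model entry point (assuming the model-specific import paths its README describes):

```js
// Sketch: tokenizing with an encoder that matches the newer chat models.
// The per-model entry point is assumed from gpt-tokenizer's README.
const { encode } = require('gpt-tokenizer/model/gpt-3.5-turbo');

// gpt-3.5-turbo / gpt-4 use the cl100k_base encoding, so the token ids
// differ from the older tables this repo ships with.
console.log(encode('“wrote jack a letter”'));
```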
@niieani wdyt about this one: https://github.com/dqbd/tiktoken? Is your implementation better?
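For comparison, a sketch of how those bindings are typically used (assuming the @dqbd/tiktoken WASM package and its `encoding_for_model` export):

```js
// Sketch: encoding with the tiktoken WASM bindings; API assumed from the
// @dqbd/tiktoken README. The encoder wrapper must be freed when done.
const { encoding_for_model } = require('@dqbd/tiktoken');

const enc = encoding_for_model('gpt-3.5-turbo');
console.log(enc.encode('“wrote jack a letter”'));
enc.free();
```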
Hmm, I can't reproduce it; the build looks promising though.
https://github.com/syonfox/GPT-3-Encoder/issues/6
Notable pros: no TS :) I think my build works; it's simple, there's only one version, and it's a good enough estimation.
@seyfer Tiktoken JS looks good too. My gpt-tokenizer has a few extra features though that might be useful to you (like checking whether a given text is within the token limit or not).
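A sketch of that token-limit check (assuming gpt-tokenizer's `isWithinTokenLimit` export, which per its README returns false when the text is over the limit and the token count otherwise):

```js
// Sketch: checking whether a prompt fits a token budget with gpt-tokenizer.
// isWithinTokenLimit is assumed from the package README: false when the
// text exceeds the limit, otherwise the token count.
const { isWithinTokenLimit } = require('gpt-tokenizer');

const prompt = '“wrote jack a letter”';
const tokenLimit = 4096;

if (!isWithinTokenLimit(prompt, tokenLimit)) {
  throw new Error('Prompt exceeds the token limit');
}
```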
When a character like “ is used, it will give back a faulty output, as shown below.
encode('“wrote jack a letter”');
[null, 222, 250, 42910, 14509, 257, 3850, null, 222, 251]
Whereas the OpenAI tokenizer gives the output as:
[447, 250, 42910, 14509, 257, 3850, 447, 251]
This can be triggered by other characters like █ and many more.
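A minimal sketch of how this can be reproduced and detected (assuming this package's `encode` export; the expected ids are the ones the OpenAI tokenizer reports):

```js
// Sketch: reproducing the faulty output for curly quotes.
const { encode } = require('gpt-3-encoder');

const tokens = encode('“wrote jack a letter”');
console.log(tokens);
// Observed here:     [null, 222, 250, 42910, 14509, 257, 3850, null, 222, 251]
// OpenAI tokenizer:  [447, 250, 42910, 14509, 257, 3850, 447, 251]

// The nulls make the defect easy to spot programmatically.
const broken = tokens.some((t) => t == null);
console.log('contains null token ids:', broken);
```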