latitudegames / GPT-3-Encoder

Javascript BPE Encoder Decoder for GPT-2 / GPT-3
MIT License

Unusable and does not match with token output from GPT-3 #9

Open nullreferencez opened 2 years ago

nullreferencez commented 2 years ago

When a character like “ is used, it gives back a faulty output, as shown below.

encode('“wrote jack a letter”');

[null, 222, 250, 42910, 14509, 257, 3850, null, 222, 251]

Whereas the OpenAI tokenizer gives the output:

[447, 250, 42910, 14509, 257, 3850, 447, 251]

This can also be triggered by other characters, such as █ and many more.
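
For reference, a minimal reproduction sketch, assuming the npm package gpt-3-encoder and its documented encode/decode exports:

// Minimal reproduction sketch, assuming the npm package "gpt-3-encoder".
const { encode, decode } = require('gpt-3-encoder');

const text = '“wrote jack a letter”';
const tokens = encode(text);

console.log(tokens);                  // reported: contains null entries around the curly quotes
console.log(decode(tokens) === text); // a correct encoder should round-trip the input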

syonfox commented 1 year ago

Hmm, maybe someone else fixed that, but it seems to work fine in the latest version. See the added test; we can probably close this.
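
A regression test along these lines (a sketch only, not the actual test file) would pin the expected token IDs from the OpenAI tokenizer:

const assert = require('assert');
const { encode } = require('gpt-3-encoder');

// Expected IDs taken from the OpenAI tokenizer output quoted in the issue.
assert.deepStrictEqual(
  encode('“wrote jack a letter”'),
  [447, 250, 42910, 14509, 257, 3850, 447, 251]
);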

syonfox commented 1 year ago

https://github.com/syonfox/GPT-3-Encoder/actions/runs/3776876895

niieani commented 1 year ago

That's because this encoder is actually for the older models. It doesn't match up with gpt-3.5-turbo or gpt-4. If you want better accuracy for these newer models, see my package, which started off as a fork of this one: gpt-tokenizer.
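
For context, a usage sketch of gpt-tokenizer, assuming an ESM setup and the encode/decode exports described in its README (the default encoding it targets may vary by version):

// Sketch only; check the gpt-tokenizer README for the current default encoding.
import { encode, decode } from 'gpt-tokenizer';

const tokens = encode('“wrote jack a letter”');
console.log(tokens);         // token IDs for the newer chat-model encoding
console.log(decode(tokens)); // should round-trip to the original string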

seyfer commented 1 year ago

@niieani wdyt about this one: https://github.com/dqbd/tiktoken? Is your implementation better?
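
A comparison sketch for the dqbd/tiktoken bindings, assuming the get_encoding API described in that repo (the WASM-backed encoder has to be freed explicitly):

// Sketch using @dqbd/tiktoken; API names follow that project's README.
import { get_encoding } from '@dqbd/tiktoken';

const enc = get_encoding('cl100k_base');          // encoding used by gpt-3.5-turbo / gpt-4
console.log(enc.encode('“wrote jack a letter”')); // token IDs as a Uint32Array
enc.free();                                       // release the WASM-backed encoder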

syonfox commented 1 year ago

Hmm, I can't reproduce this; the build looks promising though.

https://github.com/syonfox/GPT-3-Encoder/issues/6

Notable pros: no TS :) I think my build works; it's simple, only one version, and good enough for estimation.

niieani commented 1 year ago

@seyfer Tiktoken JS looks good too. My gpt-tokenizer has a few extra features though that might be useful to you (like checking whether a given text is within the token limit or not).
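
A sketch of that token-limit check, assuming the isWithinTokenLimit export described in the gpt-tokenizer README:

// Per the README, this returns the token count when the text fits within
// the limit and false otherwise (treat the exact semantics as an assumption).
import { isWithinTokenLimit } from 'gpt-tokenizer';

console.log(isWithinTokenLimit('“wrote jack a letter”', 4096));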