rajentrivedi / tokenizer-x

OpenAI token calculator
MIT License

[Bug]: Calculates wrong number of tokens when input string is Vietnamese #31

Closed: leo270323 closed this issue 4 hours ago

leo270323 commented 5 hours ago

What happened?

Hello. As mentioned in the title, I found a bug when using TokenizerX to count tokens. For English or Japanese input TokenizerX works well, but for Vietnamese input the result is wrong. My Vietnamese input string is "Xin Chào. Bạn có khỏe không". The tool at https://platform.openai.com/tokenizer returns 13 tokens for it, but TokenizerX::count('Xin Chào. Bạn có khỏe không') returns 21.

How to reproduce the bug

Count the tokens of the Vietnamese string "Xin Chào. Bạn có khỏe không" with https://platform.openai.com/tokenizer, which returns 13. Then call TokenizerX::count('Xin Chào. Bạn có khỏe không'), which returns 21.

Package Version

1.0.3

PHP Version

8.2

Laravel Version

11.16.0

Which operating systems does this happen with?

No response

Notes

No response

leo270323 commented 4 hours ago

Sorry, it looks like this was my mistake: I forgot the second parameter of the count() method. The call should be TokenizerX::count('Xin Chào. Bạn có khỏe không', 'gpt-4'); and not TokenizerX::count('Xin Chào. Bạn có khỏe không'); Everything works well now. Sorry, and thank you so much.
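
For anyone landing here with the same mismatch, the two calls can be compared side by side. This is only a minimal sketch: it assumes it runs inside a Laravel application with the package installed, and the `use` line is an assumption about the package namespace that is not stated in this thread. The only API taken from the issue itself is `TokenizerX::count()` with an optional model name as its second argument.

```php
<?php

// Assumption: the class is exposed under this namespace; adjust the import
// if your installed version of the package differs.
use Rajentrivedi\TokenizerX\TokenizerX;

$text = 'Xin Chào. Bạn có khỏe không';

// No model name given: the package uses whatever model it defaults to,
// which may not be the one selected in the web tokenizer.
// The issue reports 21 for this call.
$defaultCount = TokenizerX::count($text);

// Model name passed explicitly, as in the fix described above.
$gpt4Count = TokenizerX::count($text, 'gpt-4');

echo "default model: {$defaultCount}\n";
echo "gpt-4: {$gpt4Count}\n";
```

If the two numbers differ, the likely cause is that the default model and the one chosen on https://platform.openai.com/tokenizer use different encodings, so passing the model name explicitly keeps the comparison like for like.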