pkoukk / tiktoken-go

go version of tiktoken
MIT License

Incorrect calculation for Chinese characters #36

Closed hyp530 closed 10 months ago

hyp530 commented 10 months ago

The token count differs significantly from the one reported by https://platform.openai.com/tokenizer.

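package main

import (
    "fmt"
    "log"

    "github.com/pkoukk/tiktoken-go"
)

func main() {
    text := "这是一个测试"
    // Load the cl100k_base encoding and stop if it cannot be loaded.
    tke, err := tiktoken.GetEncoding("cl100k_base")
    if err != nil {
        log.Fatalf("get encoding: %v", err)
    }
    tokens := tke.Encode(text, nil, nil)
    fmt.Println(len(tokens)) // Result: 4
}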

The OpenAI Tokenizer reports 10 tokens for the same text.

pkoukk commented 10 months ago

The OpenAI Tokenizer page uses GPT-3, whose encoding is p50k_base, so its count differs from the cl100k_base result in your code.