Chinese text extraction is not correct

haifenghuang commented 6 years ago

As the title suggests, consider the code below:

keyword_processor := NewKeywordProcessor()
keyword_processor.AddKeywords("欢迎")
keyword_processor.AddKeywords("来")
keyword_processor.AddKeywords("北京")

result := keyword_processor.ExtractKeywords("欢迎来北京")

for _, v := range result {
    e := ExtractResult(v)
    fmt.Printf("return : %s\n", e.Keyword)
}

There is nothing in the output, because len(result) = 0.

If we change the keywords above to English:

keyword_processor := NewKeywordProcessor()
keyword_processor.AddKeywords("welcome")
keyword_processor.AddKeywords("to")
keyword_processor.AddKeywords("beijing")

result := keyword_processor.ExtractKeywords("welcome to beijing")

for _, v := range result {
    e := ExtractResult(v)
    fmt.Printf("return : %s\n", e.Keyword)
}

The result is:

return : welcome
return : to
return : beijing

sundy-li commented 6 years ago

Sorry, Chinese isn't supported for now. English uses spaces to separate words, but Chinese does not. I did consider adding this feature to this repo, but in the end I thought it would be better to build a separate tool for extraction from Chinese sentences.
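
To illustrate the point about word separators, here is a minimal sketch of a matcher that only accepts a keyword when it is surrounded by non-word characters. It is not this repository's code: `extract`, `isWordChar`, and the `requireBoundary` flag are hypothetical names made up for the example, and the real implementation may differ. Under that assumption, the Chinese sentence yields nothing because every neighbouring Han character counts as a letter, while the English sentence matches on its spaces.

```go
package main

import (
	"fmt"
	"unicode"
)

// isWordChar reports whether r is treated as part of a word.
// Han characters are Unicode letters, so they count as word characters.
func isWordChar(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_'
}

// extract (hypothetical helper) scans text left to right and returns the
// longest keyword found at each position. With requireBoundary set, a match
// is kept only if the runes on both sides are not word characters.
func extract(keywords []string, text string, requireBoundary bool) []string {
	set := make(map[string]bool, len(keywords))
	maxLen := 0
	for _, k := range keywords {
		set[k] = true
		if n := len([]rune(k)); n > maxLen {
			maxLen = n
		}
	}

	runes := []rune(text) // index by rune, not by byte
	var out []string
	for i := 0; i < len(runes); {
		matched := 0
		for n := maxLen; n >= 1; n-- {
			if i+n > len(runes) || !set[string(runes[i:i+n])] {
				continue
			}
			if requireBoundary {
				if i > 0 && isWordChar(runes[i-1]) {
					continue
				}
				if i+n < len(runes) && isWordChar(runes[i+n]) {
					continue
				}
			}
			matched = n
			break
		}
		if matched > 0 {
			out = append(out, string(runes[i:i+matched]))
			i += matched
		} else {
			i++
		}
	}
	return out
}

func main() {
	zh := []string{"欢迎", "来", "北京"}
	en := []string{"welcome", "to", "beijing"}

	fmt.Println(len(extract(zh, "欢迎来北京", true)))         // 0: no separators between words
	fmt.Println(len(extract(en, "welcome to beijing", true))) // 3: spaces act as boundaries
	fmt.Println(extract(zh, "欢迎来北京", false))             // [欢迎 来 北京]
}
```

Dropping the boundary check makes the Chinese case work in this sketch, but it would also let English keywords match inside longer words (for example "to" inside "tomorrow"), which is presumably why a separate approach for Chinese was preferred.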

waltsmith88 commented 5 years ago

Hi haifenghuang, I recently did similar work on flashtext with Chinese support.

    keywordProcessor := gf.NewKeywordProcessor()
    keywordProcessor.AddKeyword("欢迎")
    keywordProcessor.AddKeyword("来")
    keywordProcessor.AddKeyword("北京")

    result := keywordProcessor.ExtractKeywords("欢迎来北京")

    for _, v := range result {
        fmt.Printf("return : %s\n", v)
    }

And the result is:

return : 欢迎
return : 来
return : 北京

The package is here.

Besides, I used PyFlashtext, which has a similar problem with Chinese, and I fixed it. To improve performance in my production environment, I rewrote the FlashText algorithm in Go instead of Python, and it works well. You are welcome to use go-flashtext.
