go-ego / gse

Go efficient multilingual NLP and text segmentation; support English, Chinese, Japanese and others.
Apache License 2.0

In Chinese word segmentation, only a single word is separated #176

Open xiaominger opened 1 year ago

xiaominger commented 1 year ago

Running the following code (`tabooSegmentCustomDicList` contains more than 2,000 words):

```go
// Caller: register every custom word, lower-cased, with the segmenter.
for _, tabooSegmentCustomDic := range tabooSegmentCustomDicList {
	lowerCaseWord := strings.ToLower(tabooSegmentCustomDic.Word)
	segmentutil.AddWord(lowerCaseWord)
}

// Package segmentutil; seg is assumed to be a package-level gse.Segmenter.
func AddWord(word string) bool {
	defer recoverPanic(word)
	err := seg.AddToken(word, 100)
	if err != nil {
		// The original format string had one verb for two arguments.
		logger.Errorf("Error when AddWord, word=%s: %v", word, err)
		return false
	}
	return true
}

func TextSegment(text string) []string {
	defer recoverPanic(text)
	return seg.Cut(text)
}
```

Calling `TextSegment("api发送文本loumès 𝘾𝘼𝙍𝙏𝙄𝙀𝙍")` returns `["api","发","送","文","本","lou","mès"," ","𝘾𝘼𝙍𝙏𝙄𝙀𝙍"]`: the Chinese text is split into single characters instead of words.

zwj186 commented 1 year ago

Please set 'DefaultAnalyzer' to 'cjk.AnalyzerName'; that will resolve the issue.

kms9 commented 5 months ago

How do I set `DefaultAnalyzer`? I searched all the files in the repo and could not find this keyword or setting.