go-ego / gse

Go efficient multilingual NLP and text segmentation; support English, Chinese, Japanese and others.
Apache License 2.0
2.57k stars 215 forks

Can the segmenter keep the original case instead of lowercasing the sentence? #192

Open ivory2406 opened 1 day ago

ivory2406 commented 1 day ago

Hello, I want to keep uppercase letters in the output. For example:

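    package main

    import "github.com/go-ego/gse"

    func main() {
        text := "Hello world, Helloworld. Winter is coming! 你好世界."
        jieba := new(gse.Segmenter)
        jieba.LoadDict() // load the default dictionary
        res := jieba.Cut(text)
        println(ToJson(res)) // ToJson is my own JSON helper, not shown here
    }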

The actual result is: ["hello"," ","world",","," ","helloworld","."," ","winter"," ","is"," ","coming","!"," ","你好","世界","."]

The result I expect is: ["Hello"," ","world",","," ","Helloworld","."," ","Winter"," ","is"," ","coming","!"," ","你好","世界","."]
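Until this is configurable, one possible workaround is to map the lowercased tokens back onto the original text. The sketch below is not part of gse: `restoreCase` is a hypothetical helper, and it assumes the tokens, concatenated in order, cover the whole input rune for rune and differ from it only in letter case (which holds for the output shown above, where spaces and punctuation are kept as tokens).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// restoreCase (hypothetical helper, not a gse function) maps lowercased
// tokens back onto the original text.
// Assumption: the tokens, concatenated in order, cover the whole input
// rune for rune and differ from it only in letter case.
func restoreCase(original string, tokens []string) []string {
	restored := make([]string, 0, len(tokens))
	pos := 0 // byte offset into original
	for _, tok := range tokens {
		start := pos
		// Advance by the token's rune count rather than its byte count,
		// because lowercasing can change a rune's encoded length.
		for n := utf8.RuneCountInString(tok); n > 0 && pos < len(original); n-- {
			_, size := utf8.DecodeRuneInString(original[pos:])
			pos += size
		}
		restored = append(restored, original[start:pos])
	}
	return restored
}

func main() {
	text := "Hello world, Helloworld. Winter is coming! 你好世界."
	// Tokens as reported in this issue (lowercased by the segmenter).
	lowered := []string{"hello", " ", "world", ",", " ", "helloworld", ".", " ",
		"winter", " ", "is", " ", "coming", "!", " ", "你好", "世界", "."}
	fmt.Printf("%q\n", restoreCase(text, lowered))
}
```

If the segmenter drops or rewrites characters (for example with stop words enabled), the assumption no longer holds and the slices will drift, so this only suits the plain `Cut` output shown here.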


I have also looked at the option parameters in https://github.com/go-ego/gse/blob/master/segmenter.go

ivory2406 commented 1 day ago

I would like this behavior to be configurable through a parameter.

ivory2406 commented 1 day ago

@vcaesar Could you help me with the toLower option parameter? Thanks very much.

ivory2406 commented 15 hours ago

@CocaineCong Hello, could you help me with the toLower option parameter? I want to use gse to tokenize sentences and then encode the tokens with mmh3.

Whether a character is lowercase or uppercase matters a lot to me, because the mmh3 value of a word differs depending on its case.
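To illustrate why the case matters, here is a minimal sketch. The Go standard library has no MurmurHash3, so FNV-1a from `hash/fnv` stands in for mmh3 here; that substitution is an assumption for illustration only, and `hashToken` is a hypothetical helper. Any byte-oriented hash shows the same effect, since "Hello" and "hello" are different byte sequences.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashToken (hypothetical helper) returns a 32-bit FNV-1a hash of the
// token's bytes. FNV-1a is only a stand-in for mmh3, chosen because it
// ships with the standard library.
func hashToken(tok string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tok)) // Write on a hash.Hash never returns an error
	return h.Sum32()
}

func main() {
	// The same word in different case hashes to different values.
	for _, tok := range []string{"Hello", "hello", "Winter", "winter"} {
		fmt.Printf("%-6s -> %08x\n", tok, hashToken(tok))
	}
}
```

Once the segmenter lowercases a token, its hash no longer matches the hash of the original cased word, so anything keyed on those hashes stops lining up.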