Closed paulm17 closed 1 year ago
Previously, there was no way to get the user dictionary entry for a token :p In v2.9.0, UserExtra() was added to return the user dictionary information.
Sample code:
package main

import (
	"fmt"

	"github.com/ikawaha/kagome-dict/dict"
	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	udict, err := dict.NewUserDict("user_dict.txt")
	if err != nil {
		panic(err)
	}
	t, err := tokenizer.New(ipa.Dict(), tokenizer.UserDict(udict), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	tokens := t.Analyze("朝顔が咲く", tokenizer.Extended)
	for _, v := range tokens {
		fmt.Printf("%s:\t%s", v.Surface, v.Features())
		if extra := v.UserExtra(); extra != nil {
			fmt.Printf("\t extra: tokens %+v, readings %+v", extra.Tokens, extra.Readings)
		}
		fmt.Println()
	}
}
Output:
朝顔: [あさ かお 朝/顔 あさ/かお] extra: tokens [朝 顔], readings [あさ かお]
が: [助詞 格助詞 一般 * * * が ガ ガ]
咲く: [動詞 自立 * * 五段・カ行イ音便 基本形 咲く サク サク]
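To get 朝 and 顔 as two separate entries, the Tokens and Readings slices returned by UserExtra() can be paired up by index. The following is only a minimal sketch: the segment type and the expand function are hypothetical helpers (they are not part of kagome), and the input values are copied from the output above rather than read from a real tokenizer.

```go
package main

import "fmt"

// segment is a hypothetical helper type (not part of kagome):
// one piece of a user dictionary entry and its reading.
type segment struct {
	Surface string
	Reading string
}

// expand pairs each sub-token with the reading at the same index.
// If there are fewer readings than tokens, the reading is left empty.
func expand(tokens, readings []string) []segment {
	out := make([]segment, 0, len(tokens))
	for i, s := range tokens {
		seg := segment{Surface: s}
		if i < len(readings) {
			seg.Reading = readings[i]
		}
		out = append(out, seg)
	}
	return out
}

func main() {
	// In real code these slices would come from extra.Tokens and
	// extra.Readings; here they are hard-coded from the sample output.
	tokens := []string{"朝", "顔"}
	readings := []string{"あさ", "かお"}
	for _, seg := range expand(tokens, readings) {
		fmt.Printf("%s\t%s\n", seg.Surface, seg.Reading)
	}
}
```

Inside the loop of the sample code, this would be called as expand(extra.Tokens, extra.Readings) whenever UserExtra() returns a non-nil value.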
Thank you for making the change! I really appreciate it. 🚀
I can confirm that it works! 🔥
Funnily enough, it was working when concatenating kanji, but not for my use case, even though I was using code similar to yours.
Quick follow-up: what's the difference between
tokens := t.Tokenize(kanji), which is what I was using before, and
tokens := t.Analyze(kanji, tokenizer.Extended), which is what you have above?
Thanks!
kagome has several segmentation modes; see https://github.com/ikawaha/kagome#segmentation-mode-for-search
t.Tokenize(s) is an alias of t.Analyze(s, tokenizer.Normal).
I'm sorry for the confusion caused by the use of tokenizer.Extended in the sample code above. Please choose the mode that best suits your environment (Normal or Search mode is recommended).
Will do. Thanks again!
I'm using a user dictionary, an entry:
I'm trying to split 朝顔 into 朝 and 顔, so that they come out as two different entries.
How do I achieve this?
Thanks