When using a user dictionary, how to split kanji

paulm17 commented 1 year ago

I'm using a user dictionary with the following entry:

朝顔,朝 顔,あさ かお,あさ かお

I'm trying to split 朝顔 into 朝 and 顔 so that they come out as two separate entries.

How do I achieve this?

Thanks

ikawaha commented 1 year ago

Previously, there was no way to get the user dictionary entry from a token :p In v2.9.0, UserExtra() was added to retrieve the user dictionary information.

Sample code:

package main

import (
    "fmt"

    "github.com/ikawaha/kagome-dict/dict"
    "github.com/ikawaha/kagome-dict/ipa"
    "github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
    // Load the user dictionary that contains the 朝顔 entry.
    udict, err := dict.NewUserDict("user_dict.txt")
    if err != nil {
        panic(err)
    }
    // Build a tokenizer on the IPA dictionary with the user dictionary attached.
    t, err := tokenizer.New(ipa.Dict(), tokenizer.UserDict(udict), tokenizer.OmitBosEos())
    if err != nil {
        panic(err)
    }
    tokens := t.Analyze("朝顔が咲く", tokenizer.Extended)
    for _, v := range tokens {
        fmt.Printf("%s:\t%s", v.Surface, v.Features())
        // UserExtra() is non-nil only for tokens that matched a user dictionary entry.
        if extra := v.UserExtra(); extra != nil {
            fmt.Printf("\t extra: tokens %+v, readings %+v", extra.Tokens, extra.Readings)
        }
        fmt.Println()
    }
}

Output:

朝顔: [あさ かお 朝/顔 あさ/かお]    extra: tokens [朝 顔], readings [あさ かお]
が:  [助詞 格助詞 一般 * * * が ガ ガ]
咲く: [動詞 自立 * * 五段・カ行イ音便 基本形 咲く サク サク]
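To actually get 朝 and 顔 as two separate entries, the segments and readings reported by UserExtra() can be paired up one by one. The following is a minimal sketch using only the standard library; the subEntry type and the expand function are hypothetical names made up for this illustration and are not part of kagome. It assumes the two slices line up index by index, as they do in the output above.

```go
package main

import "fmt"

// subEntry is a hypothetical type for this sketch; it is not part of kagome.
type subEntry struct {
	Surface string
	Reading string
}

// expand pairs each segment with the reading at the same index.
// If there are fewer readings than segments, the missing readings stay empty.
func expand(tokens, readings []string) []subEntry {
	entries := make([]subEntry, 0, len(tokens))
	for i, tok := range tokens {
		e := subEntry{Surface: tok}
		if i < len(readings) {
			e.Reading = readings[i]
		}
		entries = append(entries, e)
	}
	return entries
}

func main() {
	// Values copied from the "extra" part of the output above.
	for _, e := range expand([]string{"朝", "顔"}, []string{"あさ", "かお"}) {
		fmt.Printf("%s\t%s\n", e.Surface, e.Reading)
	}
}
```

In the sample code above, this would presumably be called as expand(extra.Tokens, extra.Readings) inside the branch where v.UserExtra() is non-nil.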

paulm17 commented 1 year ago

Thank you for making the change! I really appreciate it. 🚀

I can confirm that it works! 🔥

Funnily enough, it was working when concatenating kanji but not for my use case, even though I was using code similar to yours.

Quick follow-up: what's the difference between

tokens := t.Tokenize(kanji) (which is what I was using before) and

tokens := t.Analyze(kanji, tokenizer.Extended) (which is what you have above)?

Thanks!

ikawaha commented 1 year ago

kagome has several segmentation modes:

Normal: Regular segmentation
Search: Use a heuristic to do additional segmentation useful for search
Extended: Similar to Search mode, but also splits unknown words into uni-grams

See https://github.com/ikawaha/kagome#segmentation-mode-for-search

t.Tokenize(s) is an alias of t.Analyze(s, tokenizer.Normal).

I'm sorry for the confusion caused by the use of tokenizer.Extended in the sample code above. Please choose the mode that best suits your environment (Normal or Search mode is recommended).

paulm17 commented 1 year ago

Will do. Thanks again!

ikawaha / kagome #293