ikawaha / kagome

Self-contained Japanese Morphological Analyzer written in pure Go
MIT License
818 stars, 53 forks

Is there a way to get the readings for each character #276

Closed: CaptainDario closed this issue 2 years ago

CaptainDario commented 2 years ago

Thanks again for this awesome project and your help!

I have started using kagome and am wondering whether there is a way to get the readings of individual characters. For example:

input: 日本経済新聞
output: 日 - に; 本 - ほん; 経 - けい; 済 - ざい; 新 - しん; 聞 - ぶん;

instead of

input: 日本経済新聞
output: にほんけいざいしんぶん

ikawaha commented 2 years ago

Since readings are assigned to morphemes in the dictionary, it is not possible to retrieve the reading for each kanji character.

In the example, 「日本経済新聞」 is registered as a noun in the dictionary, so it is retrieved as a single morpheme.

If you use the analysis mode intended for search engines (search mode), the text is segmented into smaller morphemes than in normal mode, so you get readings for shorter units.

Note that if a morpheme is not registered in the dictionary (unknown word), the reading for that morpheme is not available.

package main

import (
    "fmt"

    "github.com/ikawaha/kagome-dict/ipa"
    "github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
    t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
    if err != nil {
        panic(err)
    }
    // normal mode
    fmt.Println("--- normal mode ---")
    tokens := t.Analyze("日本経済新聞", tokenizer.Normal)
    for _, token := range tokens {
        reading, _ := token.Reading()
        fmt.Printf("%s - %s; ", token.Surface, reading)
    }
    fmt.Println()

    // search mode
    fmt.Println("--- search mode ---")
    tokens = t.Analyze("日本経済新聞", tokenizer.Search)
    for _, token := range tokens {
        reading, _ := token.Reading()
        fmt.Printf("%s - %s; ", token.Surface, reading)
    }
    fmt.Println()
}

Output:

--- normal mode ---
日本経済新聞 - ニホンケイザイシンブン;
--- search mode ---
日本 - ニッポン; 経済 - ケイザイ; 新聞 - シンブン;
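
The readings printed above are in katakana, while the question showed the expected output in hiragana. As a minimal sketch of one way to bridge that gap, the helper below shifts katakana code points into the hiragana block using only the standard library. `kataToHira` is a hypothetical helper name, not part of kagome, and it assumes full-width katakana input.

```go
package main

import (
	"fmt"
	"strings"
)

// kataToHira is a hypothetical helper (not part of kagome).
// Full-width katakana U+30A1..U+30F6 sit exactly 0x60 above the
// corresponding hiragana U+3041..U+3096, so subtracting 0x60 converts them.
// All other runes, including the long-vowel mark "ー", pass through unchanged.
func kataToHira(s string) string {
	return strings.Map(func(r rune) rune {
		if r >= 0x30A1 && r <= 0x30F6 {
			return r - 0x60
		}
		return r
	}, s)
}

func main() {
	// Readings as returned by the tokenizer example above.
	for _, reading := range []string{"ニッポン", "ケイザイ", "シンブン"} {
		fmt.Println(kataToHira(reading))
	}
}
```

In the tokenizer loop, this would be applied to the string returned by `token.Reading()` before printing.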

CaptainDario commented 2 years ago

Sad that this is not possible, but thank you very much for your help!

KEINOS commented 1 year ago

@CaptainDario (cc: @paulm17, related: #293)

Readings of each character (how to split kanji)

I understand, I've been there too. The difficulty is that a kanji is often read differently depending on the characters around it.

Another complicating factor is that there are different kinds of readings, namely "on-yomi" (音読み) and "kun-yomi" (訓読み).

On-yomi is closer to the Chinese pronunciation from which the kanji was borrowed. Kun-yomi, on the other hand, is the native Japanese reading of the kanji, which expresses its meaning.

However, each kanji does have basic readings. And luckily, Go's rune type makes it easy to pick kanji characters out of a string.

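// Requires the "fmt" and "unicode" packages; input is the string to scan.
for _, r := range input {
    // unicode.Han matches Han-script code points, which include kanji.
    if unicode.In(r, unicode.Han) {
        // string(r) prints the character itself rather than its numeric code point.
        fmt.Println(string(r), "is a kanji")
    }
}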

So I prepared a JSON file for looking up the basic readings of a kanji.

https://gist.github.com/KEINOS/fb660943484008b7f5297bb627e0e1b1
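
As a rough sketch of how such a table could be combined with the rune loop above, the program below looks up candidate readings for each kanji in a string. The JSON shape used here (`{"<kanji>": ["<reading>", ...]}`), the inline sample data, and the names `loadTable` and `lookupReadings` are all assumptions made for illustration; the actual schema of the gist may differ, so the map type would need to be adapted to it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"unicode"
)

// sampleJSON is a hypothetical stand-in for a kanji-to-readings table.
// The schema of the linked gist may differ; adapt the map type to match it.
const sampleJSON = `{"日": ["ニチ", "ひ"], "本": ["ホン", "もと"]}`

// loadTable parses JSON of the assumed shape {"<kanji>": ["<reading>", ...]}.
func loadTable(data []byte) (map[string][]string, error) {
	table := make(map[string][]string)
	err := json.Unmarshal(data, &table)
	return table, err
}

// lookupReadings lists the candidate readings for every kanji in input.
// Non-kanji runes are skipped; kanji missing from the table are marked.
func lookupReadings(input string, table map[string][]string) []string {
	var out []string
	for _, r := range input {
		if !unicode.In(r, unicode.Han) {
			continue
		}
		readings, ok := table[string(r)]
		if !ok {
			out = append(out, fmt.Sprintf("%c - (no entry)", r))
			continue
		}
		out = append(out, fmt.Sprintf("%c - %s", r, strings.Join(readings, "/")))
	}
	return out
}

func main() {
	table, err := loadTable([]byte(sampleJSON))
	if err != nil {
		panic(err)
	}
	for _, line := range lookupReadings("日本の新聞", table) {
		fmt.Println(line)
	}
}
```

Note that this only lists candidates per character. Choosing the reading that is actually used in a given word still depends on context, which is the part a per-character table cannot resolve.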

Hope this helps. 🤞