Is there a way to get the readings for each character

CaptainDario commented 2 years ago

Thanks again for this awesome project and your help!

I got started using kagome and now I am wondering if there is a way to get the readings of individual characters. For example:

input: 日本経済新聞
output: 日 - に; 本 - ほん; 経 - けい; 済 - ざい; 新 - しん; 聞 - ぶん;

instead of

input: 日本経済新聞
output: にほんけいざいしんぶん

ikawaha commented 2 years ago

Since readings are assigned to morphemes in the dictionary, it is not possible to retrieve the reading for each kanji character.

In the example, 「日本経済新聞」 is registered as a noun in the dictionary, so it is retrieved as a single morpheme.

If you use search mode (the analysis mode intended for search engines), compound words are split into smaller morphemes than in normal mode.

Note that if a morpheme is not registered in the dictionary (unknown word), the reading for that morpheme is not available.

package main

import (
    "fmt"

    "github.com/ikawaha/kagome-dict/ipa"
    "github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
    t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
    if err != nil {
        panic(err)
    }
    // normal mode
    fmt.Println("--- normal mode ---")
    tokens := t.Analyze("日本経済新聞", tokenizer.Normal)
    for _, token := range tokens {
        reading, _ := token.Reading()
        fmt.Printf("%s - %s; ", token.Surface, reading)
    }
    fmt.Println()

    // search mode
    fmt.Println("--- search mode ---")
    tokens = t.Analyze("日本経済新聞", tokenizer.Search)
    for _, token := range tokens {
        reading, _ := token.Reading()
        fmt.Printf("%s - %s; ", token.Surface, reading)
    }
    fmt.Println()
}

--- normal mode ---
日本経済新聞 - ニホンケイザイシンブン;
--- search mode ---
日本 - ニッポン; 経済 - ケイザイ; 新聞 - シンブン;
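
The readings above come back in katakana, while the output asked for in the question is in hiragana. A minimal sketch of a post-processing step that could bridge the two, using only the standard library: `KatakanaToHiragana` is a hypothetical helper, not part of kagome, and it relies on the Unicode layout in which katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana counterparts.

```go
package main

import "fmt"

// KatakanaToHiragana is a hypothetical helper (not a kagome API).
// It shifts katakana in U+30A1..U+30F6 down by 0x60 to the matching
// hiragana and leaves every other rune, such as the long-vowel mark
// or kanji, unchanged.
func KatakanaToHiragana(s string) string {
	out := make([]rune, 0, len(s))
	for _, r := range s {
		if r >= 0x30A1 && r <= 0x30F6 {
			r -= 0x60
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	// Readings taken from the search mode output above.
	for _, reading := range []string{"ニッポン", "ケイザイ", "シンブン"} {
		fmt.Println(KatakanaToHiragana(reading))
	}
}
```

This only changes the script of a reading. It does not split a morpheme's reading across its individual kanji.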

CaptainDario commented 2 years ago

Sad that this is not possible, but thank you very much for your help!

KEINOS commented 1 year ago

@CaptainDario (cc: @paulm17, related: #293)

Readings of each character (how to split kanjis)

I understand, I've been there too. The difficulty is that the same kanji is often read differently depending on the characters around it, and even the same word can have more than one reading.

e.g.
- 日本経済 → nihon keizai（ニホンケイザイ）
- 日本 → nippon（ニッポン） or nihon（ニホン）

Another complicating factor is the difference in reading styles, such as "on-yomi"（音読み） and "kun-yomi"（訓読み）.

"On-yomi" is the reading that approximates the Chinese pronunciation the kanji was borrowed with. "Kun-yomi", on the other hand, is the native Japanese word that was attached to the kanji to express its meaning.

e.g.
- 日
  - on_yomi: "ニチ", "ジツ"
  - kun_yomi: "ひ", "か"

Still, every kanji has a set of basic readings. And luckily, ranging over a string in Go yields runes, which makes it easy to pick the kanji characters out of a string.

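for _, r := range input {
    // unicode.Han is the range table for the Han script, i.e. kanji
    if unicode.In(r, unicode.Han) {
        // string(r) prints the character itself; a bare rune prints as a number
        fmt.Println(string(r), "is a kanji")
    }
}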

So I prepared a JSON file for looking up the basic readings of a kanji.

https://gist.github.com/KEINOS/fb660943484008b7f5297bb627e0e1b1
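
As a rough sketch of how such a lookup could be wired up: the `Readings` type, the `basicReadings` table and `LookupKanji` below are assumptions made for illustration. They do not reproduce the schema of the JSON file, and the table is a tiny hand-written stand-in for it.

```go
package main

import (
	"fmt"
	"unicode"
)

// Readings holds the basic readings of one kanji.
// The field layout is an assumption, not the schema of the JSON file.
type Readings struct {
	On  []string // on-yomi, written in katakana
	Kun []string // kun-yomi, written in hiragana
}

// basicReadings is a hand-written stand-in for the real data.
var basicReadings = map[rune]Readings{
	'日': {On: []string{"ニチ", "ジツ"}, Kun: []string{"ひ", "か"}},
	'本': {On: []string{"ホン"}, Kun: []string{"もと"}},
}

// KanjiReading pairs a kanji with its candidate readings.
type KanjiReading struct {
	Kanji    rune
	Readings Readings
	Known    bool // false when the kanji is missing from the table
}

// LookupKanji returns one entry per Han character in input, in order.
// Kana, Latin letters and punctuation are skipped.
func LookupKanji(input string) []KanjiReading {
	var out []KanjiReading
	for _, r := range input {
		if !unicode.In(r, unicode.Han) {
			continue
		}
		rd, ok := basicReadings[r]
		out = append(out, KanjiReading{Kanji: r, Readings: rd, Known: ok})
	}
	return out
}

func main() {
	for _, kr := range LookupKanji("日本のニュース") {
		if !kr.Known {
			fmt.Printf("%c: no entry\n", kr.Kanji)
			continue
		}
		fmt.Printf("%c: on=%v kun=%v\n", kr.Kanji, kr.Readings.On, kr.Readings.Kun)
	}
}
```

Keep in mind that this only lists candidate readings per character. It cannot tell you which one applies in a given word: 日本 is read ニホン or ニッポン, which is not a simple concatenation of the entries for 日 and 本.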

Hope this helps. 🤞

ikawaha / kagome

Is there a way to get the readings for each character #276