CaptainDario closed this 2 years ago
Since readings are assigned to morphemes in the dictionary, it is not possible to retrieve the reading for each kanji character.
In the example, 「日本経済新聞」 is registered as a noun in the dictionary, so it is retrieved as a single morpheme.
If you use the analysis mode intended for search engines (search mode), text is segmented into smaller morphemes than in normal mode.
Note that if a morpheme is not registered in the dictionary (unknown word), the reading for that morpheme is not available.
	package main

	import (
		"fmt"

		"github.com/ikawaha/kagome-dict/ipa"
		"github.com/ikawaha/kagome/v2/tokenizer"
	)

	func main() {
		t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
		if err != nil {
			panic(err)
		}

		// normal mode
		fmt.Println("--- normal mode ---")
		tokens := t.Analyze("日本経済新聞", tokenizer.Normal)
		for _, token := range tokens {
			reading, _ := token.Reading()
			fmt.Printf("%s - %s; ", token.Surface, reading)
		}
		fmt.Println()

		// search mode
		fmt.Println("--- search mode ---")
		tokens = t.Analyze("日本経済新聞", tokenizer.Search)
		for _, token := range tokens {
			reading, _ := token.Reading()
			fmt.Printf("%s - %s; ", token.Surface, reading)
		}
		fmt.Println()
	}
--- normal mode ---
日本経済新聞 - ニホンケイザイシンブン;
--- search mode ---
日本 - ニッポン; 経済 - ケイザイ; 新聞 - シンブン;
Sad that this is not possible, but thank you very much for your help!
@CaptainDario (cc: @paulm17, related: #293)
Readings of each character (how to split kanjis)
I understand, I've been there too. The difficulty lies in the fact that kanji are often pronounced differently depending on the surrounding characters.
日本経済: nihon keizai (ニホンケイザイ)
日本: nippon (ニッポン) or nihon (ニホン)

Another complicating factor is the difference in reading styles, such as "on-yomi" (音読み) and "kun-yomi" (訓読み).
The "on-yomi" reading is closer to the Chinese pronunciation from which the kanji are derived. The "kun-yomi", on the other hand, is the conventional Japanese reading of each kanji, which expresses its meaning. For example, for 日:
on_yomi: "ニチ", "ジツ"
kun_yomi: "ひ", "か"

However, there are basic readings for kanji. And luckily, in Go we have rune, which lets us extract kanji characters from strings.
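	// Needs the "fmt" and "unicode" packages; input is the string to scan.
	for _, r := range input {
		if unicode.In(r, unicode.Han) {
			// string(r) prints the character itself rather than its code point.
			fmt.Println(string(r), "is a kanji")
		}
	}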
So, I prepared a JSON file to help find out the basic readings of a kanji.
https://gist.github.com/KEINOS/fb660943484008b7f5297bb627e0e1b1
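To show how the two pieces could fit together, here is a minimal sketch that picks the kanji out of a string and looks up their basic readings. It uses only the standard library. The JSON layout (one object per kanji with `on_yomi` and `kun_yomi` arrays) is an assumption for illustration and may not match the actual file in the gist. `sampleJSON`, `KanjiReadings`, `LoadReadings` and `ExtractKanji` are hypothetical names, and the sample data is hand-written.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode"
)

// KanjiReadings holds the basic readings of one kanji.
// ASSUMPTION: the JSON layout is illustrative, not the gist's confirmed schema.
type KanjiReadings struct {
	OnYomi  []string `json:"on_yomi"`
	KunYomi []string `json:"kun_yomi"`
}

// sampleJSON is a hypothetical, hand-written stand-in for the real data file.
const sampleJSON = `{
  "日": {"on_yomi": ["ニチ", "ジツ"], "kun_yomi": ["ひ", "か"]},
  "本": {"on_yomi": ["ホン"], "kun_yomi": ["もと"]}
}`

// LoadReadings parses the JSON into a map keyed by a single kanji.
func LoadReadings(data []byte) (map[string]KanjiReadings, error) {
	table := map[string]KanjiReadings{}
	if err := json.Unmarshal(data, &table); err != nil {
		return nil, err
	}
	return table, nil
}

// ExtractKanji returns the Han characters of input in order of appearance.
func ExtractKanji(input string) []rune {
	var kanji []rune
	for _, r := range input {
		if unicode.In(r, unicode.Han) {
			kanji = append(kanji, r)
		}
	}
	return kanji
}

func main() {
	table, err := LoadReadings([]byte(sampleJSON))
	if err != nil {
		panic(err)
	}
	for _, r := range ExtractKanji("日本のニュース") {
		readings, ok := table[string(r)]
		if !ok {
			fmt.Printf("%c: no entry\n", r)
			continue
		}
		fmt.Printf("%c: on=%v kun=%v\n", r, readings.OnYomi, readings.KunYomi)
	}
}
```

Keep in mind that a lookup like this only lists the candidate readings of each kanji. It cannot tell you which reading applies inside a particular word, which is the context-dependence problem described above.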
Hope this helps. 🤞
Thanks again for this awesome project and your help!
I got started using kagome and now I am wondering if there is a way to get the readings of individual characters. For example:
instead of