There is nothing better than better documentation

ikawaha commented 2 years ago

KEINOS Thank you very much! Maybe this is obvious stuff and one is expected to know this, but I think it would be nice to include something like your comment in the README.

Originally posted by @CaptainDario in https://github.com/ikawaha/kagome/issues/274#issuecomment-1198047786

ikawaha commented 2 years ago

CaptainDario Indeed. There is nothing better than better documentation!

ikawaha, if the above explanation is ok, I would like to PR somewhere, where should I write? In the Wiki, maybe?

@KEINOS Thanks for the suggestion.

Would it be better to put the details in the wiki and link to it from the README? The wiki of this repository is open and you are free to add to it.

KEINOS commented 2 years ago

@ikawaha (cc: @CaptainDario)

The wiki of this repository is open and you are free to add to it.

Thank you!

Would it be better to put the details in the wiki and link to it from the README?

I agree. I would like to start with a "one keyword per page" structure for the Wiki. For example, start with "wakati". We should think more about this when we have more keywords, shouldn't we?

ikawaha commented 2 years ago

I have no idea 😇 , so let's start with "wakati". There is extensive documentation on janome, which may be helpful.

KEINOS commented 1 year ago

@ikawaha (cc: @CaptainDario)

I have finally started editing the Wiki. But I think it is premature to link from the README.md, as I am just copying and pasting the issues.

Ideally, we would like to translate the official Japanese documentation into English. However, for the time being, it would be realistic to add topics one by one to the Wiki and later create a separate repository for kagome-doc.

Also, I was looking at the official documentation and thought that enriching ExampleXXX and godoc would be a Golang approach.

ikawaha commented 1 year ago

Thank you so much!

It's great! 🙏

Even at this point, we have a few Example tests, and it's a great idea to enrich ExampleXXX and godoc. ( '-`).oO( But, they may not work with go-playground because the build timed out 😇.

e.g. Example test for the word filter https://github.com/ikawaha/kagome/blob/a16f9337a12438750ff98d28a2ea8e23cb47ac4d/filter/word_test.go#L167-L187)

KEINOS commented 1 year ago

But, they may not work with go-playground because the build timed out 😇.

Yes, indeed. Go Playground is a no-go for kagome for now 😭

However, as long as godoc can run ExampleXXX, it is worth including them whenever possible.

https://pkg.go.dev/github.com/ikawaha/kagome/v2@v2.9.0/filter#example-WordFilter

How about creating an _example directory and putting some working examples there? Along with Wiki and godoc improvements, of course.

Example @ go-sqlite3
- https://github.com/mattn/go-sqlite3/tree/master/_example

ikawaha commented 1 year ago

How about creating an _example directory and putting some working examples there? Along with Wiki and godoc improvements, of course.

It sounds good 👍.

I created the ./sample/_example folder for adding working examples in PR #296.

CaptainDario commented 1 year ago

@KEINOS I am currently playing around with the different dictionaries. While doing this I figured out that, when using unidic processing: 私は日本人です。

Results in [代名詞, , , , , , ワタクシ, 私-代名詞, 私, ワタクシ, 私, ワタクシ, 和, , , , ], [助詞-係助詞, 係助詞, , , , , ハ, は, は, , は, ワ, 和, , , , ], [名詞-固有名詞-地名-国, 固有名詞, 地名, 国, , , ニッポン, 日本, 日本, ニッポン, 日本, ニッポン, 固, , , , ], [接尾辞-名詞的-一般, 名詞的, 一般, , , , ニン, 人, 人, ニン, 人, ニン, 漢, , , , ], [助動詞, , , , 助動詞-デス, 終止形-一般, デス, です, です, , です, デス, 和, , , , ], [補助記号-句点, 句点, , , , , , 。, 。, , 。, , 記号, , , , *]

Notice: ワタクシ, ニン

However, when running with ipadic the result is

[名詞, 代名詞, 一般, , , , 私, ワタシ, ワタシ], [助詞, 係助詞, , , , , は, ハ, ワ], [名詞, 一般, , , , , 日本人, ニッポンジン, ニッポンジン], [助動詞, , , , 特殊・デス, 基本形, です, デス, デス], [記号, 句点, , , , , 。, 。, 。]

Notice: ワタシ, ニッポンジン

I think the results from using ipadic are clearly better. While I really appreciate your previous answer (and creating the wiki), could I ask you to elaborate a bit more on what the disadvantages/advantages of the different dictionaries are? I thought:

Accuracy of results: ipadic < unidic < neologd
Size / speed: neologd < unidic < ipadic

But that seems to not really hold.

KEINOS commented 1 year ago

@CaptainDario

I thought: Accuracy of results: ipadic < unidic < neologd; Size / speed: neologd < unidic < ipadic. But that seems to not really hold.

As you point out, the size of a dictionary relates to its speed (the larger, the slower), but size is not proportional to accuracy.

In my personal experience, I believe that they can be classified as follows:

Size: ipadic < unidic < neologd
Speed: neologd < unidic < ipadic
Accuracy:
- grammar analysis: unidic < ipadic < neologd
- word split by proper noun: ipadic < unidic < neologd
- word split by general-purpose: neologd < ipadic < unidic

This is because each dictionary is created for a different purpose and requires different precision.

what the disadvantages/advantages of the different dictionaries are?

tl; dr

In summary, IPADIC is typically used for grammatical analysis and UNIDIC for retrieval analysis. IPADIC is lightweight and accurate in most use cases and UNIDIC is good for word-splitting for word search purposes.

IPADIC is recommended when part of speech (PoS) is important.

For example, when PoS is used as an information vector for analysis, machine learning, etc. NEologd is essentially IPADIC plus a user dictionary: it has been extended by the community to cover new vocabulary missing from IPADIC. However, it is huge.

UNIDIC, on the other hand, is recommended when it is necessary to split a sentence into smaller units for retrieval, as in search engines.

For example, when a search engine needs to measure the distance between the divided units (Levenshtein distance or cosine similarity, for instance), or when each unit ID (word ID? token?) is used as a discrete feature value for machine learning.
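A minimal sketch of what "distance between the divided units" could look like, assuming the distance is a Levenshtein distance computed over whole tokens rather than characters. The function name `tokenDistance` is hypothetical and not part of kagome; the token lists are the two segmentations of 私は日本人です。 reported in this thread.

```go
package main

import "fmt"

// tokenDistance returns the Levenshtein distance between two token
// sequences. Each token, not each character, counts as one symbol.
// (Hypothetical helper for illustration; not a kagome API.)
func tokenDistance(a, b []string) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j // turning an empty prefix into b[:j] takes j insertions
	}
	for i := 1; i <= len(a); i++ {
		curr := make([]int, len(b)+1)
		curr[0] = i // turning a[:i] into an empty sequence takes i deletions
		for j := 1; j <= len(b); j++ {
			best := prev[j-1] // substitution, free when the tokens match
			if a[i-1] != b[j-1] {
				best++
			}
			if d := prev[j] + 1; d < best { // deletion
				best = d
			}
			if d := curr[j-1] + 1; d < best { // insertion
				best = d
			}
			curr[j] = best
		}
		prev = curr
	}
	return prev[len(b)]
}

func main() {
	// Segmentations as reported in this thread for the same sentence.
	ipa := []string{"私", "は", "日本人", "です", "。"}
	uni := []string{"私", "は", "日本", "人", "です", "。"}
	fmt.Println(tokenDistance(ipa, uni)) // prints 2
}
```

The two segmentations differ by one substitution (日本人 to 日本) and one insertion (人), so the distance depends on which dictionary produced the units.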

It depends on what and how you are analyzing, but in my opinion, I would recommend using IPADIC plus a home-made user dictionary.

ts; dr

Disadvantage of UNIDIC

As you may have already experienced, you may be uncomfortable with the difference in accuracy and speed of delimitation. Compared to IPADIC, UNIDIC seems to be less accurate despite its larger amount of information (larger dictionary size).

$ # IPA DICT
$ time echo "私は日本人です。" | kagome -sysdict ipa
私   名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は   助詞,係助詞,*,*,*,*,は,ハ,ワ
日本人 名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン
です  助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。   記号,句点,*,*,*,*,。,。,。
EOS

real    0m1.021s
user    0m1.114s
sys 0m0.090s

$ # UNI DICT
$ time echo "私は日本人です。" | kagome -sysdict uni
私   代名詞,*,*,*,*,*,ワタクシ,私-代名詞,私,ワタクシ,私,ワタクシ,和,*,*,*,*
は   助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
日本  名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*
人   接尾辞,名詞的,一般,*,*,*,ニン,人,人,ニン,人,ニン,漢,*,*,*,*
です  助動詞,*,*,*,助動詞-デス,終止形-一般,デス,です,です,デス,です,デス,和,*,*,*,*
。   補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

real    0m4.807s
user    0m5.303s
sys 0m0.273s

The problem here is the difference between "日本人" and "日本, 人".

UNIDIC is a dictionary based on "short units" (短単位, たんたんい) defined by NINJAL to facilitate the collection of examples for the BCCWJ.

NINJAL (National Institute of Japanese Language and Linguistics)
BCCWJ (Balanced Corpus of Contemporary Written Japanese)

These "short units" are known to be too short to be useful in "natural language processing" for syntactic and semantic analysis.

Thus, in most use cases, IPADIC is faster and more convenient. This is why my recommendation is to use IPADIC with a custom user dictionary.
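To check a difference such as "日本人" versus "日本, 人" mechanically, the CLI output shown above can be parsed into tokens. The following is only a sketch: the `token` type and `parse` function are hypothetical names, and it assumes each output line is a surface form, whitespace (a tab in the sketch), and a comma-separated feature list, terminated by an `EOS` line.

```go
package main

import (
	"fmt"
	"strings"
)

// token holds one line of kagome CLI output.
// (Hypothetical type for illustration; not the kagome tokenizer.Token.)
type token struct {
	Surface  string
	Features []string
}

// parse turns CLI output into tokens, skipping "EOS" and blank lines.
// Assumption: the surface and the feature list are separated by whitespace
// and neither of them contains whitespace itself.
func parse(output string) []token {
	var tokens []token
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue // "EOS", an empty line, or an unexpected layout
		}
		tokens = append(tokens, token{
			Surface:  fields[0],
			Features: strings.Split(fields[1], ","),
		})
	}
	return tokens
}

func main() {
	// IPADIC output copied from this thread.
	ipaOut := "私\t名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ\n" +
		"は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n" +
		"日本人\t名詞,一般,*,*,*,*,日本人,ニッポンジン,ニッポンジン\n" +
		"です\t助動詞,*,*,*,特殊・デス,基本形,です,デス,デス\n" +
		"。\t記号,句点,*,*,*,*,。,。,。\n" +
		"EOS\n"
	for _, t := range parse(ipaOut) {
		// Print the surface and the first feature (the part of speech).
		fmt.Printf("%s\t%s\n", t.Surface, t.Features[0])
	}
}
```

Running the same parser over the UNIDIC output would yield six tokens instead of five, which makes the extra split at 日本 / 人 easy to detect in a test.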

Advantage and use cases of UNIDIC

An advantage of UNIDIC is the "consistency" in word segmentation.

The difference between the two dictionaries, IPA and UNI, is illustrated by a well-known example.

"りんごジュースを飲んだ。" vs "リンゴジュースを飲んだ。"

Both are correct and mean the same thing: "I drank apple juice".

But, intuitively, "りんごジュース" is easier to read than "リンゴジュース" because the words are visually separated (hiragana-katakana mixture vs. all katakana).

And both dictionaries include the words "りんご" and "リンゴ" as nouns (名詞).

$ # IPA DICT
$ echo "りんご" | kagome -sysdict ipa
りんご 名詞,一般,*,*,*,*,りんご,リンゴ,リンゴ
EOS

$ echo "リンゴ" | kagome -sysdict ipa
リンゴ 名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
EOS

$ # UNI DICT
$ echo "りんご" | kagome -sysdict uni
りんご 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
EOS

$ echo "リンゴ" | kagome -sysdict uni
リンゴ 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
EOS

And here comes the problem.

$ # IPA DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict ipa
りん  副詞,助詞類接続,*,*,*,*,りん,リン,リン
ご   接頭詞,名詞接続,*,*,*,*,ご,ゴ,ゴ
ジュース    名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を   助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん  動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ   助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。   記号,句点,*,*,*,*,。,。,。
EOS

$ # UNI DICT
$ echo "りんごジュースを飲んだ。" | kagome -sysdict uni
りんご 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,りんご,リンゴ,りんご,リンゴ,漢,*,*,*,*
ジュース    名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を   助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん  動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ   助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。   補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

Note the difference between "りん, ご" and "りんご".

IPADIC recognized "りんご" as an adverb/prefix (副詞/接頭詞) combination and UNIDIC as a noun (名詞).

The simplest solution, apart from registering a user dictionary, is to use katakana notation.

$ # IPADICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict ipa
リンゴ 名詞,一般,*,*,*,*,リンゴ,リンゴ,リンゴ
ジュース    名詞,一般,*,*,*,*,ジュース,ジュース,ジュース
を   助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
飲ん  動詞,自立,*,*,五段・マ行,連用タ接続,飲む,ノン,ノン
だ   助動詞,*,*,*,特殊・タ,基本形,だ,ダ,ダ
。   記号,句点,*,*,*,*,。,。,。
EOS

$ # UNIDICT
$ echo "リンゴジュースを飲んだ。" | kagome -sysdict uni
リンゴ 名詞,普通名詞,一般,*,*,*,リンゴ,林檎,リンゴ,リンゴ,リンゴ,リンゴ,漢,*,*,*,*
ジュース    名詞,普通名詞,一般,*,*,*,ジュース,ジュース-juice,ジュース,ジュース,ジュース,ジュース,外,*,*,*,*
を   助詞,格助詞,*,*,*,*,ヲ,を,を,オ,を,オ,和,*,*,*,*
飲ん  動詞,一般,*,*,五段-マ行,連用形-撥音便,ノム,飲む,飲ん,ノン,飲む,ノム,和,*,*,*,*
だ   助動詞,*,*,*,助動詞-タ,終止形-一般,タ,た,だ,ダ,だ,ダ,和,*,*,*,*
。   補助記号,句点,*,*,*,*,,。,。,,。,,記号,*,*,*,*
EOS

The difference is that IPADIC attempted to interpret them grammatically, while UNIDIC interpreted them in short units.

"日本人" (noun) vs "日本, 人" (noun + suffix)
"りん, ご, ジュース" (adverb + prefix + noun) vs "りんご, ジュース" (noun + noun)

In both cases, the latter delimitation is divided into units suitable for search engines, etc.

This means that "short units" are effective in unifying the units of "search examples" in search engines and other information retrieval systems.

Thus, UNIDIC has the advantage for word-search purposes.
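A rough sketch of why the segmentation matters for retrieval, assuming a toy search engine that only matches whole tokens exactly (real engines are more elaborate). The `buildIndex` helper is hypothetical; the token lists are the IPADIC and UNIDIC segmentations shown in this thread.

```go
package main

import "fmt"

// buildIndex maps every token to the IDs of the documents that contain it.
// docs[i] is the token list of document i.
// (Hypothetical helper for illustration; not a kagome API.)
func buildIndex(docs [][]string) map[string][]int {
	index := make(map[string][]int)
	for id, tokens := range docs {
		for _, tok := range tokens {
			ids := index[tok]
			if len(ids) > 0 && ids[len(ids)-1] == id {
				continue // token already recorded for this document
			}
			index[tok] = append(ids, id)
		}
	}
	return index
}

func main() {
	// Document 0: 私は日本人です。  Document 1: りんごジュースを飲んだ。
	ipaDocs := [][]string{
		{"私", "は", "日本人", "です", "。"},
		{"りん", "ご", "ジュース", "を", "飲ん", "だ", "。"},
	}
	uniDocs := [][]string{
		{"私", "は", "日本", "人", "です", "。"},
		{"りんご", "ジュース", "を", "飲ん", "だ", "。"},
	}
	ipaIndex, uniIndex := buildIndex(ipaDocs), buildIndex(uniDocs)
	for _, q := range []string{"日本", "りんご"} {
		fmt.Printf("%s: ipa=%v uni=%v\n", q, ipaIndex[q], uniIndex[q])
	}
}
```

Under this exact-match assumption, the queries 日本 and りんご find nothing in the IPADIC-based index but hit documents 0 and 1 in the UNIDIC-based one.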

@CaptainDario Are you convinced by this explanation?
@ikawaha Am I on the right track in my explanation?

Let me know so I can fix it and add it to the Wiki.

CaptainDario commented 1 year ago

@KEINOS well first of all thank you for this very detailed explanation. It really helped me a lot! I think this should definitely be added to the wiki, for starters this is gold.

In your opinion, is neologd worth it over standard ipadic for Japanese NLP?

KEINOS commented 1 year ago

It really helped me a lot! I think this should definitely be added to the wiki, for starters this is gold.

I'm glad to hear that! So far so good. 👍

In your opinion, is neologd worth it over standard ipadic for Japanese NLP?

Neologd is a great dictionary. However, for my current usage, I choose IPADIC. If speed is not important, it is worth using Neologd, which is just an extension of IPADIC.

Actually, there is a Japanese text linter implemented in JavaScript, but due to speed issues and the need to install Node.js separately, I was secretly struggling to implement it in Go with Kagome.

However, the dictionary lookup part seems to be the bottleneck, and even with a simple test implementation using Neologd, its speed is not as good as the original Textlint. So I'm currently losing motivation to build a text linter in Go.

I wish I could help speed up Kagome, but I just started learning Go in earnest after this Corona disaster thing, so I can't keep up with its technology yet. 😭 Documenting is the only thing I can contribute for now.

KEINOS commented 1 year ago

@CaptainDario (cc: @ikawaha )

FYI, I added the FAQ and a document about it to the wiki. Feel free to fix them!

KEINOS commented 5 months ago

JFYI. I added the below article to the Wiki.

Kagome As a Server Side Tokenizer | Wiki | kagome @ GitHub

ikawaha commented 5 months ago

@KEINOS Thank you for the very clear explanation. I appreciate your contribution.
