neologd / mecab-ipadic-neologd

Neologism dictionary based on the language resources on the Web for mecab-ipadic

Negative cost #64

Closed kota7 closed 4 years ago

kota7 commented 4 years ago

Thanks first for the great database.

Motivation

I found that some words in the data are assigned negative costs.

$ cat mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-20191111/mecab-user-dict-seed.20191111.csv | grep "ファニチャーロウ"
ファニチャーロウレーシング,1288,1288,-5111,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング
ファニチャー・ロウ・レーシング,1288,1288,-9029,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング

Costs are lower for more frequent words, but the examples above do not seem frequent enough to warrant such a low cost. I suspect this could be the result of an integer overflow or something of the sort.

Goal

I would like to know: (1) whether this is a correct/intended result or a bug, and (2) if it is correct/intended, how negative costs should be interpreted.

Can someone help me with this?

neologd commented 4 years ago

Thank you for your frank question.

In short, we think this case is correct and not a bug. Using negative integer values within the range of a 2-byte integer as cost values conforms to the IPADIC specification.
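As a quick sanity check of the 2-byte range, the sketch below parses the two entries quoted in this issue and tests whether each cost fits in a signed 16-bit integer. It assumes the usual seed CSV layout (surface, left context ID, right context ID, cost, then feature columns); the helper names `read_costs` and `fits_int16` are made up for this example and are not part of any tool.

```python
import csv
import io

# Bounds of a signed 16-bit (2-byte) integer.
INT16_MIN, INT16_MAX = -32768, 32767

# The two entries quoted in this issue.
# Assumed column layout: surface, left ID, right ID, cost, features...
SAMPLE = """\
ファニチャーロウレーシング,1288,1288,-5111,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング
ファニチャー・ロウ・レーシング,1288,1288,-9029,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング
"""


def read_costs(text):
    """Return (surface, cost) pairs from seed CSV text (hypothetical helper)."""
    return [(row[0], int(row[3])) for row in csv.reader(io.StringIO(text)) if row]


def fits_int16(cost):
    """True if the cost is representable as a signed 2-byte integer."""
    return INT16_MIN <= cost <= INT16_MAX


costs = read_costs(SAMPLE)
for surface, cost in costs:
    print(surface, cost, "ok" if fits_int16(cost) else "out of range")
```

Both quoted costs are negative but sit well inside the range, which is consistent with them being ordinary values rather than the product of an overflow.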

Also, the cost value given to each word is not necessarily based on how frequently the word is observed in the real world or in a corpus.

Chapter 5 (p. 79 onward) of the following book will help you understand how the different cost values are used in the analysis process.

https://www.amazon.co.jp/dp/B07J1NBNYW/ref=tmm_kin_swatch_0

If you don't have this book, we strongly recommend reading it.

The following slides (p. 9 onward) by the same author are also very helpful.

https://www.jtpa.org/wp-content/uploads/2014/06/MeCab.pdf
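As a rough illustration of how a negative cost can be interpreted, here is a toy sketch of a minimum-cost segmentation search. It is a heavy simplification and not MeCab's implementation: the dictionary, the cost values, and the flat `CONNECTION_COST` are all invented for this example, whereas a real analyzer looks up connection costs per pair of context IDs.

```python
# Toy dictionary: surface -> word cost. All values are invented.
WORD_COST = {
    "ab": -500,  # a compound entry with a negative cost
    "a": 300,
    "b": 400,
}

# Hypothetical flat connection cost charged between two adjacent words.
CONNECTION_COST = 100


def best_segmentation(text):
    """Return (total_cost, words) for the cheapest segmentation, or None."""
    # best[i] holds the cheapest (cost, words) covering text[:i].
    best = {0: (0, [])}
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if start not in best or word not in WORD_COST:
                continue
            prev_cost, prev_words = best[start]
            cost = prev_cost + WORD_COST[word]
            if prev_words:  # a connection cost applies between two words
                cost += CONNECTION_COST
            if end not in best or cost < best[end][0]:
                best[end] = (cost, prev_words + [word])
    return best.get(len(text))


print(best_segmentation("ab"))
```

In this toy setup the single entry "ab" wins over "a" + "b" because its total cost is lower. Only the relative totals matter, so a negative cost simply makes an entry more strongly preferred, and since every word moves forward through the text, negative values do not disturb the search.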

Thank you very much.

kota7 commented 4 years ago

Thanks for the answer. This helps a lot.