kota7 closed this issue 4 years ago
Thank you for your frank question.
In short, we believe this case is correct and not a bug. Using negative integer values within the range of a 2-byte integer as cost values conforms to the IPADIC specification.
Also, the cost values assigned to words are not necessarily based on how frequently the words are observed in the real world or in a corpus.
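As a rough illustration of the point about the 2-byte range, the sketch below checks that a negative cost is an ordinary signed 16-bit value and not the product of an overflow. The helper names, the little-endian byte order, and the sample costs are assumptions made for this example; they are not taken from the dictionary or from the IPADIC tooling.

```python
import struct

# Range of a signed 2-byte (16-bit) integer.
INT16_MIN, INT16_MAX = -32768, 32767


def fits_int16(cost: int) -> bool:
    """Return True if cost can be stored as a signed 2-byte integer."""
    return INT16_MIN <= cost <= INT16_MAX


def roundtrip_int16(cost: int) -> int:
    """Pack cost into 2 bytes and unpack it again.

    struct.pack raises struct.error if the value is out of range,
    so a value that round-trips unchanged was never truncated.
    The "<h" format (little-endian signed short) is an assumption
    chosen for illustration.
    """
    return struct.unpack("<h", struct.pack("<h", cost))[0]


# Hypothetical cost values, not real dictionary entries.
for cost in (-1500, 0, 12000):
    assert fits_int16(cost)
    assert roundtrip_int16(cost) == cost
```

A negative cost that passes both checks is simply a small signed number, which is consistent with the explanation above.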
Chapter 5 (p. 79 onward) of the following book will help you understand how the different cost values are used in the analysis process.
https://www.amazon.co.jp/dp/B07J1NBNYW/ref=tmm_kin_swatch_0
If you don't have this book, we strongly recommend reading it.
The following slides (p. 9 onward) by the same author are also very helpful.
https://www.jtpa.org/wp-content/uploads/2014/06/MeCab.pdf
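For a rough intuition before reading those materials: MeCab-style analyzers pick the segmentation whose total cost (word costs plus connection costs) is smallest, so a negative word cost simply makes that word more strongly preferred. The toy sketch below is not taken from the book or the slides. The words, the cost values, and the `path_cost` helper are all hypothetical, and connection costs are keyed by word pairs for brevity, whereas a real dictionary keys them by context IDs.

```python
def path_cost(words, word_cost, conn_cost):
    """Sum word costs and connection costs along one candidate path.

    BOS/EOS mark the beginning and end of the sentence; costs that are
    missing from the tables are treated as 0 in this simplified sketch.
    """
    total = 0
    prev = "BOS"
    for word in words + ["EOS"]:
        total += conn_cost.get((prev, word), 0)  # cost of joining prev -> word
        total += word_cost.get(word, 0)          # cost of the word itself
        prev = word
    return total


# Hypothetical costs: "AB" has a negative word cost.
word_cost = {"AB": -200, "A": 300, "B": 400}
conn_cost = {
    ("BOS", "AB"): 100, ("AB", "EOS"): 50,
    ("BOS", "A"): 10, ("A", "B"): 20, ("B", "EOS"): 30,
}

# Two candidate segmentations of the same input "AB".
candidates = [["AB"], ["A", "B"]]
best = min(candidates, key=lambda p: path_cost(p, word_cost, conn_cost))
```

In this toy setup the single-word path wins because its total is lower. Only the relative totals matter, so the sign of any individual cost has no special meaning.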
Thank you very much.
Thanks for the answer. This helps a lot.
First of all, thanks for the great database.
Motivation
I find some words in the data are assigned negative costs.
Costs are lower for more frequent words. But the examples above do not seem frequent enough to warrant such a low cost. I suspect this could be the result of an integer overflow or something of the sort.
Goal
I would like to know (1) whether this is a correct/intended result or a bug, and (2) if it is correct/intended, how negative costs should be interpreted.
Can someone help me with this?