takuyaa / kuromoji.js

JavaScript implementation of Japanese morphological analyzer

Not getting the same results as Kuromoji java #23

Closed by Citronelol 6 years ago

Citronelol commented 6 years ago

Hi,

I was trying to tokenize the following sentence:

第1条 この法人は、一般社団法人国際銀行協会(以下「本協会」という。)と称し、英文では、 International Bankers Association of Japanと記載する。

and the results differ between the Java version of kuromoji (with the IPADIC dictionary) and the tokenizer provided by kuromoji.js. In particular, the sequence 協会 is split into separate tokens in kuromoji.js.

I saw a closed issue (#16) stating this could be due to the Viterbi version of the tokenizer. Is there a way to disable it?

Many thanks in advance,

Best

DJTB commented 6 years ago

It appears to be a preference issue: it's matching both 協 and 会 as 接尾 (suffix) before the whole word.

Issue #16 is matching 人名 (name) for 研 and 究 before the whole word.

Perhaps the matching algorithm needs to favor longer tokens before splitting into finer matches.
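To make the preference point concrete, here is a toy sketch of minimum-cost segmentation, the general idea behind a Viterbi-style tokenizer. It is not kuromoji.js's actual code: the `segment` function, the `TOY_COSTS` table, and the flat `connectionCost` are all hypothetical, and the numbers are made up rather than taken from IPADIC. It only shows that the cheapest path wins, so a whole word can lose to two short tokens when their combined cost is lower.

```javascript
// Toy dictionary: surface form -> word cost.
// These costs are invented for illustration; they are NOT IPADIC values.
const TOY_COSTS = new Map([
  ["協", 3],
  ["会", 3],
  ["協会", 8],
]);

// Find the cheapest segmentation of `text` using dynamic programming.
// `connectionCost` is a flat penalty per token boundary, a simplified
// stand-in (assumption) for the per-pair connection costs a real
// dictionary would supply.
function segment(text, costs, connectionCost) {
  const n = text.length;
  // best[i] holds the cheapest known segmentation of text.slice(0, i),
  // or null if that prefix cannot be segmented with this dictionary.
  const best = new Array(n + 1).fill(null);
  best[0] = { cost: 0, tokens: [] };

  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      const prev = best[start];
      const word = text.slice(start, end);
      if (prev === null || !costs.has(word)) continue;

      // Only charge a boundary penalty when joining onto a previous token.
      const joinCost = prev.tokens.length > 0 ? connectionCost : 0;
      const cost = prev.cost + joinCost + costs.get(word);

      if (best[end] === null || cost < best[end].cost) {
        best[end] = { cost, tokens: prev.tokens.concat(word) };
      }
    }
  }
  return best[n];
}

// With no boundary penalty, 協 + 会 (3 + 3 = 6) beats 協会 (8).
console.log(segment("協会", TOY_COSTS, 0).tokens); // [ '協', '会' ]

// With a boundary penalty of 3, the split costs 9, so 協会 (8) wins.
console.log(segment("協会", TOY_COSTS, 3).tokens); // [ '協会' ]
```

In this toy model, "favoring longer tokens" amounts to adjusting the costs so that the whole-word path is cheaper, rather than switching off the lattice search.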

takuyaa commented 6 years ago

@Citronelol I released a fixed version, 0.1.2, and deployed the demo site: https://takuyaa.github.io/kuromoji.js/demo/tokenize.html FYI @DJTB

Citronelol commented 5 years ago

Thanks a lot!