Fixed false positive for "食べられる" (potential verb) by JapaneseBrokenExpression

hirokiky commented 3 years ago

JapaneseBrokenExpression caused false positives for potential verbs. It reported an error for "食べられる", "見られる", "寝られる", and so on.

What I did:

Added tests covering this case.
Fixed the code (JapaneseBrokenExpression only).
Added special-case handling.

About special case

The tokenizer parses "見れる" as a single token. This may depend on the Kuromoji dictionaries or similar.

This issue looks similar: https://github.com/takuyaa/kuromoji.js/issues/28

I think we need to add other special cases, if there are any.
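A minimal sketch of how such a special-case list might look, assuming the special case flags surfaces that the tokenizer returns as a single token. The class name, the set, and the method are hypothetical and are not taken from the RedPen code; only "見れる" comes from the discussion above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of RedPen.
class RaDropSpecialCases {
    // Surfaces the tokenizer is assumed to return as ONE token, so a check
    // that looks at neighbouring tokens never sees them.
    // Only "見れる" is mentioned in this PR; other entries would have to be
    // confirmed against real tokenizer output before being added.
    static final Set<String> SINGLE_TOKEN_RA_DROPPED = Set.of("見れる");

    // Returns true when the surface is one of the listed single-token forms.
    static boolean isSingleTokenRaDropped(String surface) {
        return SINGLE_TOKEN_RA_DROPPED.contains(surface);
    }

    public static void main(String[] args) {
        List<String> surfaces = Arrays.asList("見れる", "られる");
        for (String s : surfaces) {
            System.out.println(s + " -> " + isSingleTokenRaDropped(s));
        }
    }
}
```

Keeping the forms in one set means a newly discovered case is a one-line addition plus a test.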

About baseForm of tokens

It would be better to use the base form in this logic. The current check is:

q.getSurface().startsWith("られ")

The best way would be:

q.getBaseForm().equals("られる")

But to get the base form, we would need to change TokenElement and NeologdJapaneseTokenizer. It looks like the other validators don't use the base form of each token, and it is only needed for Japanese. So in this pull request, I avoided changing them.
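The difference between the two checks can be sketched as follows. The `Token` class here is a hypothetical stand-in with a `baseForm` field. It is not RedPen's TokenElement, which does not expose a base form in this pull request, and the second token in `main` is artificial rather than real tokenizer output.

```java
// Hypothetical sketch, not RedPen code.
class BaseFormSketch {
    // Stand-in token type carrying both the surface and the base form.
    static final class Token {
        final String surface;
        final String baseForm;

        Token(String surface, String baseForm) {
            this.surface = surface;
            this.baseForm = baseForm;
        }
    }

    // Surface-based check: accepts any token whose surface begins with "られ".
    static boolean isRareruBySurface(Token q) {
        return q.surface.startsWith("られ");
    }

    // Base-form check: accepts only tokens whose base form is exactly "られる".
    static boolean isRareruByBaseForm(Token q) {
        return "られる".equals(q.baseForm);
    }

    public static void main(String[] args) {
        // An inflected form of られる: both checks accept it.
        Token inflected = new Token("られ", "られる");
        // Artificial token that merely starts with "られ": only the surface check accepts it.
        Token lookalike = new Token("られX", "られX");

        System.out.println(isRareruBySurface(inflected) + " " + isRareruByBaseForm(inflected));
        System.out.println(isRareruBySurface(lookalike) + " " + isRareruByBaseForm(lookalike));
    }
}
```

The base-form comparison is exact, so it does not depend on how the surface happens to be inflected or prefixed.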

Screenshots

Before

Screenshot 2020-10-16 13-37-20

After

Screenshot 2020-10-16 13-37-54

Note

I'm not good at Java, so feel free to change my code and syntax as you like.

coveralls commented 3 years ago

Coverage increased (+0.003%) to 91.41% when pulling 4ac46d3ceef99e2c73fbc20d0594001bcbbc0622 on hirokiky:fix-jp-broken-ra into 4ea76e6c1364d6c3236a3b17662c48ae9e3a55d1 on redpen-cc:master.

norm-ideal commented 3 years ago

I am also working on this, and I agree with you that the best way is to introduce BaseForm information, even though only Japanese requires it. How about something like this?

https://github.com/norm-ideal/redpen/tree/ra-drop

takahi-i commented 3 years ago

LGTM! Thank you very much for the valuable contribution @hirokiky 🙏

hirokiky commented 3 years ago

Thanks for the quick review!
