stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Sometimes OpenIE does not work during Chinese processing #945

Closed hshsilver closed 5 years ago

hshsilver commented 5 years ago

The OpenIE module works so well in English that most sentences produce a result. In Chinese, however, OpenIE is often ineffective and returns no result at all. For example, here are the default sentence and its result in English and in Chinese:

The quick brown fox jumped over the lazy dog.

image

快速的棕色狐狸跳过了懒惰的狗。

image

The strangest part is that the Chinese text seems to yield nothing at all.

BTW, 1. For the sentence "有才华的鲁迅住在上海。", we can get the result below:

image

However, for the sentence "有才华的鲁迅住在日本仙台。", which has the same sentence structure, we get an incomplete result that loses the location "日本仙台".

image

2. For sentences that use "了" (aspect marker) to express the past, such as "李先生购买手机", and sentences that contain a classifier phrase such as "一部" (classifier modifier), for example "李先生购买一部手机", we get NOTHING.

image

image

For comparison, here is the simple sentence "李先生购买手机", without any aspect marker or classifier modifier:

image

So how can I get correct OpenIE results?

AngledLuffa commented 5 years ago

OpenIE isn't really intended for languages other than English. It might occasionally produce some results for other languages because of Universal Dependencies, but many of its rules are based on the text itself and don't work for non-English languages.


hshsilver commented 5 years ago


@AngledLuffa If I want to process Chinese, how can I add rules to support it? For example, can I first use the clause cutter to split sentences into simple clauses and then write some code myself, following Chinese dependency rules, to extract the relations? I am confused because there is little information about the clause cutter tools.
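The idea in this question (take a simple clause and read the relation off its dependency edges) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: `ChineseTripleSketch`, `Edge`, and `extract` are hypothetical names and not part of CoreNLP, the edges are hand-written rather than parser output, and the relation labels (`nsubj`, `obj`/`dobj`, `clf`) are assumed to follow Universal Dependencies conventions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChineseTripleSketch {
    /** One dependency edge, governor --relation--> dependent. Hypothetical type, not a CoreNLP class. */
    static final class Edge {
        final String governor;
        final String relation;
        final String dependent;

        Edge(String governor, String relation, String dependent) {
            this.governor = governor;
            this.relation = relation;
            this.dependent = dependent;
        }
    }

    /**
     * Returns "subject|verb|object" for every governor that has both an nsubj
     * dependent and an obj/dobj dependent. All other edges (aspect markers,
     * classifiers, and so on) are simply ignored.
     */
    static List<String> extract(List<Edge> edges) {
        List<String> triples = new ArrayList<>();
        for (Edge subj : edges) {
            if (!subj.relation.equals("nsubj")) {
                continue;
            }
            for (Edge obj : edges) {
                boolean isObject = obj.relation.equals("obj") || obj.relation.equals("dobj");
                if (isObject && obj.governor.equals(subj.governor)) {
                    triples.add(subj.dependent + "|" + subj.governor + "|" + obj.dependent);
                }
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        // Hand-written edges for "李先生购买一部手机"; a real parser may label them differently.
        // Words are used as node keys for brevity; real code should key on token indices.
        List<Edge> edges = Arrays.asList(
            new Edge("购买", "nsubj", "李先生"),
            new Edge("购买", "obj", "手机"),
            new Edge("手机", "clf", "一部"));
        System.out.println(extract(edges)); // prints [李先生|购买|手机]
    }
}
```

Because the rule only looks at subject and object edges, an extra aspect-marker or classifier edge does not block the match. Whether that holds on real input depends entirely on how the Chinese parser attaches those tokens.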

AngledLuffa commented 5 years ago

It is possible that if you replace enough of the semgrex expressions that make up the annotator, you'll be able to get good results for Chinese. However, there are a lot of rules which expect English grammar, and there are a lot of rules which use English words directly. For example, take this one in RelationTripleSegmenter.java, line 46:

add(SemgrexPattern.compile("{$}=object >/.subj(:pass)?/ {}=subject >/cop|aux(:pass)?/ {}=verb ?>case {}=prep " + NOT_OVER_NOT_WORD));

This pattern makes assumptions about English words via NOT_OVER_NOT_WORD (i.e., the match is not over a word similar to "not"). It also makes assumptions about English grammar: in English you say "Cats are very cute", but in Chinese you say (translated) "Cats very cute". So you'll have to go through each of these patterns and adjust it for the new language in order to get good results.
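To make the copula point concrete, here is a small self-contained sketch of why a rule that requires a cop/aux child finds nothing in a copula-less clause, and what a relaxed variant might look like. It does not use Semgrex or any CoreNLP class: `CopulaRuleSketch`, `Edge`, `strictRule`, and `relaxedRule` are hypothetical names, the edges are hand-written, and falling back to a default verb such as "是" is purely an illustrative design choice.

```java
import java.util.Arrays;
import java.util.List;

class CopulaRuleSketch {
    /** One dependency edge, governor --relation--> dependent. Hypothetical type, not a CoreNLP class. */
    static final class Edge {
        final String governor;
        final String relation;
        final String dependent;

        Edge(String governor, String relation, String dependent) {
            this.governor = governor;
            this.relation = relation;
            this.dependent = dependent;
        }
    }

    /** Returns the first dependent of governor whose relation starts with one of the prefixes, or null. */
    static String child(List<Edge> edges, String governor, String... relationPrefixes) {
        for (Edge e : edges) {
            if (!e.governor.equals(governor)) {
                continue;
            }
            for (String prefix : relationPrefixes) {
                if (e.relation.startsWith(prefix)) {
                    return e.dependent;
                }
            }
        }
        return null;
    }

    /** Loosely mimics the quoted pattern: needs a subject AND a cop/aux child of the predicate. */
    static String strictRule(List<Edge> edges, String predicate) {
        String subject = child(edges, predicate, "nsubj");
        String verb = child(edges, predicate, "cop", "aux");
        return (subject == null || verb == null) ? null : subject + "|" + verb + "|" + predicate;
    }

    /** Relaxed variant: the copula is optional and defaultVerb is used when it is absent. */
    static String relaxedRule(List<Edge> edges, String predicate, String defaultVerb) {
        String subject = child(edges, predicate, "nsubj");
        if (subject == null) {
            return null;
        }
        String verb = child(edges, predicate, "cop", "aux");
        return subject + "|" + (verb == null ? defaultVerb : verb) + "|" + predicate;
    }

    public static void main(String[] args) {
        // "Cats are very cute": the adjective governs a subject, a copula, and an adverb.
        List<Edge> english = Arrays.asList(
            new Edge("cute", "nsubj", "Cats"),
            new Edge("cute", "cop", "are"),
            new Edge("cute", "advmod", "very"));
        // "猫很可爱" ("Cats very cute"): same shape, but no copula edge at all.
        List<Edge> chinese = Arrays.asList(
            new Edge("可爱", "nsubj", "猫"),
            new Edge("可爱", "advmod", "很"));

        System.out.println(strictRule(english, "cute"));        // prints Cats|are|cute
        System.out.println(strictRule(chinese, "可爱"));         // prints null
        System.out.println(relaxedRule(chinese, "可爱", "是"));  // prints 猫|是|可爱
    }
}
```

In Semgrex terms, the analogous change would presumably be making the copula relation optional, much as the quoted pattern already does for `?>case`, but each pattern would still need to be checked against real Chinese parses.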
