stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

[QUESTION] Why doesn't Stanza annotate the demonstrative pronoun in Chinese? #310

Closed abhisheknovoic closed 4 years ago

abhisheknovoic commented 4 years ago

Hello,

Firstly, thanks for the amazing support for other languages in Stanza. I am using Stanza to parse Chinese sentences.

The sentence is

这是我妈妈的戒指

which in English means,

This is my mother's ring

I have a properties file which looks as follows:


annotators = tokenize, ssplit, pos, lemma, ner, parse, coref

outputFormat = json

# segment
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# sentence split
ssplit.boundaryTokenRegex = [.。]|[!?!?]+

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim.tagger

# ner
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = true
ner.useSUTime = false

# regexner
ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/gazetteers/cn_regexner_mapping.tab
ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE

# parse
parse.model = edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz

# depparse
depparse.model    = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
depparse.language = chinese

# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.algorithm = hybrid
coref.path.word2vec =
coref.language = zh
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
coref.print.md.log = false
coref.md.type = RULE
coref.md.liberalChineseMD = false

# kbp
kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
kbp.language = zh
kbp.model = none

# entitylink
entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz

I get the following dependency parse output:

id:1    text:这是 pos: AUX    xpos: VC    ufeats: None    lemma: 这是   head id: 5  head: 戒指    deprel: cop

id:2    text:我  pos: PRON   xpos: PRP   ufeats: Person=1    lemma: 我    head id: 3  head: 妈妈    deprel: nsubj

id:3    text:妈妈 pos: NOUN   xpos: NN    ufeats: None    lemma: 妈妈   head id: 5  head: 戒指    deprel: acl:relcl

id:4    text:的  pos: PART   xpos: DEC   ufeats: None    lemma: 的    head id: 3  head: 妈妈    deprel: mark:relcl

id:5    text:戒指 pos: NOUN   xpos: NN    ufeats: None    lemma: 戒指   head id: 0  head: root  deprel: root

As we can see, no demonstratives are annotated. Do I need to change the properties file to be able to see them? I expect the word 这 ("this") to be tagged as a demonstrative pronoun.

I see that PronType=Dem is supported in the UD annotations for Chinese. I may be looking at the wrong link - please correct me if I am wrong here.

https://universaldependencies.org/lzh/index.html

Again, thanks for your time and if I need to provide more information, please do let me know.
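For reference, one way to check programmatically whether a feature like PronType=Dem fires on any token. This is a minimal sketch; the `has_feature` helper is mine, not part of the stanza API, and the pipeline call assumes the default simplified-Chinese models have been downloaded with `stanza.download('zh')`.

```python
def has_feature(feats, name, value):
    """Check a UD feats string such as 'Person=1|PronType=Dem' for one feature.

    stanza exposes this string as word.feats; it is None when no features fire.
    """
    if not feats:
        return False
    pairs = dict(item.split('=', 1) for item in feats.split('|'))
    return pairs.get(name) == value

if __name__ == '__main__':
    import stanza  # heavy import deferred; requires stanza.download('zh') first
    nlp = stanza.Pipeline('zh', processors='tokenize,pos')
    doc = nlp('这是我妈妈的戒指')
    for word in doc.sentences[0].words:
        print(word.text, word.upos, word.xpos, word.feats,
              has_feature(word.feats, 'PronType', 'Dem'))
```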

AngledLuffa commented 4 years ago

Are you using the stanza models or the corenlp models to get this result?

AngledLuffa commented 4 years ago

Both stanza 1.0.1 and corenlp 4.0.0 segment 这 是 as two tokens, not one.

qipeng commented 4 years ago

@abhisheknovoic by the way, the website you linked is for Classical Chinese, not modern Chinese.

It does seem like demonstrative pronouns are not annotated yet in the Chinese data our models are trained on (even though, as @AngledLuffa pointed out, the segmentation should be 这 是 with our latest models). In the meantime you should be able to use the XPOS tag (PRD) as a feature for that; we (I) will try to fix the data upstream. Thanks for reporting!

abhisheknovoic commented 4 years ago

Hi @qipeng, sorry, I wanted to clarify this for my understanding. Firstly, I am using Stanza for this. Does this mean that Stanza is not annotating it yet and that a fix is planned for the future? Sorry, I am not clear on how I should interpret @AngledLuffa's response above.

Thanks much for your time! This is super helpful!

qipeng commented 4 years ago

@abhisheknovoic If you're using stanza.Pipeline for this, then the properties file (for stanza.server.CoreNLPClient) is actually irrelevant. And neither @AngledLuffa nor I could reproduce the segmentation you reported in the question with the latest CoreNLP or stanza models, so it's likely you're using something old (that's what @AngledLuffa was trying to point out, I think).

This is an upstream issue in the UD data we used to train these models, which I happen to maintain for Chinese. We'll fix the data and retrain the models in the future.

qipeng commented 4 years ago

@abhisheknovoic also for future reference: a minimal code sample would greatly help us diagnose the problem in future issue reports! 😆

AngledLuffa commented 4 years ago

My response above reflects the discovery that somewhere in this process, stanza is not being used as we expect, so you're not getting the result we expect on the text you provided. (For one thing, the corenlp properties are irrelevant if you are just using stanza models to process the data.) As is quickly becoming tradition, there isn't enough information available to recreate your result.

If I do this:

import stanza

stanza.download('zh')
nlp = stanza.Pipeline('zh')
doc = nlp('这是我妈妈的戒指')
for word in doc.sentences[0].words:
    print(word.text)

I get the following:

这
是
我
妈妈
的
戒指

so the segmentation resulting from running stanza does not agree with the segmentation you have reported above.

abhisheksgumadi commented 4 years ago

@AngledLuffa @qipeng ah, I got it.

Yes, I am using Stanza, but with the language zh-hant, not zh. Sorry, I don't know Chinese and I wasn't sure whether I should use zh or zh-hant.

Ok, point noted. Hopefully in the next version or later we will get to see the demonstrative. :)

Thanks to both of you. You are amazing!