nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.
https://parser.kitaev.io/
MIT License

Questioning a parse result #3

Closed: colingoldberg closed this issue 6 years ago

colingoldberg commented 6 years ago

I am not sure about the following - please clarify, or let me know if I should report a bug.

The text I am parsing: when a notification is received that a driver is available then update that fact in the database

Result from parse_string:

print(sent._.parse_string)
(FRAG (SBAR (WHADVP (WRB when)) (S (NP (DT a) (NN notification)) (VP (VBZ is) (VP (VBN received) (SBAR (IN that) (S (NP (DT a) (NN driver)) (VP (VBZ is) (ADJP (JJ available))))))))) (ADVP (RB then)) (VP (NN update) (NP (DT that) (NN fact)) (PP (IN in) (NP (DT the) (NN database)))))

In it, I see that the word 'update' is marked as NN. Should it not be a verb?

[The following added later, after further exploration]

A test script (below) produces a dict that is missing the entry for the word "update" - I think because no label was available for the word.

Please excuse the naivete of the test - I am a newcomer to Python as well.

import benepar
import spacy
from benepar.spacy_plugin import BeneparComponent

nlp = spacy.load('en')
nlp.add_pipe(BeneparComponent("benepar_en"))

doc = nlp("when a notification is received that a driver is available then update that fact in the database")
sent = list(doc.sents)[0]
print(sent._.parse_string)

def get_children(parent):
    # Recursively build a dict of labelled constituents under this span.
    ret_dict = {}
    if len(list(parent._.children)) > 0:
        for child in parent._.children:
            print(child)
            try:
                if len(list(child._.labels)) > 0:
                    lab = list(child._.labels)[0]
                    print(lab)
                    child_dict = {"label": lab, "text": str(child)}
                    gc = get_children(child)
                    child_dict["children"] = gc
                    ret_dict[lab] = child_dict
            except Exception as e:
                pass
    return ret_dict

gc = get_children(sent)
print(gc)

The following gc result was output:

{
  "SBAR": {
    "label": "SBAR",
    "text": "when a notification is received that a driver is available",
    "children": {
      "WHADVP": {"label": "WHADVP", "text": "when", "children": {}},
      "S": {
        "label": "S",
        "text": "a notification is received that a driver is available",
        "children": {
          "NP": {"label": "NP", "text": "a notification", "children": {}},
          "VP": {
            "label": "VP",
            "text": "is received that a driver is available",
            "children": {
              "VP": {
                "label": "VP",
                "text": "received that a driver is available",
                "children": {
                  "SBAR": {
                    "label": "SBAR",
                    "text": "that a driver is available",
                    "children": {
                      "S": {
                        "label": "S",
                        "text": "a driver is available",
                        "children": {
                          "NP": {"label": "NP", "text": "a driver", "children": {}},
                          "VP": {
                            "label": "VP",
                            "text": "is available",
                            "children": {
                              "ADJP": {"label": "ADJP", "text": "available", "children": {}}
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "ADVP": {"label": "ADVP", "text": "then", "children": {}},
  "VP": {
    "label": "VP",
    "text": "update that fact in the database",
    "children": {
      "NP": {"label": "NP", "text": "that fact", "children": {}},
      "PP": {
        "label": "PP",
        "text": "in the database",
        "children": {
          "NP": {"label": "NP", "text": "the database", "children": {}}
        }
      }
    }
  }
}

Note: the word "update" is missing.

Colin Goldberg

nikitakit commented 6 years ago

In response to your first point: you're completely right that "update" should be a verb; machine learning systems do make mistakes.

The situation here is a bit worse than usual because there's a verb phrase (VP) with no verb inside, which makes no sense. The reason this happens is that benepar doesn't actually do part-of-speech tagging: all tags come from spaCy (or NLTK).

benepar doesn't do its own tagging because several recent parsing papers have found that neural network parsers can work better if you don't consider part-of-speech tags at all. I'm thinking about how to address the parser/tagger mismatch, but that's still an area for future research.


In response to your edit: as currently implemented, span._.labels does not include part-of-speech tags, only constituent (span) labels. Part-of-speech tags in spaCy are accessible via token.tag_ (e.g. if len(span) == 1: tag = span[0].tag_). You can also look at the code I posted in #2. Also note that two children may have the same label, so your tree-to-dict conversion may lose nodes due to key collisions.
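
For illustration, here's a rough (untested) sketch of a tree-to-dict conversion that avoids both issues: it keeps children in a list, so repeated labels don't collide, and falls back to token.tag_ for single-token spans. It reuses the doc from your script; span_to_dict is just a name picked for this example:

def span_to_dict(span):
    # Use the constituent label if there is one; for a single-token span,
    # fall back to the spaCy part-of-speech tag (e.g. VB, NN).
    labels = list(span._.labels)
    if labels:
        label = labels[0]
    elif len(span) == 1:
        label = span[0].tag_
    else:
        label = None
    return {
        "label": label,
        "text": str(span),
        # A list preserves siblings that share the same label.
        "children": [span_to_dict(child) for child in span._.children],
    }

sent = list(doc.sents)[0]
print(span_to_dict(sent))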

colingoldberg commented 6 years ago

Thanks for the clarification, although I am not comfortable with "[ml] systems do make mistakes" - I do want to be able to rely on the output. It may be that I will submit only parts of sentences, depending on what I have to work with.

I added two tests, which may be a source of some insight:

1) "update that fact in the database" produced:
(S (VP (VB update) (NP (DT that) (NN fact)) (PP (IN in) (NP (DT the) (NN database)))))
i.e. "update" seen as a verb.

2) "when a notification is received that a driver is available, update that fact in the database" (i.e. replacing the word "then" with a comma) produced:
(S (SBAR (WHADVP (WRB when)) (S (NP (DT a) (NN notification)) (VP (VBZ is) (VP (VBN received) (SBAR (IN that) (S (NP (DT a) (NN driver)) (VP (VBZ is) (ADJP (JJ available))))))))) (, ,) (VP (VBP update) (NP (DT that) (NN fact)) (PP (IN in) (NP (DT the) (NN database)))))

Does this information give you anything to consider regarding this question?

Colin Goldberg

nikitakit commented 6 years ago

I completely understand that you're dissatisfied with getting the wrong outputs here.

Like I said before, the part-of-speech tags are provided entirely by spaCy. You can try for yourself by running the tagger without the constituency parser:

import spacy
nlp = spacy.load('en')
# Note: no BeneparComponent!
doc = nlp("when a notification is received that a driver is available then update that fact in the database")
print(doc[11], doc[11].tag_) # update NN

Your best bet may be to swap to a different part-of-speech tagger. Note that I'm not familiar with any of the code or training procedures for these models, so the spaCy team is likely more knowledgeable on this matter.

I did find that using the spaCy en_core_web_lg gave the correct tag. To install (~800 MB download):

python -m spacy download en_core_web_lg

Example:

import spacy
from benepar.spacy_plugin import BeneparComponent
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe(BeneparComponent("benepar_en"))
doc = nlp("when a notification is received that a driver is available then update that fact in the database")
print(doc[11], doc[11].tag_) # update VB

print(list(doc.sents)[0]._.parse_string)
# (S (SBAR (WHADVP (WRB when)) (S (NP (DT a) (NN notification)) (VP (VBZ is) (VP (VBN received) (SBAR (IN that) (S (NP (DT a) (NN driver)) (VP (VBZ is) (ADJP (JJ available))))))))) (ADVP (RB then)) (VP (VB update) (NP (DT that) (NN fact)) (PP (IN in) (NP (DT the) (NN database)))))

There's also a medium-size en_core_web_md model (115 MB download), which should fall somewhere between en and en_core_web_lg. I haven't tested how well its tagger does, though.
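
If you want to compare, something like this (untested; it assumes the en, en_core_web_md, and en_core_web_lg models are all installed) would print the tag each one assigns to "update":

import spacy

sentence = ("when a notification is received that a driver is available "
            "then update that fact in the database")

# "update" is token index 11 in this sentence.
for model_name in ['en', 'en_core_web_md', 'en_core_web_lg']:
    nlp = spacy.load(model_name)
    doc = nlp(sentence)
    print(model_name, doc[11].text, doc[11].tag_)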


Other options you can try:

Writing/training a part-of-speech tagger that works jointly with the parser is definitely something I'm thinking about, but like any research project it will require some time to carry out.

colingoldberg commented 6 years ago

Thank you for your thoughtful response - there's lots to consider. I am glad to see that a constituency parser is available via spaCy - my sense is that it is an important addition to the resources available.

nikitakit commented 6 years ago

Closing the issue; hopefully my last comments were able to help you get started on addressing the tagging issues you're seeing.

As a summary: currently the recommended way to get the best POS tags is to use the largest spaCy model, since larger spaCy models tend to perform better than smaller ones. The ability to swap out taggers also means that spaCy is likely a better option than NLTK: I am aware of cases where spaCy's tagger outperforms the default NLTK tagger, and I'm not familiar with NLTK taggers beyond the default one.

Parser-tagger consistency is a research direction I'll be looking at in the future, but at the moment I don't have anything new on that front.

nikitakit commented 5 years ago

Update as of v0.1.0: newer models such as benepar_en2 and benepar_en2_large now include an integrated part-of-speech tagger, which appears to be higher-accuracy than the current spaCy models and doesn't make a mistake for this particular example.
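
For example, something along these lines (assuming benepar_en2 has already been downloaded, e.g. via benepar.download('benepar_en2')) picks up the newer model in the spaCy pipeline; the tags in the resulting parse string then come from benepar's integrated tagger rather than from spaCy:

import benepar
import spacy
from benepar.spacy_plugin import BeneparComponent

# benepar.download('benepar_en2')  # one-time download of the newer model

nlp = spacy.load('en')
nlp.add_pipe(BeneparComponent("benepar_en2"))
doc = nlp("when a notification is received that a driver is available then update that fact in the database")
print(list(doc.sents)[0]._.parse_string)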