nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.
https://parser.kitaev.io/
MIT License

Tagging errors #20

Closed: bongbang closed this issue 5 years ago

bongbang commented 5 years ago

First, thank you for making this wonderful tool available. I can't say enough good things about it. Very impressive indeed.

That makes these bizarre tagger errors all the more surprising.

import spacy
spacy.__version__ # '2.1.3'
import benepar
benepar.download('benepar_en2')
from benepar.spacy_plugin import BeneparComponent
nlp = spacy.load('en_core_web_lg')
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer, first=True)

bnp = BeneparComponent("benepar_en2")
nlp.add_pipe(bnp, before='parser')
nlp.pipe_names # ['sentencizer', 'tagger', 'benepar', 'parser', 'ner']

text = "Justin O'Mara Brown (born April 16, 1982 in Fletcher, Oklahoma) is a former professional American, Canadian football and arena football defensive end. He was signed as an undrafted free agent by the Indianapolis Colts in 2005." # Wikipedia
docs = nlp(text, disable=['benepar'])
docb = nlp(text, disable=['tagger'])

for s, b in zip(docs, docb):
    if s.tag != b.tag:
        print(s.i, s.text, s.tag_, b.tag_)

Output:

18 American JJ NNP
20 Canadian JJ NNP
28 He PRP NNP

In all three, spaCy's tagger was right and BNP's was wrong. The third one was the most troubling. With all due respect to the famous mariner Zheng He, it's not very wise to guess that "He" at the beginning of a sentence is anything other than a pronoun.

When I read that "tagger in models such as benepar_en2 gives better results," I actually went out of my way to make sure that BNP's tags are used instead of spaCy's. Now I'm not so sure. Can you quantify or describe how BNP's tagging is better, please? I don't have time to do an exhaustive test, so your input will carry a lot of weight. How do you think benepar_en2 compares with spaCy 2.1.3's en_core_web_lg for tagging?

Again, I emphasize that these tagging errors in no way detract from the project, which to me is about parsing, and BNP is excellent at that.

nikitakit commented 5 years ago

Thanks for pointing this out! I'm sorry that I did not see your post earlier.

I've confirmed the example you sent on my machine. It's unfortunate that the model is making a mistake in this case.

When it comes to taggers, every model is imperfect in its own way. I first started to pay attention to this after seeing a consistent pattern of errors when handling imperatives. This was brought to my attention by several users who were trying to use the parser on natural language commands, which seems like a desirable use-case to support out of the box. The NLTK default tagger is notoriously bad at this, to the point where I can't in good faith recommend using the NLTK integration without upgrading to benepar_en2. SpaCy is actually quite a bit better, though not perfect -- for example, my copy of en_core_web_lg tags "lower" as an adverb in the phrase "lower the blinds." Funnily enough, en_core_web_sm seems to get this right, but it has errors for different imperatives. (I'm using spaCy 2.0.12 -- given how fast the field moves I wouldn't be surprised if newer versions of spaCy have gotten better.)
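If you want to check the imperative behavior on your own install, a minimal sketch along these lines should do it (assuming both en_core_web_lg and en_core_web_sm are downloaded; the tags you see can vary across spaCy versions):

import spacy

# Compare how the two spaCy models tag an imperative like "lower the blinds".
for model_name in ("en_core_web_lg", "en_core_web_sm"):
    nlp_check = spacy.load(model_name)
    doc = nlp_check("lower the blinds")
    print(model_name, [(t.text, t.tag_) for t in doc])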

In terms of accuracy numbers, before I released the benepar_en2 model I did a quick benchmark on WSJ sections 22-24 (part of the Penn Treebank). I'm getting a tagging accuracy of 95.96% with en_core_web_lg and 97.30% with benepar_en2. Of course, actual accuracy can vary depending on the application. Disclaimer: I haven't made sure that these numbers are comparable to any past published work.
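Tagging accuracy here just means token-level agreement with the gold Penn Treebank tags. A trivial sketch of the computation (gold_tags and predicted_tags are hypothetical lists of POS tags aligned token by token, not the exact evaluation script):

def tagging_accuracy(gold_tags, predicted_tags):
    # Percentage of tokens whose predicted POS tag matches the gold tag.
    assert len(gold_tags) == len(predicted_tags)
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)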

Another thing to note is that you're comparing en_core_web_lg (the largest of the spaCy models) with benepar_en2 (the smaller of the two English parser models). If you use benepar_en2_large instead, all three mismatches above go away; a sketch of the swap is below.
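For reference, switching to the larger model only means changing the model name in the setup from your report (benepar_en2_large has to be downloaded first):

import spacy
import benepar
from benepar.spacy_plugin import BeneparComponent

benepar.download('benepar_en2_large')  # one-time download of the larger model
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe(nlp.create_pipe("sentencizer"), first=True)
nlp.add_pipe(BeneparComponent("benepar_en2_large"), before='parser')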

Hope this helps!