MaltParser: annotation issue

rzanoli commented 10 years ago

We tried to analyze the following sentence with MaltParser: "I live in an apartment." and we obtained this analysis:

1 I I PP
2 live live VVP 1 dep
3 in in IN 2 prep
4 an an DT 5 det
5 appartment appartment NN 3 pobj
6 . . SENT 1 dep

The analysis seems to be wrong (live should be the root), and so do the analyses of a number of similar examples that we tried. We also ran the examples in MaltParserEnTest that ship with EOP and obtained the same results:

I -dep-> live
appartment -det-> an
in -pobj-> appartment
live -prep-> in

Do you see the same errors? Could there be a mismatch between the MaltParser model that we are using and the part-of-speech tags that we provide to it (i.e. the model was trained with a different part-of-speech tagset)?

gilnoh commented 10 years ago

I confirm the error. (I have also observed such cases previously, but never raised an issue...) MaltParser should parse the sentence correctly. As Roberto correctly conjectured, the problem lies in two different versions of the Penn Treebank English tagset.

These are the POS tags that MaltParser expects (the English model provided by MaltParser, trained on the WSJ part of the Penn Treebank corpus):

INFO: Model expects [46] postags: # $ '' ( ) , . : CC CD DT EX FW IN JJ JJR JJS LS MD NN NNP NNPS NNS PDT POS PRP PRP$ PRT RB RBR RBS RP SYM TO UH VB VBD VBG VBN VBP VBZ WDT WP WP$ WRB ``
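A mismatch like this can be caught before parsing by checking the tagger output against the tag list the model reports. The following is only a minimal sketch: the class and method names (TagsetCheck, unexpected) are hypothetical and not part of EOP, DKPro or MaltParser; the tag list itself is copied from the INFO line above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of EOP, DKPro or MaltParser.
class TagsetCheck {
    // The 46 POS tags copied from the model's INFO line.
    static final Set<String> EXPECTED = new HashSet<>(Arrays.asList(
        "#", "$", "''", "(", ")", ",", ".", ":",
        "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS",
        "MD", "NN", "NNP", "NNPS", "NNS", "PDT", "POS", "PRP", "PRP$",
        "PRT", "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH",
        "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
        "WDT", "WP", "WP$", "WRB", "``"));

    // Returns, in input order, the tags that the model does not know.
    static List<String> unexpected(List<String> tags) {
        List<String> unknown = new ArrayList<>();
        for (String tag : tags) {
            if (!EXPECTED.contains(tag)) {
                unknown.add(tag);
            }
        }
        return unknown;
    }
}
```

For the tags of the sentence reported in this issue (PP, VVP, IN, DT, NN, SENT), such a check would flag PP, VVP and SENT as unknown to the model.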

However, TreeTagger outputs different tags for verbs, such as VVP (for "live" in the case above), and this causes the strange parser behavior. (TreeTagger tags are more "detailed"; e.g. they distinguish BE, HAVE and other verbs.)

For example, TreeTagger output looks like this:
I/PP live/VVP in/IN a/DT house/NN ./.

But MaltParser expects something like this:
I/PRP live/VBP in/IN a/DT house/NN ./.
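The difference between the two outputs can be bridged by rewriting the tags before they reach the parser. The sketch below is a partial, hypothetical mapping (the class TreeTaggerToPenn is not an existing EOP or DKPro class, and it is not the DKPro PosMapper). It only covers the tags that, to my understanding, differ between the English TreeTagger tagset and the Penn tagset; the NP/NPS and VH* entries are assumptions that go beyond the examples in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, partial mapping; not the DKPro PosMapper.
class TreeTaggerToPenn {
    static final Map<String, String> DIRECT = new HashMap<>();
    static {
        DIRECT.put("PP", "PRP");    // personal pronoun
        DIRECT.put("PP$", "PRP$");  // possessive pronoun
        DIRECT.put("NP", "NNP");    // proper noun, singular (assumption)
        DIRECT.put("NPS", "NNPS");  // proper noun, plural (assumption)
        DIRECT.put("SENT", ".");    // sentence-final punctuation
    }

    // Maps one TreeTagger tag to a Penn tag; unknown tags pass through unchanged.
    static String toPenn(String tag) {
        String direct = DIRECT.get(tag);
        if (direct != null) {
            return direct;
        }
        // VV* (other verbs) and VH* (forms of HAVE) collapse onto Penn VB*,
        // keeping the suffix: VVP -> VBP, VHZ -> VBZ, VV -> VB.
        if (tag.startsWith("VV") || tag.startsWith("VH")) {
            return "VB" + tag.substring(2);
        }
        return tag;
    }
}
```

With this mapping, I/PP live/VVP becomes I/PRP live/VBP, while tags shared by both tagsets such as IN, DT and NN are left alone.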

I hadn't realized this :-( ... Here's the TreeTagger tagset (58 tags):
https://courses.washington.edu/hypertxt/csar-v02/penntable.html

Note that the verb tags differ from the Penn Treebank tags annotated in the parser's training data. (The model was trained on the Penn Treebank WSJ sections, which use the following POS tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html )

The MaltParser model uses the "raw output" of the underlying tagger (not the mapped CAS POS type), which means we can't combine the default TreeTagger English models with the MaltParser English models. (For the German models this issue does not arise, since we trained a German model on TreeTagger output ...)

One quick solution for now is to use the OpenNLP POS tagger, since it outputs the (pure) Penn Treebank POS tagset.

Currently, the MaltParser pipeline uses: "open NLP tokenizer", "treetagger", "maltparser". When we change the POS tagger from "treetagger" to "openNLPPosTagger", the tagger DOES output what MaltParser expects, and the problem goes away (e.g. "live" in the above example becomes "ROOT", and so on). I checked this with a small example test case.

One serious problem with this solution is that the OpenNLP tagger does not add lemmas, but we do want lemmas for this particular (MaltParser-based) pipeline. So I will try to find a solution (e.g. let TreeTagger add only lemmas, not POS tags) and let you know.

(If you are in a hurry, just change line 57 of MaltParserEN to use "OpenNlpPosTagger.class" instead of "TreeTaggerPosLemmaTT4J.class". Then MaltParser will work correctly.)

Thanks for spotting this error and reporting it as an issue! I will try to fix this ASAP.

reckart commented 10 years ago

DKPro Core contains a POS mapper class that we initially built exactly to map TreeTagger tags to standard Penn tags: PosMapper.

In my experience, the standard OpenNLP models for English do not yield POS tags as good as those of the TreeTagger models. If you need an alternative to TreeTagger, consider trying the StanfordPosTagger (mind DKPro Core issue 392; if you can, I would strongly recommend upgrading to the just-released DKPro Core 1.6.1).

gilnoh commented 10 years ago

Dear Richard,

Thanks a lot for your comment. What you have pointed out, PosMapper, is exactly what we need. I also fully agree with your recommendation to update DKPro. However, due to other issues, updating DKPro to the latest version is now scheduled for August. For now, I will provide a simple workaround (something simpler than PosMapper) for the EOP MaltParser pipeline. ...

gilnoh commented 10 years ago

I applied a temporary patch (#437), based on the OpenNLP POS tagger, to make the English MaltParser pipeline work correctly. Once we update DKPro, I will change this to use PosMapper (work item #436).

gilnoh commented 10 years ago

It works correctly for now (by relying on the POS tags of OpenNLP, while lemmas are added by TreeTagger), so we are closing the issue for now. A better solution will follow after #436.
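The idea behind the workaround, POS tags from one tagger and lemmas from another over the same tokens, can be illustrated as follows. This is only a sketch under stated assumptions: the class MergeAnnotations is hypothetical, it is not the code of patch #437, and it assumes both taggers ran over the same tokenization so that the lists align by index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the code of patch #437.
class MergeAnnotations {
    // Combines tokens, Penn POS tags (e.g. from OpenNLP) and lemmas
    // (e.g. from TreeTagger) into tab-separated rows: ID, FORM, LEMMA, POS.
    static List<String> merge(List<String> forms, List<String> pennPos, List<String> lemmas) {
        if (forms.size() != pennPos.size() || forms.size() != lemmas.size()) {
            throw new IllegalArgumentException("token, POS and lemma lists must have the same length");
        }
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < forms.size(); i++) {
            rows.add((i + 1) + "\t" + forms.get(i) + "\t" + lemmas.get(i) + "\t" + pennPos.get(i));
        }
        return rows;
    }
}
```

The length check matters because the two annotation layers are only comparable when they share one tokenizer, as they do in the pipeline described above.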

hltfbk / Excitement-Open-Platform

MaltParser: annotation issue #431