Problems with IndexedWord word() and value()

stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.

http://stanfordnlp.github.io/CoreNLP/

GNU General Public License v3.0

9.66k stars 2.7k forks source link

Problems with IndexedWord word() and value() #49

Open annefried opened 9 years ago

annefried commented 9 years ago

Hi, in the context of dkpro's StanfordCoreferenceResolver, I found the following problem: Stanford CoreNLP (v3.4.1) seems to plan to make changes at IndexedWord: word() and value() both exist but according to a comment, should be unified at some time.

Details:

StanfordCoreferenceResolver creates the collapsed dependencies this way: ParserAnnotatorUtils.fillInParseAnnotations(false, true, gsf, sentence, treeCopy);

Dcoref's Document.java makes use of the function getNodeByWordPattern of SemanticGraph, which in turn uses w.word(). This does not seem to be set by fillInParseAnnotations.

value() is set, however, so I preliminarily fixed the problem by adding the following right after fillInParseAnnotations in StanfordCoreferenceResolver.

SemanticGraph deps = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class); for (IndexedWord vertex : deps.vertexSet()) { vertex.setWord(vertex.value()); }

The problem should be fixed in StanfordCoreNLP, however.

manning commented 9 years ago

Yes, there have traditionally been various bugs because of the separation of word() and value() - even though there were some good reasons for it.

But have you tried to reproduce this on v3.5 or the current master? I just added code to the test of StanfordCoreNLPITest:test(), based on your code above:

  // check that dependency graph Labels have word()                                                                                               
  SemanticGraph deps = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
  for (IndexedWord vertex : deps.vertexSet()) {
    Assert.assertNotNull(vertex.word());
    Assert.assertEquals(vertex.word(), vertex.value());
  }

That code passes for me. So maybe this is already fixed? If not can you provide a test case that fails?

(I realize that you may have good compatibility reasons for staying off Java 8 at the moment, but I suspect right now we don't have the energy to release a v3.4.2 unless more or more serious problems are found....)

reckart commented 9 years ago

I have reproduced the problem with the trunk version of DKPro Core's StanfordCoreferenceResolver which uses CoreNLP 3.5.0.

We use transform the UIMA CAS representation of the required annotations to the CoreNLP representation prior to invoking the MentionExtractor. The procedure is approximately as follows:

convert tokens
convert parse tree using LabeledScoredTreeFactory and TreeUtils.createStanfordTree
convert sentences
re-generate dependencies from the parse tree using a GrammaticalStructureFactory obtained from the TreebankLanguagePack and finalizing the conversion using ParserAnnotatorUtils.fillInParseAnnotations.

In this process, value() and word() are not set to the same value. I added the code suggested by Anne as a workaround.

So I guess you are saying that word() and value() should be set to the same value by DKPro Core when the tokens are converted from UIMA to CoreNLP? Currently, we only set

originalText
word
beginPosition
endPosition
lemma (optional)
tag (optional)

reckart commented 9 years ago

I tried an alternative fix in DKPro Core by calling setValue(token-text) when converting the UIMA token to the CoreNLP token (actually CoreLabel) and removing the fix that Anne suggested - that alternative fix, however, doesn't work. Not sure where/why the "value" gets lost in the process mentioned above.