stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

OpenIE fails for some sentences #187

Closed naoya-i closed 8 years ago

naoya-i commented 8 years ago

Hi,

I am using Stanford OpenIE (http://stanfordnlp.github.io/CoreNLP/openie.html) to extract triples from the Gigaword corpus. I invoke the "edu.stanford.nlp.naturalli.OpenIE" class from the Stanford CoreNLP jar files as follows:

$ echo "John was born in the US." | java -mx1g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar edu.stanford.nlp.naturalli.OpenIE

However, some sentences from the Gigaword corpus crash Stanford OpenIE, as follows:

$ echo "In the meantime the only road in and out of the city crosses a Bosnian Serb checkpoint." | java -mx1g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar edu.stanford.nlp.naturalli.OpenIE
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.8 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ...
PreComputed 100000, Elapsed Time: 1.606 (s)
Initializing dependency parser done [4.6 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator natlog
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator openie
Loading clause searcher from edu/stanford/nlp/models/naturalli/clauseSearcherModel.ser.gz...done [0.90 seconds]
Processing from stdin. Enter one sentence per line.
Exception in thread "main" java.util.NoSuchElementException: No value present
       at java.util.Optional.get(Optional.java:135)
       at edu.stanford.nlp.naturalli.RelationTripleSegmenter.extract(RelationTripleSegmenter.java:282)
       at edu.stanford.nlp.naturalli.OpenIE.annotateSentence(OpenIE.java:485)
       at edu.stanford.nlp.naturalli.OpenIE.lambda$annotate$3(OpenIE.java:554)
       at edu.stanford.nlp.naturalli.OpenIE$$Lambda$24/1197365356.accept(Unknown Source)
       at java.util.ArrayList.forEach(ArrayList.java:1249)
       at edu.stanford.nlp.naturalli.OpenIE.annotate(OpenIE.java:554)
       at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:71)
       at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:499)
       at edu.stanford.nlp.naturalli.OpenIE.processDocument(OpenIE.java:630)
       at edu.stanford.nlp.naturalli.OpenIE.main(OpenIE.java:736)

So far, I have not been able to find any pattern in the sentences that trigger this Java exception. For reference, I have also pasted nine other sentences that cause it.

Of course, it would be great if the error were fixed. An even better solution, in my opinion, would be for Stanford OpenIE to support an "-ignore-errors" option like the one implemented in Ollie, the University of Washington's OpenIE system (https://knowitall.github.io/ollie/). The "-ignore-errors" option makes the software more error-tolerant, allowing it to skip a sentence that causes an error and move on to the next one. This would be extremely useful when parsing a large file.
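In the meantime, the requested skip-on-error behavior can be approximated on the caller's side by catching per-sentence exceptions. A minimal, self-contained sketch of the pattern (the process method below is a stand-in for the real OpenIE pipeline call, and the "checkpoint" trigger only simulates a failing sentence):

```java
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

public class SkipOnError {
    // Stand-in for the real per-sentence extraction; throws on a "bad" sentence,
    // simulating the NoSuchElementException seen in the stack trace above.
    static String process(String sentence) {
        if (sentence.contains("checkpoint")) {
            throw new NoSuchElementException("No value present");
        }
        return "(" + sentence.split(" ")[0] + "; ...)";
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
            "John was born in the US.",
            "In the meantime the only road crosses a Bosnian Serb checkpoint.",
            "Obama was president.");
        int ok = 0, skipped = 0;
        for (String s : sentences) {
            try {
                System.out.println(process(s));
                ok++;
            } catch (RuntimeException e) {
                // Log and move on instead of letting one sentence kill the whole run.
                System.err.println("skipping sentence: " + e.getMessage());
                skipped++;
            }
        }
        System.out.println(ok + " processed, " + skipped + " skipped");
        // prints "2 processed, 1 skipped"
    }
}
```

The same try/catch wrapper would go around the pipeline's per-sentence annotate call when driving CoreNLP programmatically, rather than through the OpenIE main method.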

gangeli commented 8 years ago

Thanks for reporting this! Are you using the official release (3.6.0), or the GitHub HEAD version of the code? I remember fixing a similar error a while ago, and corenlp.run doesn't crash on this sentence, so hopefully it's the same bug. If you're not already on it, you can build the GitHub code with ant jar and use the resulting jar file instead of the official release. I think the models should be the same, but if they're not, there's a link to download the most recent models on the project homepage.
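The suggested build from GitHub HEAD would look roughly like the commands below. The built jar's name is an assumption (check the repository's build output for the exact file); the models jar stays the 3.6.0 release one.

```shell
# Clone the current GitHub code and build it with ant
git clone https://github.com/stanfordnlp/CoreNLP.git
cd CoreNLP
ant jar

# Then put the freshly built jar (name assumed here) ahead of the release jar,
# keeping the 3.6.0 models jar on the classpath:
echo "John was born in the US." | java -mx1g \
  -cp javanlp-core.jar:stanford-corenlp-3.6.0-models.jar \
  edu.stanford.nlp.naturalli.OpenIE
```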

The lack of an -ignore-errors flag is actually kind of deliberate. I'd like to hold OpenIE to a standard of never crashing (after all, the rest of CoreNLP doesn't crash either), and therefore any exception should be treated as a critical bug that should be fixed quickly.

naoya-i commented 8 years ago

Thanks for your reply! I had only tried the official release (3.6.0) at that time, so I have now tried the GitHub version. Fortunately, the GitHub version did not crash on any of the sentences I mentioned. For the time being, I will work with this version.

> The lack of an -ignore-errors flag is actually kind of deliberate. I'd like to hold OpenIE to a standard of never crashing (after all, the rest of CoreNLP doesn't crash either), and therefore any exception should be treated as a critical bug that should be fixed quickly.

OK, I understand the philosophy behind CoreNLP ;-) If I encounter any other problems, I'll come back again!

Thanks!

Naoya
