louismullie / stanford-core-nlp

Ruby bindings to the Stanford Core NLP tools (English, French, German).

Instructions on how to add custom classes to the pipeline #21

Closed jure closed 11 years ago

jure commented 11 years ago

For example, if I'd like to use the TimeAnnotator in a pipeline, how do I go about that? I've loaded the TimeAnnotator class with:

StanfordCoreNLP.load_class('TimeAnnotator', 'edu.stanford.nlp.time') 

But trying to load it in a pipeline gives me this:

pipeline =  StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref, :time)
Exception `IllegalArgumentException' at /home/deploy/apps/app/shared/bundle/ruby/1.9.1/gems/stanford-core-nlp-0.5.1/lib/stanford-core-nlp.rb:188 - No annotator named time

This is probably a pretty common use case, so it would be nice if there were instructions in the readme. I can add them in a PR, once I know how it's done.

Update: For clarity, what I'm trying to do is extract temporal information from text and I'm more interested in phrases than tokens.

louismullie commented 11 years ago

Hey @jure,

From the list up on Stanford Core NLP's website, it seems that the "time" annotator is not supported. However,

StanfordCoreNLP includes SUTime, Stanford's temporal expression recognizer. SUTime is transparently called from the "ner" annotator, so no configuration is necessary. Furthermore, the "cleanxml" annotator now extracts the reference date for a given XML document, so relative dates, e.g., "yesterday", are transparently normalized with no configuration necessary.

Does that seem to be what you're looking for?

jure commented 11 years ago

Thanks for taking the time to answer me, @louismullie! I do get time annotations by using the "ner" annotator, but I was hoping for some speedup from using TimeAnnotator directly, i.e. operating on phrases rather than tokens. It seems my understanding was insufficient, though, as I now find that it all goes through at least the tokenize, ssplit, parse and lemma annotators. For the text I'd like to process (tweets), it takes roughly 0.4 seconds per tweet to extract time information, which is too slow for my use case.

Benchmark.measure { 100.times { pipeline.annotate(text) } }
38.400000   0.000000  38.400000 ( 38.086564)
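The per-tweet figure follows from dividing the measured wall-clock time by the number of iterations. A minimal sketch of that arithmetic using Ruby's standard `Benchmark.realtime`; the `average_seconds` helper is a hypothetical name, and the block passed to it merely stands in for `pipeline.annotate(text)`:

```ruby
require 'benchmark'

# Hypothetical helper: runs the given block `iterations` times and returns
# the average wall-clock seconds per call.
def average_seconds(iterations)
  total = Benchmark.realtime { iterations.times { yield } }
  total / iterations
end

# Stand-in workload; in the thread this would be pipeline.annotate(text).
per_call = average_seconds(100) { 'some tweet text'.split }
puts format('%.6f s per call', per_call)

# The measurement reported above: 38.086564 s of wall-clock time over 100 calls.
puts format('%.3f s per tweet', 38.086564 / 100) # prints 0.381 s per tweet
```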

I also don't know how to set the document date from the stanford-core-nlp gem (it's set in edu.stanford.nlp.ling.CoreAnnotations.DocDateAnnotation), so the TIMEX3 annotations I get are all relative and not really human-readable (e.g. XXXX-WXX-5).

I guess what I really want is a fast, native Ruby temporal extraction library. Given that SUTime is a regex-based extractor, maybe the best way would be to port it to Ruby. But that's really no issue of yours :) I'm closing this.

louismullie commented 11 years ago

I would suggest you have a look at chronic, kronic and nickel - all are supported through Treat.

jure commented 11 years ago

I'm currently using nickel and it is indeed fast, but it's very inaccurate compared to SUTime. Nickel has far more false positives, whereas I have yet to find a wrongly extracted date with SUTime.

Do you know how I could set the current DocDate with the stanford-core-nlp gem, so that the TIMEX3 values would be evaluated based on that and show real dates whenever possible?

louismullie commented 11 years ago

Are you able to send me a Java snippet on how that's done in Core NLP? I can't seem to find any doc.

jure commented 11 years ago

This method from SUTimeMain.java might shed some light on the matter:

public static Annotation textToAnnotation(AnnotationPipeline pipeline, String text, String date)
  {
    Annotation annotation = new Annotation(text);
    annotation.set(CoreAnnotations.DocDateAnnotation.class, date);
    pipeline.annotate(annotation);
    return annotation;
  }

I think the date format is:

requiredDocDateFormat = "yyyy-MM-dd";

jure commented 11 years ago

@louismullie I figured it out, at least when using JRuby:

text.set(Java::Edu.stanford.nlp.ling.CoreAnnotations::DocDateAnnotation, '2013-07-02') 
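The document date is passed as a plain string in the `yyyy-MM-dd` format mentioned earlier in the thread. As a small sketch (the `doc_date_string` helper is a hypothetical name, not part of the gem), a Ruby `Date`, `Time` or date-like string can be normalized to that format before being handed to `text.set`:

```ruby
require 'date'

# Hypothetical helper: returns the value formatted as yyyy-MM-dd, which is
# '%Y-%m-%d' in Ruby's strftime notation. Strings are parsed with Date.parse.
def doc_date_string(value)
  date = value.is_a?(String) ? Date.parse(value) : value
  date.strftime('%Y-%m-%d')
end

puts doc_date_string(Date.new(2013, 7, 2)) # prints 2013-07-02
puts doc_date_string('2 July 2013')        # prints 2013-07-02

# Under JRuby, the result could then be used as in the snippet above, e.g.:
#   text.set(Java::Edu.stanford.nlp.ling.CoreAnnotations::DocDateAnnotation,
#            doc_date_string(Date.today))
```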

With regard to performance, it's actually not that slow for small bodies of text. I ran a few benchmarks against https://github.com/lzell/nickel with this code:

require 'stanford-core-nlp'
require 'jruby/profiler'
require 'nickel'
StanfordCoreNLP.jar_path = 'vendor/stanford-core-nlp-minimal/'
StanfordCoreNLP.model_path = 'vendor/stanford-core-nlp-minimal/'

pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :ner)

text1 = 'Measure your #metrics. Understand them and optimize them! Today at start:Cloud http://t.co/jZXNCGQfKS'
annotation1 = StanfordCoreNLP::Annotation.new(text1)
text2 = "George W. Bush: Snowden damaged U.S. While he was in office, Bush set up a plan -- the President's Emergency Program for AIDS Relief -- that made a massive investment in antiretroviral drugs and dramatically reduced the number of AIDS deaths in Africa. 'I'm really proud of the American people for their generosity,' he told CNN in an exclusive interview. 'I wish Americans knew how many lives were saved. Someday, they will.' CNN exclusive: George W. Bush on AIDS, Mandela, Snowden Bush got his own hands dirty working on the refurbishment of the clinic -- a sign, he said, of his commitment to the cause. 'I'm here to serve and I believe strongly that with power and wealth comes a duty to serve the least,' he said. 'Our purpose is to elevate the need for screening for cervical cancer throughout the continent of Africa.' The renovated clinic opened Monday as a cervical cancer screening and treatment center, and the Bushes hope it will help save the lives of thousands of women. 'It breaks your heart to realize that such hope was given to communities throughout the continent of Africa because of antiretrovirals and then women are dying of cervical cancer -- so there's hope and then there's despondency,' George Bush said. 'We wanted to help make sure that despondency didn't settle in.' Bush at his best Bush's harshest critics might be surprised by this side of the former president. His shirt splattered with paint from the project, he appeared genuinely happy about doing manual labor. He was joking, jovial and genuinely happy to be there. Bush was relaxed as he spoke with CNN, saying that after an extended period out of the limelight he feels like a tortoise sticking its head out of its shell. The former president is aware that his legacy will be tied to the Iraq War, but he wants people to know what he is doing in Africa, too. 'History will judge' In his comments, Bush also touched on the subject of Nelson Mandela, who is on life support in a South African hospital. 
'Sometimes, there are leaders who come and go. His legacy will last for a long time,' he said of the ailing anti-apartheid icon. Mandela had criticized him publicly about the war in Iraq, Bush said he doesn't bear a grudge. 'He wasn't the only guy,' he said. 'It's OK. I made decisions that were the right decisions. History will ultimately judge. I never held someone's opinion against him; I didn't look at him differently because he didn't agree with me on an issue.' Bush also initially said he wasn't bothered about his ratings in opinion polls, even if some of them now put him at a similar level to Obama. 'The only time I really cared was on Election Day,' he said. Then, drawing laughter from his wife, he checked himself and said, 'You know, I guess it's nice. I mean, let me rephrase that: Thank you for bringing it up.' In any case, the former president said he doesn't expect a fair assessment of his legacy in his lifetime. 'I won't be around, because it will take a while for the objective historians to show up,' he said. 'So I'm pretty comfortable with it. I did what I did; I know the spirit in which I did it.'"
annotation2 = StanfordCoreNLP::Annotation.new(text2)

profile_data0 = JRuby::Profiler.profile do
  100.times{pipeline.annotate(annotation1)}
end

profile_data1 = JRuby::Profiler.profile do
  100.times{Nickel.parse(text1)}
end

profile_data2 = JRuby::Profiler.profile do
  100.times{pipeline.annotate(annotation2)}
end

profile_data3 = JRuby::Profiler.profile do
  100.times{Nickel.parse(text2)}
end

profile_printer0 = JRuby::Profiler::FlatProfilePrinter.new(profile_data0)
profile_printer0.printProfile(STDOUT)

profile_printer1 = JRuby::Profiler::FlatProfilePrinter.new(profile_data1)
profile_printer1.printProfile(STDOUT)

profile_printer2 = JRuby::Profiler::FlatProfilePrinter.new(profile_data2)
profile_printer2.printProfile(STDOUT)

profile_printer3 = JRuby::Profiler::FlatProfilePrinter.new(profile_data3)
profile_printer3.printProfile(STDOUT)

And the results:

Total time: 5.49 (stanford-core-nlp, short text)

     total        self    children       calls  method
----------------------------------------------------------------
      5.48        5.48        0.00         100  Java::EduStanfordNlpPipeline::AnnotationPipeline#annotate

Total time: 2.59 (nickel, short text)

     total        self    children       calls  method
----------------------------------------------------------------
      2.59        0.00        2.58         100  Nickel.parse

Total time: 72.56 (stanford-core-nlp, long text)

     total        self    children       calls  method
----------------------------------------------------------------
     72.56       72.56        0.00         100  Java::EduStanfordNlpPipeline::AnnotationPipeline#annotate

Total time: 12.65 (nickel, long text)

     total        self    children       calls  method
----------------------------------------------------------------
     12.65        0.00       12.65         100  Nickel.parse
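
Dividing each total by the 100 calls gives the per-call cost, and the ratio of the totals shows how much slower the CoreNLP pipeline was than nickel on each input. A small sketch of that arithmetic, using only the totals reported above (the hash layout and names are illustrative):

```ruby
CALLS = 100

# Total seconds copied from the profiler output above.
TOTALS = {
  short: { corenlp: 5.49,  nickel: 2.59 },
  long:  { corenlp: 72.56, nickel: 12.65 }
}.freeze

# Returns, per input text, the per-call seconds for each library and the
# CoreNLP-to-nickel ratio of total time.
def summarize(totals, calls)
  totals.transform_values do |t|
    {
      corenlp_per_call: t[:corenlp] / calls,
      nickel_per_call:  t[:nickel] / calls,
      ratio:            t[:corenlp] / t[:nickel]
    }
  end
end

summarize(TOTALS, CALLS).each do |text, s|
  puts format('%-5s corenlp %.4f s/call, nickel %.4f s/call, ratio %.2fx',
              text, s[:corenlp_per_call], s[:nickel_per_call], s[:ratio])
end
```

On these numbers the pipeline takes about 0.05 s per short text and 0.73 s per long text, roughly 2.1 and 5.7 times nickel's time respectively.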