stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Is there any way to speed up CoreNLP's `Sentence.sentiment()` method? #1091

Closed: cnowak7 closed this issue 4 years ago

cnowak7 commented 4 years ago

I'm pretty new to using this Java wrapper, so I'll do my best to describe what's going on.

In my build.gradle, I have the following:

"edu.stanford.nlp:stanford-corenlp:4.0.0",
"edu.stanford.nlp:stanford-corenlp:4.0.0:models"

We're using this wrapper library to get the sentiment of a large number of phrases. Each phrase always has exactly 4 words. We start by using Sentence.java to initialize a Sentence for each phrase, like so:

new Sentence(phrase);

Then we attempt to validate sentiment of each phrase with the following:

boolean isValid = !sentence.sentiment().isNegative() && !sentence.sentiment().isExtreme();

This line is our culprit: each call takes anywhere from 17 milliseconds to over 100 milliseconds. With larger data sets, this could be problematic for total execution time even with multithreading.

We've noticed that initializing via new Sentence(phrase); gives it the following default properties defined in Sentence.java:

static Properties SINGLE_SENTENCE_DOCUMENT = PropertiesUtils.asProperties(
          "language", "english",
          "ssplit.isOneSentence", "true",
          "tokenize.class", "PTBTokenizer",
          "tokenize.language", "en",
          "mention.type", "dep",
          "coref.mode", "statistical",  // Use the new coref
          "coref.md.type", "dep"
  );

We've also noticed that running Sentence.sentiment() eventually invokes the ParserAnnotator(String annotatorName, Properties props) constructor in ParserAnnotator.java, which does the following:

String model = props.getProperty(annotatorName + ".model", LexicalizedParser.DEFAULT_PARSER_LOC);

where DEFAULT_PARSER_LOC is edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz. By the time the line of code above is executed, the props object has one extra property, parse.binaryTrees, set to true.

I've read that there's another parser model called englishSR.ser.gz. Would this be a faster alternative in my case? If so, how would I override the default parser englishPCFG.ser.gz? Any other tips on how we might optimize the execution time of getting sentiment for large data sets of phrases?

AngledLuffa commented 4 years ago

Yes, in order to speed things up you will need to use the SR parser. You can pass extra properties to a Sentence object using the alternate constructor:

public Sentence(String text, Properties props)

I think (but am not 100% sure) that it will work if you set the parse.model property.
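If setting parse.model does work, the override might look something like the sketch below. This builds only the Properties object (plain java.util.Properties, which is what CoreNLP accepts); the Sentence call itself is shown as a comment, and whether the Simple API actually honors parse.model passed this way is, as noted above, not 100% certain. The resource path assumes the SR parser model is on the classpath.

```java
import java.util.Properties;

public class SentimentConfig {
    // Build a Properties object that overrides the default parser model.
    // The resource path assumes the SR parser model jar is on the classpath.
    public static Properties srParserProps() {
        Properties props = new Properties();
        props.setProperty("ssplit.isOneSentence", "true");
        props.setProperty("parse.model",
                "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
        return props;
    }

    public static void main(String[] args) {
        Properties props = srParserProps();
        // With CoreNLP on the classpath, the sentence would then be built as:
        //   Sentence sentence = new Sentence(phrase, props);
        System.out.println(props.getProperty("parse.model"));
    }
}
```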


cnowak7 commented 4 years ago

@AngledLuffa thanks for the suggestion! Unfortunately, englishSR is not provided by that Gradle dependency. I've downloaded the English (extra) models jar and tried adding it to my project, but I'm getting a file-not-found exception when running sentiment() after initializing a sentence like so:

new Sentence(
    phrase,
    PropertiesUtils.asProperties(
        "language", "english",
        "ssplit.isOneSentence", "true",
        "tokenize.class", "PTBTokenizer",
        "tokenize.language", "en",
        "mention.type", "dep",
        "coref.mode", "statistical",  // Use the new coref
        "coref.md.type", "dep",
//                "parse.binaryTrees", "true",
        "parse.model", "/Users/myUserName/Downloads/stanford-english-corenlp-models-current.jar!/edu/stanford/nlp/models/srparser/englishSR.ser.gz"
    )
);

The exception is thrown from the ParserGrammar.loadModel(String path, String... extraFlags) method.

AngledLuffa commented 4 years ago

I don't know anything about Gradle, so I have no idea if that's the correct syntax for changing the properties. I do know that if you have the appropriate jar on your classpath, you would not need the !/ syntax; you should be able to just use

edu/stanford/nlp/models/srparser/englishSR.ser.gz

I do have a little bit of doubt about this approach, given that these are all 4-word phrases. If you already know the structure of the phrases, you might have better luck mocking up the parse trees yourself. Alternatively, you may get better results from the CNN sentiment classifier in stanfordnlp/stanza, if using Python is an option.

cnowak7 commented 4 years ago

@AngledLuffa Thanks a ton for your input! I've extracted the englishSR file and put it into my project, and it's noticeably faster!

cnowak7 commented 4 years ago

@AngledLuffa So now we're constructing Sentences with the following code:

new Sentence(
    phrase,
    PropertiesUtils.asProperties(
        "language", "english",
        "ssplit.isOneSentence", "true",
        "tokenize.class", "PTBTokenizer",
        "tokenize.language", "en",
        "mention.type", "dep",
        "coref.mode", "statistical",  // Use the new coref
        "coref.md.type", "dep",
        "parse.binaryTrees", "true",
        "parse.model", "englishSR.ser.gz"
    )
);

After running a comparison between the default parser (englishPCFG) and englishSR on 1 million phrases, englishSR significantly outperforms englishPCFG (~23 minutes vs. ~89 minutes for our project and data set). However, while memory usage with englishPCFG is steady and stable, englishSR is slightly worrisome in this regard.
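For context, those totals work out to the rough per-phrase costs below (simple arithmetic on the numbers above; wall-clock figures, so with 4 threads the per-call latency would be about 4x these values):

```java
public class Throughput {
    public static void main(String[] args) {
        long phrases = 1_000_000L;
        double srMs = 23 * 60_000.0;   // ~23 minutes total, in milliseconds
        double pcfgMs = 89 * 60_000.0; // ~89 minutes total, in milliseconds
        System.out.printf("SR:   %.2f ms/phrase%n", srMs / phrases);   // ~1.38
        System.out.printf("PCFG: %.2f ms/phrase%n", pcfgMs / phrases); // ~5.34
    }
}
```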

Are there any properties/options we can use to optimize memory utilization or is this expected? See our memory trend comparisons below (for context, both results are from 4 threads simultaneously performing sentiment analysis):

englishPCFG with 1 million phrases:

[memory trend screenshot: stanfordEnglishPCFG-1millionPhrases]

englishSR with 1 million phrases:

[memory trend screenshot: stanfordEnglishSR-1millionPhrases]

AngledLuffa commented 4 years ago

The SR parser just uses more memory for its model. There may be room to optimize how often and how much it allocates per query, but I haven't looked at that in a while.

You will probably get better results by allocating the Properties object only once, FWIW.
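That suggestion might look like the sketch below: build the configuration once in a static field and hand the same instance to every Sentence, instead of calling PropertiesUtils.asProperties per phrase. This uses plain java.util.Properties (which is what the Sentence constructor takes); the CoreNLP call is shown as a comment, and thread-safety of sharing one instance rests on CoreNLP only reading the properties, which is an assumption here.

```java
import java.util.Properties;

public class SentimentProps {
    // Build the shared configuration exactly once, not once per phrase.
    private static final Properties SHARED = build();

    private static Properties build() {
        Properties props = new Properties();
        props.setProperty("language", "english");
        props.setProperty("ssplit.isOneSentence", "true");
        props.setProperty("tokenize.language", "en");
        props.setProperty("parse.binaryTrees", "true");
        props.setProperty("parse.model", "englishSR.ser.gz");
        return props;
    }

    public static Properties shared() {
        return SHARED;
    }

    public static void main(String[] args) {
        // Every phrase would then reuse the same instance:
        //   Sentence sentence = new Sentence(phrase, SentimentProps.shared());
        System.out.println(SentimentProps.shared() == SentimentProps.shared());
    }
}
```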

cnowak7 commented 4 years ago

@AngledLuffa sentiment analysis in our project seems to be running around 74% faster - this was a huge help - thanks! Closing this issue.