stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Retrieve phrase based sentiment from Stanford Core NLP #465

Closed denzilc closed 7 years ago

denzilc commented 7 years ago

The Stanford Core NLP online demo gives a very nice visualization of phrase-based sentiment. Every phrase in the parse tree has a sentiment. For their standard test example:

This movie doesn't care about cleverness, wit or any other kind of intelligent humor.

You can see that the phrase "doesn't care about cleverness, wit or any other kind of intelligent humor" is marked as negative (77%), while the phrase "cleverness, wit or any other kind of intelligent humor" is marked as positive (76%). You can also get this information in JSON format from the website. I haven't found a way to get it from the API. Can I get such fine-grained, phrase-level sentiment from Stanford Core NLP?

Currently, I am using the Stanford Core NLP server with the following properties:

props = {"annotators": "tokenize,ssplit,pos,parse,sentiment", 'outputFormat': 'json'}
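For reference, the server takes these properties as a URL-encoded JSON string in the "properties" query parameter, with the text to annotate as the POST body. A minimal sketch of building such a URL in Java follows; the class and method names are hypothetical, and http://localhost:9000 is an assumption about where the server is running.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SentimentRequestUrl {
  // Appends the properties JSON, URL-encoded, as the "properties" query parameter.
  // Uses URLEncoder.encode(String, Charset), which needs Java 10 or later.
  static String build(String baseUrl, String propertiesJson) {
    return baseUrl + "/?properties=" + URLEncoder.encode(propertiesJson, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String props =
        "{\"annotators\":\"tokenize,ssplit,pos,parse,sentiment\",\"outputFormat\":\"json\"}";
    // Assumption: the server listens on localhost:9000; adjust to your setup.
    System.out.println(build("http://localhost:9000", props));
  }
}
```

The resulting URL can then be used for a POST request whose body is the raw text.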

loretoparisi commented 7 years ago

@denzilc I have asked the same question. It seems that the default sentiment annotator does not return this kind of info, only the sentiment and sentimentValue keys for each sentence, like here:

{
sentimentValue: "2",
sentiment: "Neutral"
}

What you want are the score distribution values, as asked here. These should be available through edu.stanford.nlp.pipeline.JSONOutputter and edu.stanford.nlp.neural.rnn.RNNCoreAnnotations, but I haven't tried it yet.

muety commented 7 years ago

Are there any plans to include this information in the JSON output of the CoreNLP server at some point? I'd love to have the PROBABILITIES tree from SentimentPipeline encoded as JSON and returned by the server for the sentiment annotator.

J38 commented 7 years ago

So currently you can get the distribution of label scores for the whole sentence. I am going to add the sentiment tree to the json output (it is available in the "text" output already), and I'll add the probability of the prediction at each node. The tree is just going to be a string representation though.

muety commented 7 years ago

String representation is fine for now, thank you. Are you aware of any parsers for these sentiment tree strings?

J38 commented 7 years ago

This is an example of the output (now available with current GitHub code):

(ROOT|sentiment=1|prob=0.715 (NP|sentiment=2|prob=0.988 (DT|sentiment=2|prob=0.998 This) (NN|sentiment=2|prob=0.998 movie)) (@S|sentiment=1|prob=0.797 (VP|sentiment=1|prob=0.730 (@VP|sentiment=1|prob=0.932 (VBZ|sentiment=2|prob=0.997 does) (RB|sentiment=2|prob=0.994 n't)) (VP|sentiment=3|prob=0.504 (VB|sentiment=3|prob=0.962 care) (PP|sentiment=3|prob=0.727 (IN|sentiment=2|prob=0.991 about) (NP|sentiment=3|prob=0.750 (@NP|sentiment=3|prob=0.700 (@NP|sentiment=3|prob=0.798 (@NP|sentiment=3|prob=0.602 (NP|sentiment=3|prob=0.805 cleverness) (,|sentiment=2|prob=0.997 ,)) (NP|sentiment=2|prob=0.986 wit)) (CC|sentiment=2|prob=0.991 or)) (NP|sentiment=3|prob=0.616 (NP|sentiment=2|prob=0.963 (DT|sentiment=2|prob=0.995 any) (@NP|sentiment=2|prob=0.980 (JJ|sentiment=2|prob=0.998 other) (NN|sentiment=3|prob=0.983 kind))) (PP|sentiment=3|prob=0.541 (IN|sentiment=2|prob=0.993 of) (NP|sentiment=3|prob=0.744 (JJ|sentiment=3|prob=0.943 intelligent) (NN|sentiment=4|prob=0.845 humor)))))))) (.|sentiment=2|prob=0.997 .)))

J38 commented 7 years ago

That should be available in the json output with the sentimentTree key.
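For clients that cannot pull in CoreNLP itself, a hand-rolled reader for this string is one option. The following is only a sketch in plain Java, assuming the node format shown above ("(TAG|sentiment=N|prob=P ...)" with whitespace-separated children and decimal points in the probabilities). SentimentTreeParser and Phrase are hypothetical names, not part of CoreNLP, and there is no error handling for malformed input.

```java
import java.util.ArrayList;
import java.util.List;

class SentimentTreeParser {
  /** One subtree: the words it spans, its predicted class, and that class's probability. */
  static final class Phrase {
    final String text;
    final int sentiment;
    final double prob;

    Phrase(String text, int sentiment, double prob) {
      this.text = text;
      this.sentiment = sentiment;
      this.prob = prob;
    }
  }

  private final String input;
  private int pos = 0;
  private final List<Phrase> phrases = new ArrayList<>();

  private SentimentTreeParser(String input) {
    this.input = input;
  }

  /** Returns one Phrase per node, children before their parent. */
  static List<Phrase> parse(String tree) {
    SentimentTreeParser parser = new SentimentTreeParser(tree);
    parser.skipWhitespace();
    parser.readNode();
    return parser.phrases;
  }

  // Reads "(label child child ...)" or "(label word)" and returns the words it spans.
  private String readNode() {
    pos++; // consume '('
    String[] label = readToken().split("\\|"); // [tag, "sentiment=N", "prob=P"]
    skipWhitespace();
    StringBuilder text = new StringBuilder();
    while (input.charAt(pos) != ')') {
      String part = input.charAt(pos) == '(' ? readNode() : readToken();
      if (text.length() > 0) text.append(' ');
      text.append(part);
      skipWhitespace();
    }
    pos++; // consume ')'
    int sentiment = Integer.parseInt(label[1].substring("sentiment=".length()));
    double prob = Double.parseDouble(label[2].substring("prob=".length()));
    phrases.add(new Phrase(text.toString(), sentiment, prob));
    return text.toString();
  }

  // Reads characters up to the next whitespace or parenthesis.
  private String readToken() {
    int start = pos;
    while (pos < input.length()) {
      char c = input.charAt(pos);
      if (Character.isWhitespace(c) || c == '(' || c == ')') break;
      pos++;
    }
    return input.substring(start, pos);
  }

  private void skipWhitespace() {
    while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
  }

  public static void main(String[] args) {
    // A subtree taken from the example output above.
    String tree = "(NP|sentiment=3|prob=0.744 (JJ|sentiment=3|prob=0.943 intelligent)"
        + " (NN|sentiment=4|prob=0.845 humor))";
    for (Phrase p : parse(tree)) {
      System.out.println(p.text + " -> sentiment " + p.sentiment + ", prob " + p.prob);
    }
  }
}
```

The phrase text is the tokens joined by single spaces, so contractions come out as separate tokens (for example "does n't").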

J38 commented 7 years ago

You can generate Stanford CoreNLP Tree objects with string representations. If you take that string and make it into a Tree, I believe the labels of the nodes will be of the form "label|sentiment=Sentiment|prob=Sentiment Probability", except for the leaves which will just have the word value for the label.

Consider this code example (based on main() in https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/trees/PennTreeReader.java):

      TreeFactory tf = new LabeledScoredTreeFactory();
      Reader r = new StringReader("string representation of tree");
      TreeReader tr = new PennTreeReader(r, tf);
      Tree t = tr.readTree();
      while (t != null) {
        System.out.println(t);
        System.out.println();
        t = tr.readTree();
      }
      r.close();

J38 commented 7 years ago

Here is a snippet going through the children of the root node and printing out their labels. There is a lot of code in Stanford CoreNLP for iterating through trees:

      for (Tree subTree : t.children()) {
        System.err.println(subTree.label());
      }

Note the label will be of this form: NP|sentiment=2|prob=0.988
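To turn such a label into usable values, one possible approach is a small regular expression. This is a sketch only: SentimentLabel is a hypothetical helper, not a CoreNLP class, and it assumes the exact "tag|sentiment=N|prob=P" form above with a decimal point in the probability. Leaf labels, which are plain words, simply do not match.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentimentLabel {
  // tag, then "|sentiment=<int>", then "|prob=<decimal with a point>"
  private static final Pattern FORMAT =
      Pattern.compile("(.+)\\|sentiment=(\\d+)\\|prob=(\\d+(?:\\.\\d+)?)");

  final String tag;
  final int sentiment;
  final double prob;

  private SentimentLabel(String tag, int sentiment, double prob) {
    this.tag = tag;
    this.sentiment = sentiment;
    this.prob = prob;
  }

  /** Returns the parsed label, or empty if the string is not in the expected form (e.g. a leaf word). */
  static Optional<SentimentLabel> parse(String label) {
    Matcher m = FORMAT.matcher(label);
    if (!m.matches()) {
      return Optional.empty();
    }
    return Optional.of(new SentimentLabel(
        m.group(1), Integer.parseInt(m.group(2)), Double.parseDouble(m.group(3))));
  }

  public static void main(String[] args) {
    SentimentLabel.parse("NP|sentiment=2|prob=0.988").ifPresent(l ->
        System.out.println(l.tag + " " + l.sentiment + " " + l.prob));
    System.out.println(SentimentLabel.parse("movie").isPresent()); // a leaf word does not match
  }
}
```

Inside the loop above, the string to pass in would be subTree.label().value().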

J38 commented 7 years ago

If anyone thinks I should alter this output, please let me know; I am open to changing it.

J38 commented 7 years ago

This is currently the full sentiment output in the json:

      "sentimentValue": "1",
      "sentiment": "Negative",
      "sentimentDistribution": [
        0.16713578785867,
        0.71513114699161,
        0.09327640121561,
        0.0167989726291,
        0.00765769130501
      ],
      "sentimentTree": "(ROOT|sentiment=1|prob=0.715\n  (NP|sentiment=2|prob=0.988 (DT|sentiment=2|prob=0.998 This) (NN|sentiment=2|prob=0.998 movie))\n  (@S|sentiment=1|prob=0.797\n    (VP|sentiment=1|prob=0.730\n      (@VP|sentiment=1|prob=0.932 (VBZ|sentiment=2|prob=0.997 does) (RB|sentiment=2|prob=0.994 n't))\n      (VP|sentiment=3|prob=0.504 (VB|sentiment=3|prob=0.962 care)\n        (PP|sentiment=3|prob=0.727 (IN|sentiment=2|prob=0.991 about)\n          (NP|sentiment=3|prob=0.750\n            (@NP|sentiment=3|prob=0.700\n              (@NP|sentiment=3|prob=0.798\n                (@NP|sentiment=3|prob=0.602 (NP|sentiment=3|prob=0.805 cleverness) (,|sentiment=2|prob=0.997 ,))\n                (NP|sentiment=2|prob=0.986 wit))\n              (CC|sentiment=2|prob=0.991 or))\n            (NP|sentiment=3|prob=0.616\n              (NP|sentiment=2|prob=0.963 (DT|sentiment=2|prob=0.995 any)\n                (@NP|sentiment=2|prob=0.980 (JJ|sentiment=2|prob=0.998 other) (NN|sentiment=3|prob=0.983 kind)))\n              (PP|sentiment=3|prob=0.541 (IN|sentiment=2|prob=0.993 of)\n                (NP|sentiment=3|prob=0.744 (JJ|sentiment=3|prob=0.943 intelligent) (NN|sentiment=4|prob=0.845 humor))))))))\n    (.|sentiment=2|prob=0.997 .)))\n"
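In this output, sentimentValue is consistent with being the index of the largest entry in sentimentDistribution, and the prob at the ROOT node matches that entry. A small sketch of reading the distribution that way follows. The class name is hypothetical, and only the names for classes 1 and 2 ("Negative", "Neutral") appear in this thread; the other three label strings are assumptions.

```java
import java.util.Locale;

class SentimentArgmax {
  // Index -> class name. "Negative" (1) and "Neutral" (2) are taken from the thread;
  // the remaining names are assumed.
  static final String[] LABELS = {
    "Very negative", "Negative", "Neutral", "Positive", "Very positive"
  };

  /** Returns the index of the largest value (the first one on ties). */
  static int argmax(double[] distribution) {
    int best = 0;
    for (int i = 1; i < distribution.length; i++) {
      if (distribution[i] > distribution[best]) {
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // The sentimentDistribution values from the JSON above.
    double[] dist = {
      0.16713578785867, 0.71513114699161, 0.09327640121561, 0.0167989726291, 0.00765769130501
    };
    int k = argmax(dist);
    // Locale.ROOT keeps the decimal point regardless of the machine's locale.
    System.out.println(String.format(Locale.ROOT, "%d %s %.3f", k, LABELS[k], dist[k]));
    // prints "1 Negative 0.715"
  }
}
```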

muety commented 7 years ago

Wow, that was fast! Thanks man ☺️

loretoparisi commented 7 years ago

@J38 that was great 💯

loretoparisi commented 7 years ago

@J38 A question: would it be possible to return the sentimentTree as a JSON structure, like the one shown earlier here, instead of a string? That would avoid additional, potentially error-prone parsing, and the API output would be more "standard".

Thanks.

ksteimel commented 6 years ago

Sorry to be a bother. I was just wondering whether this output is available when running CoreNLP from the command line, or whether it requires the standard Java API.

Edit: Sorry, I didn't realize my version of CoreNLP was out of date. I've upgraded and now this works very well.

alsora commented 6 years ago

@J38 I have just updated CoreNLP to version 3.9.1 and I noticed that the numbers in the sentiment distribution and in the sentiment tree are formatted incorrectly.

"sentimentDistribution": [

    0,10110301471115,
    0,62425688378713,
    0,22854585072979,
    0,03641152898384,
    0,00968272178809
  ]

As you can see, decimal numbers are written with a comma instead of a point. The same happens in the prob values of the sentiment tree.

This could be caused by the locale setting on my laptop (Italian). Do you know how I could avoid this problem? Where in the code could the StringWriter pick up this setting?

Thank you

balbinavr commented 5 years ago

@alsora I am having the same problem. How did you finally solve this?

loretoparisi commented 5 years ago

@balbinavr this is due to the locale settings Java starts up with, not to CoreNLP. The easiest fix is to pass the JVM options "-Duser.language=en -Duser.country=US" when launching Java.
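The symptom is what locale-sensitive number formatting in Java looks like in general. The sketch below is not CoreNLP code; it only illustrates how the same value is rendered under an Italian and a US locale, and shows Locale.setDefault as a programmatic counterpart to the JVM options. The class and method names are hypothetical.

```java
import java.util.Locale;

class LocaleDecimalDemo {
  /** Formats a value with three decimals using the given locale's decimal separator. */
  static String format(Locale locale, double value) {
    return String.format(locale, "%.3f", value);
  }

  public static void main(String[] args) {
    System.out.println(format(Locale.ITALY, 0.625)); // prints "0,625"
    System.out.println(format(Locale.US, 0.625));    // prints "0.625"

    // In-process counterpart of -Duser.language=en -Duser.country=US:
    // set the default locale before any output is produced.
    Locale.setDefault(Locale.US);
    System.out.println(String.format("%.3f", 0.625)); // prints "0.625"
  }
}
```

Setting the JVM options is usually the less intrusive choice, since it needs no code change.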