stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

WSD issues resulting in bad lemmatization/PoS tag sequence #1381

Open stevenbedrick opened 1 year ago

stevenbedrick commented 1 year ago

Hello! I am running into a word-sense disambiguation issue where CoreNLP seems to be systematically struggling with adverbs that share a surface form with different words. For example, "number" (as in, "the number five") and "number" (as in, "his right side was number than his left"). In both sentences, CoreNLP interprets the token "number" as the noun "number" (NN); in the second sentence, it should be tagging it as an adverbial form of the adjective "numb" (RBR). The lemmatizer is also mapping "number"/RBR to "number" rather than "numb", which seems like it may be part of the issue. And, of course, since the wrong tag ends up being assigned, any downstream annotation is wrong as well (dependencies, etc.).

I've experimented a bit with different syntactic constructions, and have not yet managed to successfully find a formulation that does get CoreNLP to tag "number" as an RBR instead of an NN.

Obviously, the tagger is fundamentally a statistical model and it's gonna do what it's gonna do, but on the other hand this isn't a particularly odd word, nor is it syntactically ambiguous, so I thought I'd see if anybody else had run into this sort of issue or if there was something I could do to change the tagger's behavior. Thanks in advance for any insight you may be able to provide!

AngledLuffa commented 1 year ago

I'd like to push back a little on the idea that number_RBR is a common usage. Of the 1053 usages of number in our training data, 1052 are number_NN, and the last one is Revolution Number_NNP 9, which I believe should also be NN based on the latest tagging guidelines. Also, it should be JJR, not RBR, right? In your example, numb in "his right side is numb_JJ" is clearly an adjective.

It's hard to even come up with examples that make sense. Nevertheless, on the walk to work I came up with a few. If we add these to the training data, the models might pick up that number is sometimes an adjective... but it's going to be drowning in over 1000 examples of nouns, so I'm not sure it will make much difference.

As my arthritis gets worse, my thigh gets number_JJR
Cocaine makes my lips number_JJR than meth
The only time I felt number_JJR was when I rubbed one out three times in a row

proposed parses for these:

( (S
   (SBAR
     (IN As)
     (S
       (NP (PRP$ my) (NN arthritis))
       (VP
         (VBZ gets)
         (ADJP (JJR worse)))))
   (, ,)
   (NP (PRP$ my) (NN thigh))
   (VP
     (VBZ gets)
     (ADJP (JJR number)))))

( (S
   (NP (NN Cocaine))
   (VP
     (VBZ makes)
     (S
       (NP (PRP$ my) (NNS lips))
       (ADJP
         (ADJP (JJR number))
         (PP
           (IN than)
           (NP (NN meth))))))))

( (S
   (NP
     (NP (DT The) (JJ only) (NN time))
     (SBAR
       (S
         (NP (PRP I))
         (VP
           (VBD felt)
           (ADJP (JJR number))))))
   (VP
     (VBD was)
     (SBAR
       (WHADVP (WRB when))
       (S
         (NP (PRP I))
         (VP
           (VBD rubbed)
           (NP (NN one))
           (PRT (RP out))
           (NP
             (NP (CD three) (NNS times))
             (PP
               (IN in)
               (NP (DT a) (NN row))))))))))
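
For anyone who wants to check the tag sequence these bracketed trees encode, here is a minimal sketch that pulls the word/tag pairs out of a tree string using only the Java standard library. The class name `TreeLeavesSketch` and its method are hypothetical helpers written for this illustration; they are not part of CoreNLP, which has its own tree-reading classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not CoreNLP code): lists word_TAG pairs from a bracketed tree. */
final class TreeLeavesSketch {

    // A preterminal looks like "(TAG word)": two tokens with no nested parentheses.
    private static final Pattern PRETERMINAL =
            Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    /** Returns the leaves of the tree, left to right, formatted as word_TAG. */
    static List<String> taggedWords(String tree) {
        List<String> out = new ArrayList<>();
        Matcher m = PRETERMINAL.matcher(tree);
        while (m.find()) {
            out.add(m.group(2) + "_" + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // The second proposed parse from this comment, flattened onto one line.
        String tree = "( (S (NP (NN Cocaine)) (VP (VBZ makes) (S (NP (PRP$ my) (NNS lips)) "
                + "(ADJP (ADJP (JJR number)) (PP (IN than) (NP (NN meth))))))))";
        System.out.println(String.join(" ", taggedWords(tree)));
    }
}
```

The regular expression only looks at preterminals, so it does not validate bracket balance; it is just a quick way to see that number ends up as JJR in the proposed trees.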

AngledLuffa commented 1 year ago

also, as a followup, the CoreNLP lemmatizer already properly handles number_JJR

stevenbedrick commented 1 year ago

First, thanks so much for the speedy and thoughtful reply!

Second, I think you are totally right that it should be JJR not RBR (that was my mistake, apologies), and that is indeed quite a difference in statistical distribution within the training data, so I'm not surprised that the tagger struggles. However, for whatever reason, I am seeing different behavior from you. On corenlp.run, for example, it's definitely not labeling things as JJR and is lemmatizing as "number":

[Screenshot: Screen Shot 2023-08-01 at 1 36 58 PM]

And running locally I'm getting the same result.

AngledLuffa commented 1 year ago

Oh, I wasn't clear. The tagger currently does not give number any tag other than NN, since the overwhelming majority of the training examples have that tag. What we can do is add a few more examples in which the tag is JJR, and retrain the models (which may take a while), and then perhaps it will use that tag instead. I'm not super confident, though, considering how many NN examples there are.
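
To put a rough number on that imbalance, here is a small sketch that computes the relative frequency of each tag for number, using the counts quoted earlier in the thread (1052 NN, 1 NNP) plus the three new JJR sentences. This is only a context-free estimate in a hypothetical class `TagPriorSketch`; the real tagger also looks at the surrounding words, so treat it as an illustration of scale, not a prediction of what retraining will do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy relative-frequency estimate for the tags of one word form. Not CoreNLP code. */
final class TagPriorSketch {

    /** Fraction of all occurrences that carry the given tag (0.0 if there are none). */
    static double relativeFrequency(Map<String, Integer> tagCounts, String tag) {
        int total = 0;
        for (int count : tagCounts.values()) {
            total += count;
        }
        if (total == 0) {
            return 0.0;
        }
        return tagCounts.getOrDefault(tag, 0) / (double) total;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("NN", 1052); // count quoted in the thread
        counts.put("NNP", 1);   // "Revolution Number 9"
        counts.put("JJR", 3);   // assumption: the three new sentences are added once each
        for (String tag : counts.keySet()) {
            System.out.printf("P(%s | number) ~ %.4f%n", tag, relativeFrequency(counts, tag));
        }
    }
}
```

With these counts the JJR share is 3 out of 1056, or roughly 0.3%, which is why a handful of extra examples may not be enough on their own.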

stevenbedrick commented 1 year ago

Aha! I understand, thanks for the clarification. Those proposed parses look fine to me; I agree that it's not likely to overcome that degree of word-sense imbalance in the data but it certainly can't hurt to include a few bonus examples for the PoS tagger.

stevenbedrick commented 1 year ago

Oh and also, now I'm confused by your comment "also, as a followup, the CoreNLP lemmatizer already properly handles number_JJR" - as best as I can tell, it definitely is not handling that scenario, and is lemmatizing it to "number_NN".

AngledLuffa commented 1 year ago

> also, as a followup, the CoreNLP lemmatizer already properly handles number_JJR

If by chance you give the lemmatizer number with the tag JJR, it returns the lemma numb. I used it to convert those trees to a UD representation in the commit I made above, for example.

https://github.com/stanfordnlp/handparsed-treebank/commit/c1a405b9875cbfa794ac4cd1c8028b5c4b5a7f30
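
As a toy illustration of why the lemma depends on the tag, here is a sketch of a lemmatizer keyed on the (word, tag) pair. The class `ToyLemmatizer` and its single rule are hypothetical and deliberately naive (it would mangle "bigger" or "nicer"); it is not how the CoreNLP lemmatizer is implemented, only a way to see the same surface form map to two different lemmas.

```java
import java.util.Locale;

/** Deliberately naive, hypothetical lemmatizer: the lemma depends on the POS tag. */
final class ToyLemmatizer {

    /**
     * Strips a trailing "er" when the tag is JJR (comparative adjective);
     * any other tag returns the lowercased word unchanged.
     */
    static String lemma(String word, String tag) {
        String lower = word.toLowerCase(Locale.ROOT);
        if ("JJR".equals(tag) && lower.endsWith("er") && lower.length() > 3) {
            return lower.substring(0, lower.length() - 2);
        }
        return lower;
    }

    public static void main(String[] args) {
        System.out.println("number/NN  -> " + lemma("number", "NN"));
        System.out.println("number/JJR -> " + lemma("number", "JJR"));
    }
}
```

The point is only that a wrong tag upstream (NN instead of JJR) leaves the lemma as number even when the lemmatizer itself knows the JJR case.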

stevenbedrick commented 1 year ago

Aha! Now I understand, thank you. So if the "right" PoS tag is assigned, the lemmatizer knows what to do with it. That's good to know!

-SB
