stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

POS tag errors #1141

Closed (zapatistas closed this issue 2 years ago)

zapatistas commented 3 years ago

Hello, I am using the maximum entropy tagger combined with the lexical parser, and I noticed that some proper nouns, such as the cities "bombay" or "rome", are not recognized by the tagger or are assigned noun tags. There are also times when nouns (e.g. "food") or adjectives (such as "spanish") are tagged as proper nouns. One example is the phrase "spanish food", where both words were assigned a proper noun tag (when using the "english-left3words-distsim.tagger" model, only "spanish" is tagged as a proper noun). In another instance, the city "madrid" was assigned an adjective tag. When I tried to use the lexical parser without POS tags generated beforehand, I got similar errors, but "madrid", for example, was assigned an adverb tag instead (the sentence was "in madrid please").

I started with the POS tag model provided in the "english-bidirectional-distsim.tagger" file, and I am running the tagger in Visual Studio through the .NET NuGet package provided. The dataset I am using is the (6) dialog bAbI tasks dataset. When I use "english-left3words-distsim.tagger", I do not get unknown word tags, but proper nouns like "bombay" and "madrid" are still erroneously tagged as nouns, etc.

Are these errors expected, or might something be wrong with my implementation that causes this?

Thank you.

AngledLuffa commented 3 years ago

First question: are you feeding it lower case text or properly capitalized text? If I put this into the web demo, it tags both Madrid and Bombay as NNP:

I went to Madrid for a conference and made some friends from Bombay

zapatistas commented 3 years ago

The words are lower case, as I presented them, but because this analysis will be used for pattern recognition in task-oriented agents, I decided not to alter the text. When I feed it "london" or "paris", it does recognize those words as proper nouns, though. My biggest concern is that the tagger confuses proper nouns with adjectives/adverbs and, vice versa, adjectives like "spanish" with pronouns, creating inconsistent tags above an acceptable or manageable rate. For example, similar adjectives derived from ethnicity, such as "italian" or "indian", are correctly categorized, but for some reason "spanish" is not.

AngledLuffa commented 3 years ago

The models aren't trained on enough countries with lowercase names, although sometimes they are able to recognize them anyway. There are caseless models which you can try if you need to use lowercase country names. I'd suggest either doing that or processing the names to be capitalized.
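The second suggestion, capitalizing names before tagging, could be sketched roughly as follows. This is a language-agnostic illustration written in Python, not CoreNLP code: the `KNOWN_PROPER_NAMES` gazetteer and the `capitalize_known_names` helper are hypothetical, and a real list would have to come from the application's own domain data.

```python
import re

# Hypothetical gazetteer (an assumption for illustration only); in practice this
# would be built from the cities, cuisines, etc. that occur in the target domain.
KNOWN_PROPER_NAMES = {
    "bombay", "rome", "madrid", "london", "paris",
    "spanish", "italian", "indian",
}

def capitalize_known_names(text, names=KNOWN_PROPER_NAMES):
    """Capitalize whole-word matches of known names; leave all other words untouched."""
    def fix(match):
        word = match.group(0)
        return word.capitalize() if word.lower() in names else word
    # Each run of letters is treated as one word, so "romes" is not matched by "rome".
    return re.sub(r"[A-Za-z]+", fix, text)

print(capitalize_known_names("in madrid please"))  # prints "in Madrid please"
```

The capitalized text would then be passed to the tagger in place of the raw input. A lookup like this only helps for names that are already in the list, so it complements the caseless models rather than replacing them.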

zapatistas commented 3 years ago

OK, I will try that, but in cases of adjectives like "spanish" being assigned pronoun labels, do you know why that happens?

AngledLuffa commented 3 years ago

That's bizarre. Can you give an example sentence?

Spanish is supposed to be capitalized, but it shouldn't be labeled a pronoun regardless

zapatistas commented 3 years ago

One example is:

    can you book a table in rome with spanish cuisine in a expensive price range for two people

I am using the caseless model for this. I will try to use the truecase annotator, but I could not find the class, or any relevant method that I can use, outside of the pipeline module. Is that correct?

AngledLuffa commented 3 years ago

I got "spanish" as an NNP using the caseless tagger and the sentence you gave us. I agree that it should be JJ considering the usages in the training data. Even more weird is that it gave me "cuisine" tagged NNP as well.

If you send a couple more examples of obvious errors, I can add them to the model, retrain, and send that to you.

zapatistas commented 3 years ago

OK, I will try to do that ASAP and send them to you. Regarding this problem, my fear is that it may recur with other datasets, so I decided to implement a statistical "patch": for a word that has been assigned more than one POS tag, I would take the most frequent one. Using the caseless models, the correct tag dominates the wrong one, but I do not know whether that is wise or whether I should use another approach.

AngledLuffa commented 3 years ago

I don't recommend that in a general case. The whole idea behind the statistical taggers is to use the context to improve the tagging. For example, it should be giving different results for "Spanish cuisine" and "He speaks Spanish". If you are certain that all instances of Spanish in your domain will be the adjective use, though, then that makes sense.
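To make the trade-off concrete, here is a minimal sketch of the majority-vote "patch" discussed above, written in Python for brevity. The function names and the tags in `tagged` are hypothetical and hand-written for illustration; they are not real tagger output.

```python
from collections import Counter, defaultdict

def majority_tags(tagged_sentences):
    """Map each lowercased word to the tag it received most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def apply_majority(tagged_sentences, overrides):
    """Replace every tag with the majority tag for that word, ignoring context."""
    return [[(w, overrides.get(w.lower(), t)) for w, t in s]
            for s in tagged_sentences]

# Hypothetical tags (assumed for the example, not produced by CoreNLP).
tagged = [
    [("spanish", "JJ"), ("cuisine", "NN")],
    [("spanish", "NNP"), ("cuisine", "NNP")],   # the kind of error reported above
    [("spanish", "JJ"), ("cuisine", "NN")],
    [("he", "PRP"), ("speaks", "VBZ"), ("spanish", "NNP")],  # noun use
    [("spanish", "JJ"), ("food", "NN")],
]

overrides = majority_tags(tagged)
patched = apply_majority(tagged, overrides)
print(patched[1])  # the erroneous sentence is repaired
print(patched[3])  # but the noun use of "spanish" is overwritten as well
```

In this toy data the vote repairs the second sentence, but it also forces the adjective tag onto "he speaks spanish", which is the loss of context described above. The patch is therefore only safe when a word really has a single use in the domain.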
