stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

ner.applyFineGrained and PERSON entity annotation #828

Open · loretoparisi opened this issue 5 years ago

loretoparisi commented 5 years ago

When ner.applyFineGrained is set to true, the NER annotator gets confused in some circumstances, for example with this phrase:

George Washington went to Washington

In this case the term George does not get any annotation, i.e. it has an O value in the output:

{
    "sentences": [{
                "index": 0,
                "text": "George Washington went to Washington",
                "line": 1,
                "sentimentValue": "1",
                "tokens": [{
                        "index": 1,
                        "word": "George",
                        "characterOffsetBegin": 0,
                        "characterOffsetEnd": 6,
                        "before": "",
                        "after": " ",
                        "pos": "NNP",
                        "ner": "O",
                        "lemma": "George"
                    },
                    {
                        "index": 2,
                        "word": "Washington",
                        "characterOffsetBegin": 7,
                        "characterOffsetEnd": 17,
                        "before": " ",
                        "after": " ",
                        "pos": "NNP",
                        "ner": "STATE_OR_PROVINCE"
                    },
                    {
                        "index": 3,
                        "word": "went",
                        "characterOffsetBegin": 18,
                        "characterOffsetEnd": 22,
                        "before": " ",
                        "after": " ",
                        "pos": "VBD",
                        "ner": "O"
                    },
                    {
                        "index": 4,
                        "word": "to",
                        "characterOffsetBegin": 23,
                        "characterOffsetEnd": 25,
                        "before": " ",
                        "after": " ",
                        "pos": "TO",
                        "ner": "O"
                    },
                    {
                        "index": 5,
                        "word": "Washington",
                        "characterOffsetBegin": 26,
                        "characterOffsetEnd": 36,
                        "before": " ",
                        "after": "",
                        "pos": "NNP",
                        "ner": "STATE_OR_PROVINCE"
                    }
                ]
            }
    ]
}

When it is instead set to false, the annotator correctly detects George as a PERSON, and the output looks like this:

{
    "sentences": [{
        "index": 0,
        "text": "George Washington went to Washington",
        "line": 1,
        "sentimentValue": "1",
        "tokens": [{
                "index": 1,
                "word": "George",
                "characterOffsetBegin": 0,
                "characterOffsetEnd": 6,
                "before": "",
                "after": " ",
                "pos": "NNP",
                "ner": "PERSON",
                "lemma": "George",
                "phoneme": "ʤɔˈɹʤ",
            },
            {
                "index": 2,
                "word": "Washington",
                "characterOffsetBegin": 7,
                "characterOffsetEnd": 17,
                "before": " ",
                "after": " ",
                "pos": "NNP",
                "ner": "PERSON",
                "lemma": "Washington",
            },
            {
                "index": 3,
                "word": "went",
                "characterOffsetBegin": 18,
                "characterOffsetEnd": 22,
                "before": " ",
                "after": " ",
                "pos": "VBD",
                "ner": "O",
                "lemma": "go"
            },
            {
                "index": 4,
                "word": "to",
                "characterOffsetBegin": 23,
                "characterOffsetEnd": 25,
                "before": " ",
                "after": " ",
                "pos": "TO",
                "ner": "O",
                "lemma": "to"
            },
            {
                "index": 5,
                "word": "Washington",
                "characterOffsetBegin": 26,
                "characterOffsetEnd": 36,
                "before": " ",
                "after": "",
                "pos": "NNP",
                "ner": "LOCATION",
                "lemma": "Washington"
            }
        ]
    }]
}

Any reason for this behavior?

J38 commented 5 years ago

I cannot reproduce this error (using 3.9.2 or GitHub latest code). Could you provide more details about the context?

Command I used:

java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.applyFineGrained -file example.txt -outputFormat text
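
For reference, a roughly equivalent check through the Java API (a minimal sketch using only the stock annotators and default models, not the custom pipeline discussed below; toggle ner.applyFineGrained to reproduce the two outputs above):

import java.util.Properties;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class FineGrainedNERCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // switch between "true" and "false" to compare the two outputs above
        props.setProperty("ner.applyFineGrained", "true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("George Washington went to Washington");
        pipeline.annotate(doc);

        // print the surface form and the final NER tag for each token
        for (CoreLabel token : doc.tokens()) {
            System.out.println(token.word() + "\t" + token.ner());
        }
    }
}
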
loretoparisi commented 5 years ago

@J38 thanks a lot for the debugging. I dug a bit into the code and realized that this happens in a very specific use case:

1) The entity is composed of more than one token (here, George Washington).
2) We use ner.applyFineGrained together with our custom annotator, which extends SentenceAnnotator and uses the NERClassifierCombiner to recognize the new entity type ARTIST that we have defined.

By contrast, given the text George went to Washington, Rihanna is an artist, where the entity is a single token (here, George), it works as expected: we recognize both the base PERSON entity and our ARTIST entity:

"annotations": {
    "sentences": [
      {
        "index": 0,
        "text": "George went to Washington, Rihanna is an artist",
        "line": 1,
        "structure": "A0",
        "paragraphIndex": 0,
        "paragraphStructure": "A0",
        "tokens": [
          {
            "index": 1,
            "word": "George",
            "characterOffsetBegin": 0,
            "characterOffsetEnd": 6,
            "before": "",
            "after": " ",
            "pos": "NNP",
            "ner": "PERSON",
            "lemma": "George",
            "snippet": "George went to Washington, Rihanna is an artist",
            "entityDelimiter": "U"
          },
          ...
          {
            "index": 4,
            "word": "Washington",
            "characterOffsetBegin": 15,
            "characterOffsetEnd": 25,
            "before": " ",
            "after": "",
            "pos": "NNP",
            "ner": "STATE_OR_PROVINCE",
            "lemma": "Washington",
            "snippet": "George went to Washington, Rihanna is an artist",
            "entityDelimiter": "U"
          },
          ...
          {
            "index": 6,
            "word": "Rihanna",
            "characterOffsetBegin": 27,
            "characterOffsetEnd": 34,
            "before": " ",
            "after": " ",
            "pos": "NNP",
            "ner": "ARTIST",
            "lemma": "Rihanna",
            "mxmID": "33491890",
            "snippet": "George went to Washington, Rihanna is an artist",
            "entityDelimiter": "U"
          },
...
    ],

In this case we run with this configuration of "ner.fine.regexner.mapping":

       "ner.applyFineGrained": true,
        "ner.fine.regexner.mapping": "header=true,mxm_nlpdata/mxm_casedentities.tab;ignorecase=true,edu/stanford/nlp/models/kbp/regexner_caseless.tab;edu/stanford/nlp/models/kbp/regexner_cased.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_entities.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_artists.tab;mxm_nlpdata/mxm_labels.tab;ignorecase=true, mxm_nlpdata/mxm_blacklist.tab"  

So it seems that our custom SentenceAnnotator fails when it overrides the annotate method:

@Override
    public void annotate(Annotation annotation) {
        if (VERBOSE) {
            log.info("Adding NER Combiner annotation ... ");
        }

        // if ner.usePresentDateForDocDate is set, use the present date as the doc date
        if (usePresentDateForDocDate) {
            String currentDate =
                    new SimpleDateFormat("yyyy-MM-dd").format(Calendar.getInstance().getTime());
            annotation.set(CoreAnnotations.DocDateAnnotation.class, currentDate);
        }
        // use provided doc date if applicable
        if (!providedDocDate.equals("")) {
            annotation.set(CoreAnnotations.DocDateAnnotation.class, providedDocDate);
        }

        AnnotationsMask mask = new AnnotationsMask(true);

        Annotation maskedAnnotation = mask.decompose(annotation);

        super.annotate(maskedAnnotation);
        this.ner.finalizeAnnotation(maskedAnnotation);

        if (VERBOSE) {
            log.info("done.");
        }
        // if Spanish, run the regexner with Spanish number rules
        if (LanguageInfo.HumanLanguage.SPANISH.equals(language))
            spanishNumberAnnotator.annotate(maskedAnnotation);
        // if fine grained ner is requested, run that
        if (this.applyFineGrained) {
            fineGrainedNERAnnotator.annotate(maskedAnnotation);
            // set the FineGrainedNamedEntityTagAnnotation.class
            for (CoreLabel token : maskedAnnotation.get(CoreAnnotations.TokensAnnotation.class)) {
                String fineGrainedTag = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                token.set(CoreAnnotations.FineGrainedNamedEntityTagAnnotation.class, fineGrainedTag);
            }
        }
        // if entity mentions should be built, run that
        if (this.buildEntityMentions)
            entityMentionsAnnotator.annotate(maskedAnnotation);

        Map<Class, Object> mapped_defaults = new HashMap<>();

        mapped_defaults.put(CoreAnnotations.NamedEntityTagAnnotation.class, "O");
        mapped_defaults.put(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class, null);
        mapped_defaults.put(MXMCoreAnnotations.MXMSlangCorrectionAnnotation.class, null);
        mapped_defaults.put(MXMCoreAnnotations.MXMEntityID.class, null);
        mapped_defaults.put(CoreAnnotations.LinkAnnotation.class, null);
        mapped_defaults.put(CoreAnnotations.ValueAnnotation.class, null);
        mapped_defaults.put(TimeExpression.Annotation.class, null);
        mapped_defaults.put(TimeExpression.TimeIndexAnnotation.class, null);
        mapped_defaults.put(CoreAnnotations.DistSimAnnotation.class, null);
        mapped_defaults.put(CoreAnnotations.NumericCompositeTypeAnnotation.class, null);
        mapped_defaults.put(TimeExpression.ChildrenAnnotation.class, null);
        mapped_defaults.put(CoreAnnotations.NumericTypeAnnotation.class, null);
        mapped_defaults.put(CoreAnnotations.ShapeAnnotation.class, null);
        mapped_defaults.put(Tags.TagsAnnotation.class, null);
        mapped_defaults.put(CoreAnnotations.NumerizedTokensAnnotation.class, null);
        mapped_defaults.put(CoreAnnotations.AnswerAnnotation.class, null);
        mapped_defaults.put(CoreAnnotations.NumericCompositeValueAnnotation.class, null);
        mapped_defaults.put(CoreAnnotations.CoarseNamedEntityTagAnnotation.class, null);
        mapped_defaults.put(CoreAnnotations.FineGrainedNamedEntityTagAnnotation.class, null);

        annotation = mask.recompose(annotation, maskedAnnotation, mapped_defaults);

    }
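
One way to narrow this down (a diagnostic sketch, not something from the original report) is to take the custom annotator out of the loop and run the stock ner annotator with ner.fine.regexner.mapping pointing at a single custom mapping file: if PERSON still gets overwritten on the multi-token name, the custom rules are involved; otherwise the overwrite comes from the bundled fine-grained rules. The mxm_nlpdata path below is the hypothetical one from the configuration in this thread:

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class IsolateFineGrainedMapping {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        props.setProperty("ner.applyFineGrained", "true");
        // only the custom artist mapping, no bundled KBP regexner files
        props.setProperty("ner.fine.regexner.mapping",
                "ignorecase=true,header=true,mxm_nlpdata/mxm_artists.tab");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("George Washington went to Washington");
        pipeline.annotate(doc);

        for (CoreLabel token : doc.tokens()) {
            // ner() is the final tag; the fine-grained key is expected to hold the
            // tag produced by the fine-grained rules when applyFineGrained is on
            System.out.println(token.word()
                    + "\t" + token.ner()
                    + "\t" + token.get(CoreAnnotations.FineGrainedNamedEntityTagAnnotation.class));
        }
    }
}
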
J38 commented 5 years ago

Could you show me the pipeline settings? Did you create a statistical model to tag "ARTIST"?

Also for reference, here is the latest write-up on the NER process, which is pretty detailed about each step:

https://stanfordnlp.github.io/CoreNLP/ner.html

loretoparisi commented 5 years ago

@J38 yes of course. My configuration looks like this

var options = {

        "lang": "en",

        "annotators": "tokenize,mxmssplit,mxmslang,mxmphonetics,mxmsegmenter,mxmpos,mxmlemma,mxmner,mxmsentiment",

        // POS
        "customAnnotatorClass.mxmpos": "musixmatch_nlp.MXMPartOfSpeechAnnotator",

        // LEMMATIZER
        "customAnnotatorClass.mxmlemma": "musixmatch_nlp.MXMMorphaAnnotator",

        // PHONEMES
        "customAnnotatorClass.mxmphonetics": "musixmatch_nlp.MXMPhoneticsAnnotator",

        // SEGMENTER
        "customAnnotatorClass.mxmsegmenter": "musixmatch_nlp.MXMLyricsSegmenterAnnotator",

        // SLANG
        "customAnnotatorClass.mxmslang": "musixmatch_nlp.MXMSlangCorrector",

        // NER
        "customAnnotatorClass.mxmner": "musixmatch_nlp.MXMNERCombinerAnnotator",

        // SPLIT
        "customAnnotatorClass.mxmssplit": "musixmatch_nlp.MXMWordToSentencesAnnotator",

        // SENTIMENT
        "customAnnotatorClass.mxmsentiment": "musixmatch_nlp.MXMSentimentTensorflowAnnotator",

        "mxmphonetics.ipa_dict": "/root/en_cmuipadict.txt",
        "mxmsentiment.model_dir": "/root/blstm_att1530026090",
        "mxmslang.language": "en",
        "ssplit.newlineIsSentenceBreak": "always",

        "ner.applyFineGrained": true,
        "ner.buildEntityMentions": false,

        "ner.fine.regexner.mapping": "header=true,mxm_nlpdata/mxm_casedentities.tab;ignorecase=true,edu/stanford/nlp/models/kbp/regexner_caseless.tab;edu/stanford/nlp/models/kbp/regexner_cased.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_entities.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_artists.tab;mxm_nlpdata/mxm_labels.tab;ignorecase=true, mxm_nlpdata/mxm_blacklist.tab"

    };
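
As background on the customAnnotatorClass.* entries above: CoreNLP loads each named class reflectively, and (per the custom annotator documentation) the class should implement Annotator and provide a constructor taking the annotator name and the pipeline Properties. A minimal, purely illustrative skeleton (the real MXM* classes are not shown in this thread):

import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import edu.stanford.nlp.ling.CoreAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.Annotator;

public class MyCustomAnnotator implements Annotator {

    // CoreNLP instantiates custom annotators with (annotator name, pipeline properties)
    public MyCustomAnnotator(String name, Properties props) {
        // read any annotator-specific properties here
    }

    @Override
    public void annotate(Annotation annotation) {
        // add or modify token/sentence annotations here
    }

    @Override
    public Set<Class<? extends CoreAnnotation>> requirementsSatisfied() {
        return Collections.emptySet();
    }

    @Override
    public Set<Class<? extends CoreAnnotation>> requires() {
        return Collections.emptySet();
    }
}
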

We have several class extensions here, but the important part for the NER classifier is mxmner and its class "musixmatch_nlp.MXMNERCombinerAnnotator". You can find above the Java class that implements MXMNERCombinerAnnotator, which extends SentenceAnnotator. It normally works and tags the new ARTIST entities; it only fails in the multi-token case presented above.

loretoparisi commented 5 years ago

@J38 Any idea why this happens? The annotate override in my Java annotator class is the one posted above.
