Open loretoparisi opened 5 years ago
I cannot reproduce this error (using 3.9.2 or GitHub latest code). Could you provide more details about the context?
Command I used:
java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.applyFineGrained -file example.txt -outputFormat text
@J38 thanks a lot for the debugging. I dug a bit into the code and realized that this happens in one very specific use case:
1) The entity is composed of more than one token (hence George Washington).
2) We use ner.applyFineGrained with our custom annotator, which extends the SentenceAnnotator and uses the NERClassifierCombiner to recognize the new entity type ARTIST we have defined.
Given the text "George went to Washington, Rihanna is an artist", where the entity is a single token (hence George), it works as expected: we recognize both the base PERSON entity and our ARTIST entities:
"annotations": {
"sentences": [
  {
    "index": 0,
    "text": "George went to Washington, Rihanna is an artist",
    "line": 1,
    "structure": "A0",
    "paragraphIndex": 0,
    "paragraphStructure": "A0",
    "tokens": [
      {
        "index": 1,
        "word": "George",
        "characterOffsetBegin": 0,
        "characterOffsetEnd": 6,
        "before": "",
        "after": " ",
        "pos": "NNP",
        "ner": "PERSON",
        "lemma": "George",
        "snippet": "George went to Washington, Rihanna is an artist",
        "entityDelimiter": "U"
      },
      ...
      {
        "index": 4,
        "word": "Washington",
        "characterOffsetBegin": 15,
        "characterOffsetEnd": 25,
        "before": " ",
        "after": "",
        "pos": "NNP",
        "ner": "STATE_OR_PROVINCE",
        "lemma": "Washington",
        "snippet": "George went to Washington, Rihanna is an artist",
        "entityDelimiter": "U"
      },
      ...
      {
        "index": 6,
        "word": "Rihanna",
        "characterOffsetBegin": 27,
        "characterOffsetEnd": 34,
        "before": " ",
        "after": " ",
        "pos": "NNP",
        "ner": "ARTIST",
        "lemma": "Rihanna",
        "mxmID": "33491890",
        "snippet": "George went to Washington, Rihanna is an artist",
        "entityDelimiter": "U"
      },
      ...
    ],
In this case we run this configuration of ner.fine.regexner.mapping:
"ner.applyFineGrained": true,
"ner.fine.regexner.mapping": "header=true,mxm_nlpdata/mxm_casedentities.tab;ignorecase=true,edu/stanford/nlp/models/kbp/regexner_caseless.tab;edu/stanford/nlp/models/kbp/regexner_cased.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_entities.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_artists.tab;mxm_nlpdata/mxm_labels.tab;ignorecase=true, mxm_nlpdata/mxm_blacklist.tab"
So it seems that our custom SentenceAnnotator fails when it overrides the annotate method:
@Override
public void annotate(Annotation annotation) {
  if (VERBOSE) {
    log.info("Adding NER Combiner annotation ... ");
  }
  // if ner.usePresentDateForDocDate is set, use the present date as the doc date
  if (usePresentDateForDocDate) {
    String currentDate =
        new SimpleDateFormat("yyyy-MM-dd").format(Calendar.getInstance().getTime());
    annotation.set(CoreAnnotations.DocDateAnnotation.class, currentDate);
  }
  // use provided doc date if applicable
  if (!providedDocDate.equals("")) {
    annotation.set(CoreAnnotations.DocDateAnnotation.class, providedDocDate);
  }
  AnnotationsMask mask = new AnnotationsMask(true);
  Annotation maskedAnnotation = mask.decompose(annotation);
  super.annotate(maskedAnnotation);
  this.ner.finalizeAnnotation(maskedAnnotation);
  if (VERBOSE) {
    log.info("done.");
  }
  // if Spanish, run the regexner with Spanish number rules
  if (LanguageInfo.HumanLanguage.SPANISH.equals(language))
    spanishNumberAnnotator.annotate(maskedAnnotation);
  // if fine grained ner is requested, run that
  if (this.applyFineGrained) {
    fineGrainedNERAnnotator.annotate(maskedAnnotation);
    // set the FineGrainedNamedEntityTagAnnotation.class
    for (CoreLabel token : maskedAnnotation.get(CoreAnnotations.TokensAnnotation.class)) {
      String fineGrainedTag = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
      token.set(CoreAnnotations.FineGrainedNamedEntityTagAnnotation.class, fineGrainedTag);
    }
  }
  // if entity mentions should be built, run that
  if (this.buildEntityMentions)
    entityMentionsAnnotator.annotate(maskedAnnotation);
  Map<Class, Object> mapped_defaults = new HashMap<>();
  mapped_defaults.put(CoreAnnotations.NamedEntityTagAnnotation.class, "O");
  mapped_defaults.put(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class, null);
  mapped_defaults.put(MXMCoreAnnotations.MXMSlangCorrectionAnnotation.class, null);
  mapped_defaults.put(MXMCoreAnnotations.MXMEntityID.class, null);
  mapped_defaults.put(CoreAnnotations.LinkAnnotation.class, null);
  mapped_defaults.put(CoreAnnotations.ValueAnnotation.class, null);
  mapped_defaults.put(TimeExpression.Annotation.class, null);
  mapped_defaults.put(TimeExpression.TimeIndexAnnotation.class, null);
  mapped_defaults.put(CoreAnnotations.DistSimAnnotation.class, null);
  mapped_defaults.put(CoreAnnotations.NumericCompositeTypeAnnotation.class, null);
  mapped_defaults.put(TimeExpression.ChildrenAnnotation.class, null);
  mapped_defaults.put(CoreAnnotations.NumericTypeAnnotation.class, null);
  mapped_defaults.put(CoreAnnotations.ShapeAnnotation.class, null);
  mapped_defaults.put(Tags.TagsAnnotation.class, null);
  mapped_defaults.put(CoreAnnotations.NumerizedTokensAnnotation.class, null);
  mapped_defaults.put(CoreAnnotations.AnswerAnnotation.class, null);
  mapped_defaults.put(CoreAnnotations.NumericCompositeValueAnnotation.class, null);
  mapped_defaults.put(CoreAnnotations.CoarseNamedEntityTagAnnotation.class, null);
  mapped_defaults.put(CoreAnnotations.FineGrainedNamedEntityTagAnnotation.class, null);
  annotation = mask.recompose(annotation, maskedAnnotation, mapped_defaults);
}
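One way to narrow down where a tag gets lost in an override like the one above is to snapshot the per-token NER tags at a few points (for example after super.annotate, after fineGrainedNERAnnotator.annotate, and after mask.recompose) and compare them. Below is a minimal, self-contained sketch of such a comparison. NerTagDiff and lostTags are hypothetical names, not part of CoreNLP, and the tags in main are made-up placeholders rather than real pipeline output. In the annotator, the lists would be filled from token.get(CoreAnnotations.NamedEntityTagAnnotation.class).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical debugging aid (not a CoreNLP class): compares NER tags captured at two points. */
final class NerTagDiff {

  /**
   * Returns one line per token that carried an entity tag in {@code before}
   * but is "O" or null in {@code after}.
   */
  static List<String> lostTags(List<String> words, List<String> before, List<String> after) {
    if (words.size() != before.size() || words.size() != after.size()) {
      throw new IllegalArgumentException("All three lists must have one entry per token");
    }
    List<String> report = new ArrayList<>();
    for (int i = 0; i < words.size(); i++) {
      String b = before.get(i);
      String a = after.get(i);
      boolean hadEntity = b != null && !"O".equals(b);
      boolean nowEmpty = a == null || "O".equals(a);
      if (hadEntity && nowEmpty) {
        report.add(i + ":" + words.get(i) + " " + b + " -> " + a);
      }
    }
    return report;
  }

  public static void main(String[] args) {
    // Placeholder data for illustration only; real values would come from the annotator.
    List<String> words = Arrays.asList("George", "Washington");
    List<String> afterStatisticalNer = Arrays.asList("PERSON", "PERSON");
    List<String> afterLaterStep = Arrays.asList("O", "PERSON");
    for (String line : lostTags(words, afterStatisticalNer, afterLaterStep)) {
      System.out.println(line);
    }
  }
}
```

Running the same comparison between each pair of consecutive steps would show whether the tag disappears in the fine-grained step or only after the recompose call with its "O" default.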
Could you show me the pipeline settings? Did you create a statistical model to tag "ARTIST"?
Also for reference, here is the latest write up on the NER process, which is pretty detailed about each step:
@J38 yes of course. My configuration looks like this:
var options = {
  "lang": "en",
  "annotators": "tokenize,mxmssplit,mxmslang,mxmphonetics,mxmsegmenter,mxmpos,mxmlemma,mxmner,mxmsentiment",
  // POS
  "customAnnotatorClass.mxmpos": "musixmatch_nlp.MXMPartOfSpeechAnnotator",
  // LEMMATIZER
  "customAnnotatorClass.mxmlemma": "musixmatch_nlp.MXMMorphaAnnotator",
  // PHONEMES
  "customAnnotatorClass.mxmphonetics": "musixmatch_nlp.MXMPhoneticsAnnotator",
  // SEGMENTER
  "customAnnotatorClass.mxmsegmenter": "musixmatch_nlp.MXMLyricsSegmenterAnnotator",
  // SLANG
  "customAnnotatorClass.mxmslang": "musixmatch_nlp.MXMSlangCorrector",
  // NER
  "customAnnotatorClass.mxmner": "musixmatch_nlp.MXMNERCombinerAnnotator",
  // SPLIT
  "customAnnotatorClass.mxmssplit": "musixmatch_nlp.MXMWordToSentencesAnnotator",
  // SENTIMENT
  "customAnnotatorClass.mxmsentiment": "musixmatch_nlp.MXMSentimentTensorflowAnnotator",
  "mxmphonetics.ipa_dict": "/root/en_cmuipadict.txt",
  "mxmsentiment.model_dir": "/root/blstm_att1530026090",
  "mxmslang.language": "en",
  "ssplit.newlineIsSentenceBreak": "always",
  "ner.applyFineGrained": true,
  "ner.buildEntityMentions": false,
  "ner.fine.regexner.mapping": "header=true,mxm_nlpdata/mxm_casedentities.tab;ignorecase=true,edu/stanford/nlp/models/kbp/regexner_caseless.tab;edu/stanford/nlp/models/kbp/regexner_cased.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_entities.tab;ignorecase=true,header=true, mxm_nlpdata/mxm_artists.tab;mxm_nlpdata/mxm_labels.tab;ignorecase=true, mxm_nlpdata/mxm_blacklist.tab"
};
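The ner.fine.regexner.mapping value above packs several files into one string: entries are separated by semicolons, and each entry is a comma-separated list of key=value options followed by the file path. Some entries have a space after the last comma (for example before mxm_nlpdata/mxm_artists.tab). Whether CoreNLP trims that space is not confirmed here, so the sketch below only makes it visible. MappingSpecInspector is a hypothetical helper written for this thread, not a CoreNLP class, and its parsing rules are an assumption based on the string's visible layout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a CoreNLP class): splits a ner.fine.regexner.mapping value for inspection. */
final class MappingSpecInspector {

  /** One ';'-separated entry: its key=value options plus the file path. */
  static final class Entry {
    final Map<String, String> options = new LinkedHashMap<>();
    String path = "";
    boolean pathHadWhitespace = false;
  }

  static List<Entry> parse(String spec) {
    List<Entry> entries = new ArrayList<>();
    for (String rawEntry : spec.split(";")) {
      if (rawEntry.trim().isEmpty()) {
        continue; // skip empty entries such as a trailing ';'
      }
      Entry entry = new Entry();
      for (String field : rawEntry.split(",")) {
        String trimmed = field.trim();
        if (trimmed.isEmpty()) {
          continue;
        }
        int eq = trimmed.indexOf('=');
        if (eq > 0) {
          // key=value option such as header=true or ignorecase=true
          entry.options.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
        } else {
          // anything without '=' is treated as the file path
          entry.path = trimmed;
          entry.pathHadWhitespace = !trimmed.equals(field);
        }
      }
      entries.add(entry);
    }
    return entries;
  }

  public static void main(String[] args) {
    String spec = "header=true,mxm_nlpdata/mxm_casedentities.tab;"
        + "ignorecase=true,header=true, mxm_nlpdata/mxm_artists.tab";
    for (Entry e : parse(spec)) {
      System.out.println(e.path + " " + e.options
          + (e.pathHadWhitespace ? "  <- path has surrounding whitespace" : ""));
    }
  }
}
```

If any flagged entry turns out to be the file that defines the multi-token names, removing the stray space would be a cheap thing to rule out before digging into the annotator code.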
We have several class extensions here; the important part for the NER classifier is the mxmner annotator and its configuration, "musixmatch_nlp.MXMNERCombinerAnnotator".
You can find above the Java class that implements the MXMNERCombinerAnnotator, which extends the SentenceAnnotator.
It normally works and tags the new ARTIST type. It fails in the case presented above, when the entity spans multiple tokens.
@J38 Any idea why this happens? My annotate override in the Java annotator class is the same one posted above.
When ner.applyFineGrained is set to true, the NER annotator gets confused in some circumstances. In the reported phrase, the term George does not get any annotation, i.e. it has an O value in the output. When it is set to false, the annotator correctly detects George as a named entity. Any reason for this behavior?