stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Android: NER NullPointerException on some models #961

Open erksch opened 4 years ago

erksch commented 4 years ago

I (somewhat) successfully integrated CoreNLP (3.9.2) in an Android app. The following annotator configuration works just fine:

props.setProperty("annotators", "tokenize,ssplit,pos,lemma")

But as soon as I add the NER annotator I start to get the following error:

Caused by: java.lang.NullPointerException: Attempt to invoke interface method 'int java.util.List.size()' on a null object reference
        at edu.stanford.nlp.util.HashIndex.size(HashIndex.java:94)
        at edu.stanford.nlp.ie.crf.CRFClassifier.getCliqueTree(CRFClassifier.java:1499)
        at edu.stanford.nlp.ie.crf.CRFClassifier.getSequenceModel(CRFClassifier.java:1190)
        at edu.stanford.nlp.ie.crf.CRFClassifier.getSequenceModel(CRFClassifier.java:1186)
        at edu.stanford.nlp.ie.crf.CRFClassifier.classifyMaxEnt(CRFClassifier.java:1218)
        at edu.stanford.nlp.ie.crf.CRFClassifier.classify(CRFClassifier.java:1128)
        at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifySentence(AbstractSequenceClassifier.java:299)
        at edu.stanford.nlp.ie.ClassifierCombiner.classify(ClassifierCombiner.java:476)
        at edu.stanford.nlp.ie.NERClassifierCombiner.classifyWithGlobalInformation(NERClassifierCombiner.java:269)
        at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifySentenceWithGlobalInformation(AbstractSequenceClassifier.java:343)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.doOneSentence(NERCombinerAnnotator.java:368)
        at edu.stanford.nlp.pipeline.SentenceAnnotator.annotate(SentenceAnnotator.java:102)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.annotate(NERCombinerAnnotator.java:310)
        at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:637)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:629)

The code I use (Kotlin):

val props = Properties()
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner")
pipeline = StanfordCoreNLP(props)
val document = CoreDocument("Joe Smith is from Seattle.")
pipeline.annotate(document)
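
For completeness, this is how I read the labels back once annotation works (a short sketch continuing the snippet above; it assumes the standard CoreDocument/CoreLabel accessors):

// Print each token with its NER label, e.g. "Joe PERSON" and "Seattle LOCATION"
// (the exact labels depend on which models are active).
for (token in document.tokens()) {
    println("${token.word()}\t${token.ner()}")
}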

The error is very similar to the one described in this issue (https://github.com/stanfordnlp/CoreNLP/issues/861), where the author tried to use the parser annotator.

Debugging

I debugged the stack trace and found that the error is caused by this line (on classIndex.size()) in CRFClassifier:1480 (https://github.com/stanfordnlp/CoreNLP/blob/eb43d5d9150de97f8061fa06b838f1d021586789/src/edu/stanford/nlp/ie/crf/CRFClassifier.java#L1480):

return CRFCliqueTree.getCalibratedCliqueTree(data, labelIndices, classIndex.size(), 
  classIndex, flags.backgroundSymbol, getCliquePotentialFunctionForTest(), featureVal);

Meaning classIndex is null and was not initialized properly.

The classIndex property of CRFClassifier is initialized in the loadClassifier(ObjectInputStream ois, Properties props) method (https://github.com/stanfordnlp/CoreNLP/blob/eb43d5d9150de97f8061fa06b838f1d021586789/src/edu/stanford/nlp/ie/crf/CRFClassifier.java#L2570):

public void loadClassifier(ObjectInputStream ois, Properties props) {
    Object o = ois.readObject();
    [...]
    classIndex = (Index<String>) ois.readObject();

I found out that the passed ObjectInputStream is effectively a stream over the model file whose path is determined in the NERCombinerAnnotator constructor:

public NERCombinerAnnotator(Properties properties) throws IOException {
    List<String> models = new ArrayList<>();
    String modelNames = properties.getProperty("ner.model");
    if (modelNames == null) {
      modelNames = DefaultPaths.DEFAULT_NER_THREECLASS_MODEL + ',' + DefaultPaths.DEFAULT_NER_MUC_MODEL + ',' + DefaultPaths.DEFAULT_NER_CONLL_MODEL;
    }
    [...]
    String[] loadPaths = models.toArray(new String[models.size()]);

Those loadPaths are iterated in the loadClassifiers method in ClassifierCombiner:

 private void loadClassifiers(Properties props, List<String> paths) throws IOException {
    baseClassifiers = new ArrayList<>();
    [...]
    for(String path: paths) {
      AbstractSequenceClassifier<IN> cls = loadClassifierFromPath(props, path);
      baseClassifiers.add(cls);
      [...]
    }

By adding a breakpoint to this method, I found that in the first iteration of the for-loop the first model path (DefaultPaths.DEFAULT_NER_THREECLASS_MODEL) is loaded without problems and the classIndex property is set correctly:

[image: Bildschirmfoto vom 2019-10-27 23-39-33] https://user-images.githubusercontent.com/19290349/67643066-2a3f5800-f913-11e9-93a5-d4a9c44b8b02.png

But in the second iteration, when loading from DefaultPaths.DEFAULT_NER_MUC_MODEL, it fails:

[image: Bildschirmfoto vom 2019-10-27 23-43-18] https://user-images.githubusercontent.com/19290349/67643095-9457fd00-f913-11e9-94f0-6bf9e8d1d6c2.png
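
To narrow this down outside the full pipeline, each default model can also be loaded and queried on its own. This is only a sketch; it assumes CRFClassifier.getClassifier(String) accepts the same classpath model paths the pipeline uses:

import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.pipeline.DefaultPaths

fun main() {
    val paths = listOf(
        DefaultPaths.DEFAULT_NER_THREECLASS_MODEL,
        DefaultPaths.DEFAULT_NER_MUC_MODEL,
        DefaultPaths.DEFAULT_NER_CONLL_MODEL
    )
    for (path in paths) {
        try {
            // Load the serialized CRF model and run it once so classIndex is actually used.
            val classifier = CRFClassifier.getClassifier(path)
            println("$path OK: " + classifier.classifyToString("Joe Smith is from Seattle."))
        } catch (t: Throwable) { // Throwable so an OutOfMemoryError is not missed
            println("$path FAILED: $t")
        }
    }
}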

Workaround

My current workaround is to set the ner.model property to only the threeclass and conll models:

props.setProperty("ner.model", DefaultPaths.DEFAULT_NER_THREECLASS_MODEL + "," + DefaultPaths.DEFAULT_NER_CONLL_MODEL)

But I actually don't know what the consequences are if the MUC model is missing.
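
For reference, my full workaround props look roughly like this. The two extra ner.* flags are an assumption on my side to shave off more memory (they disable SUTime and the numeric/regex classifiers); they are not required for the workaround itself:

val props = Properties()
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner")
props.setProperty(
    "ner.model",
    DefaultPaths.DEFAULT_NER_THREECLASS_MODEL + "," + DefaultPaths.DEFAULT_NER_CONLL_MODEL
)
props.setProperty("ner.useSUTime", "false")               // skip temporal normalization
props.setProperty("ner.applyNumericClassifiers", "false") // skip the numeric sequence/regex classifiers
val pipeline = StanfordCoreNLP(props)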

Explanation

My theory is that the MUC model is especially large and thus cannot be loaded into memory on a mobile device. Is that true? How big is the model in particular? When I monitor the app's memory consumption, though, I don't spot anything critical; the app stays under 512 MB before it crashes.
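
To check the memory theory more directly, I can log the heap around pipeline construction. A minimal sketch (plain JVM calls, reusing the props from the snippet above):

fun usedHeapMb(): Long {
    val rt = Runtime.getRuntime()
    return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
}

val before = usedHeapMb()
val pipeline = StanfordCoreNLP(props)  // all NER models are loaded here
println("Heap used: before=$before MB, after=${usedHeapMb()} MB, max=${Runtime.getRuntime().maxMemory() / (1024 * 1024)} MB")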

AngledLuffa commented 4 years ago

MUC is smaller than all or conll in terms of file size, but it has 7 classes, so this may increase the memory footprint when loaded. What happens if you only load the MUC model?

Perhaps there is something silently failing when loading that model in a low memory environment.
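
Something like this should isolate it, reusing the Kotlin setup from your original post (just a sketch, not tested on Android):

val props = Properties()
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner")
props.setProperty("ner.model", DefaultPaths.DEFAULT_NER_MUC_MODEL)  // load only the MUC 7-class model
val pipeline = StanfordCoreNLP(props)
pipeline.annotate(CoreDocument("Joe Smith is from Seattle."))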


AngledLuffa commented 4 years ago

On my Linux box, it seems that the all3 model needs more memory than the muc7 model. Hopefully the MUC model loads correctly when loaded by itself, in which case the problem is most likely low memory.

I tried various lower-than-expected memory limits and couldn't trigger this specific error, though; it was always a GC or OOM error.


erksch commented 4 years ago

When using only the MUC model I get the same error as above. Debugging the baseClassifiers array, it looks like this:

[image: Bildschirmfoto vom 2019-10-28 17-41-52] https://user-images.githubusercontent.com/19290349/67698229-45a97200-f9aa-11e9-8fca-4d0e001abfde.png

As you can see, it's the same errors as above.

If the model is smaller, then maybe it's not due to memory... Is there something fundamentally different between the MUC model and the THREECLASS and CONLL models? By the way, I use the latest models from the Maven repository.

PS: I now use my custom NER models anyway and it works like a charm without any problems! Thank you very much for this wonderful library and for enabling me to accomplish offline NLP & NER for Android.

AngledLuffa commented 4 years ago

Here's a random question: are you certain the initialization is complete at this point? I've found that in certain cases Java can spend a long time garbage-collecting during model loading when it's just barely out of memory. I don't know of any reason why you would get a null pointer here without having gotten a GC or OOM exception first, but perhaps the issue is that it hasn't given up and died by the time you try to query it.

Frankly, the odds of us being able to diagnose it are pretty low, considering none of us do Android development. As I said in the parser issue, lowering the memory limits locally doesn't seem to reproduce this particular bug; I just always get a GC or OOM exception. Without being able to reproduce it locally, I'm not sure how we can fix it.

Actually, are you a licensee? I'm not a lawyer, but it sounds like you're developing an app where you want to redistribute corenlp, and if that's a proprietary or paid app, that may not agree with our license restrictions.

https://stanfordnlp.github.io/CoreNLP/#license

If you were a licensee, that might give us some budget to invest time & effort in debugging Android-specific issues. Hint, hint.


erksch commented 4 years ago

@AngledLuffa Thanks for the hint with the license. I am not a licensee (yet) but will get in contact once everything works as expected.

But remember that due to the required minSdkVersion of 26 (8% of devices), using CoreNLP for B2C Android apps is not really an option. Maybe if this worked on more devices, you would license more software, hint hint ;)

AngledLuffa commented 4 years ago

Java 8 is 5.5 years old at this point, and it has a lot of features which really make programming easier. Considering we're usually developing for newer, beefier machines, it would be very limiting to hold ourselves back in terms of java version.

I understand the desire to not need a network connection to use the models, but you might need to consider it if only 8% of devices can run the models locally.

Having said that, I really don't understand how this particular error is happening. Was there any luck investigating whether the initialization is complete? Like I mentioned, it doesn't make a lot of sense for everything to be null like that without some sort of error showing up while it was initializing.
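
One way to check would be to wrap the pipeline construction so that nothing thrown during model loading can get lost (a sketch; catching Throwable is deliberate so that an OutOfMemoryError on a background thread gets logged instead of silently dropped):

val pipeline = try {
    StanfordCoreNLP(props)
} catch (t: Throwable) {
    android.util.Log.e("CoreNLP", "pipeline construction failed", t)
    throw t
}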


J38 commented 4 years ago

@erksch congratulations on being the first person I've seen in 4+ years to report running NER locally on an Android phone!

AngledLuffa commented 4 years ago

As an update, CoreNLP 4.0.0 uses less memory for NER than previous versions, and there have been even more optimizations in the master branch since the 4.0.0 release. Do you have any interest in retrying the MUC model?
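
For anyone who wants to retry: the 4.0.0 artifacts are on Maven Central, so a Gradle setup would look roughly like this (build.gradle.kts sketch; the models artifact is large, so for an APK you would likely repackage only the model files you actually need):

dependencies {
    implementation("edu.stanford.nlp:stanford-corenlp:4.0.0")
    // The English models ship as a separate "models" classifier artifact.
    implementation("edu.stanford.nlp:stanford-corenlp:4.0.0:models")
}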