Open erksch opened 4 years ago
MUC is smaller than all or conll in terms of file size, but it has 7 classes, so this may increase the memory footprint when loaded. What happens if you only load the MUC model?
Perhaps there is something silently failing when loading that model in a low memory environment.
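To test that suggestion, the combiner can be pointed at the MUC model alone via the `ner.model` property. A minimal sketch — the path below mirrors what I believe `DefaultPaths.DEFAULT_NER_MUC_MODEL` resolves to, and pipeline construction is left commented out since it needs the CoreNLP jars and models on the classpath:

```java
import java.util.Properties;

public class MucOnlyCheck {
    // Assumed classpath location of the MUC model; the canonical constant is
    // edu.stanford.nlp.pipeline.DefaultPaths.DEFAULT_NER_MUC_MODEL.
    static final String MUC_MODEL =
            "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz";

    public static Properties mucOnlyProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // Restrict the NER combiner to the single MUC model to isolate the failure.
        props.setProperty("ner.model", MUC_MODEL);
        return props;
    }

    public static void main(String[] args) {
        Properties props = mucOnlyProps();
        System.out.println(props.getProperty("ner.model"));
        // With CoreNLP on the classpath one would then do:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

If the NPE still appears with only this model configured, the combiner itself is ruled out.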
On Sun, Oct 27, 2019 at 3:49 PM erksch notifications@github.com wrote:
I (somewhat) successfully integrated CoreNLP (3.9.2) in an Android app. The following annotator configuration works just fine:

```kotlin
props.setProperty("annotators", "tokenize,ssplit,pos,lemma")
```
But as soon as I add the NER annotator I start to get the following error:

```
Caused by: java.lang.NullPointerException: Attempt to invoke interface method 'int java.util.List.size()' on a null object reference
    at edu.stanford.nlp.util.HashIndex.size(HashIndex.java:94)
    at edu.stanford.nlp.ie.crf.CRFClassifier.getCliqueTree(CRFClassifier.java:1499)
    at edu.stanford.nlp.ie.crf.CRFClassifier.getSequenceModel(CRFClassifier.java:1190)
    at edu.stanford.nlp.ie.crf.CRFClassifier.getSequenceModel(CRFClassifier.java:1186)
    at edu.stanford.nlp.ie.crf.CRFClassifier.classifyMaxEnt(CRFClassifier.java:1218)
    at edu.stanford.nlp.ie.crf.CRFClassifier.classify(CRFClassifier.java:1128)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifySentence(AbstractSequenceClassifier.java:299)
    at edu.stanford.nlp.ie.ClassifierCombiner.classify(ClassifierCombiner.java:476)
    at edu.stanford.nlp.ie.NERClassifierCombiner.classifyWithGlobalInformation(NERClassifierCombiner.java:269)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifySentenceWithGlobalInformation(AbstractSequenceClassifier.java:343)
    at edu.stanford.nlp.pipeline.NERCombinerAnnotator.doOneSentence(NERCombinerAnnotator.java:368)
    at edu.stanford.nlp.pipeline.SentenceAnnotator.annotate(SentenceAnnotator.java:102)
    at edu.stanford.nlp.pipeline.NERCombinerAnnotator.annotate(NERCombinerAnnotator.java:310)
    at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:637)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:629)
```
The code I use (Kotlin):

```kotlin
val props = Properties()
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner")
pipeline = StanfordCoreNLP(props)
val document = CoreDocument("Joe Smith is from Seattle.")
pipeline.annotate(document)
```
The error is very similar to the one described in this issue https://github.com/stanfordnlp/CoreNLP/issues/861 where the author tried to use the parser annotator.

Debugging
I debugged the stack trace and found that the error is caused by this line (on `classIndex.size()`) in `CRFClassifier:1480` (https://github.com/stanfordnlp/CoreNLP/blob/eb43d5d9150de97f8061fa06b838f1d021586789/src/edu/stanford/nlp/ie/crf/CRFClassifier.java#L1480):

```java
return CRFCliqueTree.getCalibratedCliqueTree(data, labelIndices, classIndex.size(), classIndex,
    flags.backgroundSymbol, getCliquePotentialFunctionForTest(), featureVal);
```
Meaning `classIndex` is null and was not initialized properly.
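As an illustration of how an interrupted deserialization produces exactly this symptom: the classifier's fields are read from the `ObjectInputStream` in a fixed order, so if loading stops partway through, every field after the failure point stays null. A self-contained sketch with a hypothetical `Model` class (not CoreNLP's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class PartialLoadDemo {
    public static class Model {
        public Object labelIndices; // deserialized first
        public Object classIndex;   // deserialized second; stays null on early failure
    }

    // Serialize only ONE object, then try to read TWO: the second read fails and
    // classIndex is never assigned -- mirroring a silently failed model load.
    public static Model simulateTruncatedLoad() {
        Model m = new Model();
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
                oos.writeObject("labels");
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
                m.labelIndices = ois.readObject();
                m.classIndex = ois.readObject(); // EOFException: only one object was written
            }
        } catch (IOException | ClassNotFoundException e) {
            // swallowed here, mimicking a failure that only surfaces later as an NPE
        }
        return m;
    }

    public static void main(String[] args) {
        Model m = simulateTruncatedLoad();
        System.out.println("labelIndices = " + m.labelIndices);
        System.out.println("classIndex == null: " + (m.classIndex == null));
    }
}
```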
The `classIndex` property of `CRFClassifier` is initialized in the `loadClassifier(ObjectInputStream ois, Properties props)` method (https://github.com/stanfordnlp/CoreNLP/blob/eb43d5d9150de97f8061fa06b838f1d021586789/src/edu/stanford/nlp/ie/crf/CRFClassifier.java#L2570):

```java
public void loadClassifier(ObjectInputStream ois, Properties props) {
  Object o = ois.readObject();
  [...]
  classIndex = (Index<String>) ois.readObject();
```

I found out that the passed `ObjectInputStream` is effectively a stream on the file from a model path that is determined in the `NERCombinerAnnotator` constructor:
```java
public NERCombinerAnnotator(Properties properties) throws IOException {
  List<String> models = new ArrayList<>();
  String modelNames = properties.getProperty("ner.model");
  if (modelNames == null) {
    modelNames = DefaultPaths.DEFAULT_NER_THREECLASS_MODEL + ','
        + DefaultPaths.DEFAULT_NER_MUC_MODEL + ','
        + DefaultPaths.DEFAULT_NER_CONLL_MODEL;
  }
  [...]
  String[] loadPaths = models.toArray(new String[models.size()]);
```

Those `loadPaths` are iterated in the `loadClassifiers` method in `ClassifierCombiner`:
```java
private void loadClassifiers(Properties props, List<String> paths) throws IOException {
  baseClassifiers = new ArrayList<>();
  [...]
  for (String path : paths) {
    AbstractSequenceClassifier cls = loadClassifierFromPath(props, path);
    baseClassifiers.add(cls);
    [...]
  }
```

By adding a breakpoint to this method I found out that the first model path (`DefaultPaths.DEFAULT_NER_THREECLASS_MODEL`) in the first iteration of the for-loop is loaded without problems and the `classIndex` property is set correctly:
[image: Bildschirmfoto vom 2019-10-27 23-39-33] https://user-images.githubusercontent.com/19290349/67643066-2a3f5800-f913-11e9-93a5-d4a9c44b8b02.png
And as you can see at the bottom of the screenshot, the classIndex is initialized properly.
But in the second iteration, when loading from DefaultPaths.DEFAULT_NER_MUC_MODEL, it fails:
[image: Bildschirmfoto vom 2019-10-27 23-43-18] https://user-images.githubusercontent.com/19290349/67643095-9457fd00-f913-11e9-94f0-6bf9e8d1d6c2.png

Workaround
My current workaround is to just set the ner model to only the threeclass and conll models:

```kotlin
props.setProperty("ner.model", DefaultPaths.DEFAULT_NER_THREECLASS_MODEL + "," + DefaultPaths.DEFAULT_NER_CONLL_MODEL)
```
But I actually don't know what the consequences are if the MUC model is missing.
Explanation

My theory is that the MUC model is especially large and thus cannot be loaded into memory on a mobile device. Is that true? How big is the model in particular? But when I monitor the memory consumption of the app I don't spot anything critical; the app stays under 512 MB before it crashes.
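One way to probe how much memory a model actually needs is to measure the heap retained by a single load. A sketch — the 16 MB array is a placeholder standing in for a real call such as `CRFClassifier.getClassifier(path)`:

```java
import java.util.function.Supplier;

public class HeapDelta {
    static Object loaded; // keeps the loaded object reachable while we measure

    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // best effort only, but it makes the delta far less noisy
        return rt.totalMemory() - rt.freeMemory();
    }

    // Returns the approximate number of heap bytes retained by one load.
    public static long measure(Supplier<Object> loader) {
        long before = usedHeap();
        loaded = loader.get();
        return usedHeap() - before;
    }

    public static void main(String[] args) {
        // Placeholder: a 16 MB array standing in for a real classifier load.
        long bytes = measure(() -> new byte[16 * 1024 * 1024]);
        System.out.println("approx bytes retained: " + bytes);
    }
}
```

On Android the same idea works with `Runtime`, though `Debug.getMemoryInfo` gives a fuller picture of native plus managed usage.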
On my Linux box, it seems that the all3 model needs more memory than the muc7 model. Hopefully the muc model loads correctly when used by itself, in which case the problem is most likely low memory.
I tried various lower-than-expected amounts of memory and couldn't trigger this specific error, though. It was always a GC or OOM error.
When using the MUC model only I get the same error as above. Debugging the baseClassifiers array it looks like this:

[image: Bildschirmfoto vom 2019-10-28 17-41-52] https://user-images.githubusercontent.com/19290349/67698229-45a97200-f9aa-11e9-8fca-4d0e001abfde.png

As you can see, the same errors as above.
If the model is smaller, then maybe it's not due to memory... Is there something fundamentally different between the MUC and the THREECLASS or CONLL models? By the way, I use the latest models from the Maven repository.
PS: I now use my custom NER models anyway and it works like a charm without any problems! Thank you very much for this wonderful library and for enabling me to accomplish offline NLP & NER for Android.
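For anyone else targeting constrained devices, wiring in a single custom model can look roughly like this — the model path is hypothetical, while `ner.useSUTime` and `ner.applyNumericClassifiers` are existing CoreNLP properties that are often disabled to reduce footprint:

```java
import java.util.Properties;

public class CustomNerProps {
    // The model path is supplied by the caller; point it at your own
    // serialized CRF model.
    public static Properties forModel(String customModelPath) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // A single custom model replaces the three default English models.
        props.setProperty("ner.model", customModelPath);
        // CoreNLP switches that shed extra machinery on small devices.
        props.setProperty("ner.useSUTime", "false");
        props.setProperty("ner.applyNumericClassifiers", "false");
        return props;
    }
}
```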
Here's a random question. Are you certain the initialization is complete at this point? I've found that in certain cases Java can spend a long time trying to GC the model loading process when it's just barely out of memory. I don't know of any reason why you would get a null pointer here without having gotten a GC or OOM exception first, but perhaps the issue is it hasn't given up and died by the time you try to query it.
Frankly the odds of us being able to diagnose it are pretty low considering none of us do Android development. As I said in the parser issue, changing the memory requirements locally doesn't seem to get this particular bug. I just always get a GC or OOM exception. Without being able to reproduce it locally, I'm not sure how we can fix it.
Actually, are you a licensee? I'm not a lawyer, but it sounds like you're developing an app where you want to redistribute CoreNLP, and if that's a proprietary or paid app, that may not comply with our license restrictions.
https://stanfordnlp.github.io/CoreNLP/#license
If you were a licensee, that might give us some budget to invest time & effort in debugging android specific issues. Hint, hint.
@AngledLuffa Thanks for the hint with the license. I am not a licensee (yet) but will get in contact once everything works as expected.
But remember that due to the required minSdkVersion of 26 (8% of devices), using CoreNLP for B2C Android apps is not really an option. Maybe if this worked on more devices you would license more software, hint hint ;)
Java 8 is 5.5 years old at this point, and it has a lot of features which really make programming easier. Considering we're usually developing for newer, beefier machines, it would be very limiting to hold ourselves back in terms of java version.
I understand the desire to not need a network connection to use the models, but you might need to consider it if only 8% of devices can run the models locally.
Having said that, I really don't understand how this particular error is happening. Was there any luck investigating if the initialization is complete? Like I mentioned, it doesn't make a lot of sense for everything to be null like that without some sort of error showing up while it was initializing.
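One way to check whether initialization completed is to load each model path individually before building the pipeline and report failures up front, instead of letting them surface later as an NPE. A sketch — the loader function is injected; with CoreNLP it would wrap `CRFClassifier.getClassifier(path)`, catching its checked exceptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class LoadCheck {
    // Applies the loader to each model path and collects the ones that fail or
    // come back null, so a bad model is reported at startup rather than at
    // annotation time.
    public static List<String> failedPaths(List<String> paths, Function<String, Object> loader) {
        List<String> failed = new ArrayList<>();
        for (String path : paths) {
            try {
                if (loader.apply(path) == null) {
                    failed.add(path);
                }
            } catch (Exception | OutOfMemoryError e) {
                failed.add(path + " (" + e.getClass().getSimpleName() + ")");
            }
        }
        return failed;
    }
}
```

If the MUC path is the only one reported here, the failure is in loading rather than in the combiner.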
@erksch congratulations on being the first person I've seen in 4+ years to report running NER locally on an Android phone!
As an update, CoreNLP 4.0.0 uses less memory for NER than previous versions, and there have been even more optimizations in the master branch since the 4.0.0 release. Do you have any interest in retrying the MUC model?