UimaDocumentIterator.java, nextDocument() may try to get next document and sentence when resource does not exist

Issue Description

The first line of the UimaDocumentIterator nextDocument() function (https://github.com/pzalex/deeplearning4j/blob/uimafix/deeplearning4j-nlp-parent/deeplearning4j-nlp-uima/src/main/java/org/deeplearning4j/text/documentiterator/UimaDocumentIterator.java), which checks whether a resource is available, may not have been constructed properly:

if (this.sentences == null || !this.sentences.hasNext() && this.reader.hasNext()) { // get the text data; }

Boolean AND has higher precedence than OR, so the statement above is grouped as if this.sentences == null || [ !this.sentences.hasNext() && this.reader.hasNext() ] (square brackets are not Java code). It reads as: IF sentences is null, OR there is no next sentence AND there is a file to read, execute the rest. Evaluation is still left to right with short-circuiting, so a null 'sentences' does not throw inside the AND clause; the null check simply makes the whole condition true. Because "sentences == null" sits in the unparenthesized OR clause, the statements that follow try to retrieve text whether or not a file exists, which can throw an exception.
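A minimal sketch of the two groupings, using plain booleans in place of the real iterator state. The class and method names are hypothetical and exist only for illustration; they are not part of DL4J.

```java
public class PrecedenceDemo {
    // Mirrors the condition as written in nextDocument(): && binds tighter than ||,
    // so this is sentencesNull || (!sentencesHasNext && readerHasNext).
    public static boolean original(boolean sentencesNull, boolean sentencesHasNext, boolean readerHasNext) {
        return sentencesNull || !sentencesHasNext && readerHasNext;
    }

    // The grouping with explicit parentheses: only load when the reader has a document.
    public static boolean parenthesized(boolean sentencesNull, boolean sentencesHasNext, boolean readerHasNext) {
        return (sentencesNull || !sentencesHasNext) && readerHasNext;
    }

    public static void main(String[] args) {
        // Case: sentences is null and the reader has no document left.
        System.out.println(original(true, false, false));      // true: body runs with nothing to read
        System.out.println(parenthesized(true, false, false)); // false: body is skipped
    }
}
```

The two versions differ only when sentences is null and the reader is exhausted, which is the case described above.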

The statement in hasNext() seems to be correct:

return this.reader.hasNext() || this.sentences != null && this.sentences.hasNext();

In plain English: return true if there is a document to read, or if sentences is not null and has a sentence left.
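As a rough illustration of why the hasNext() expression is safe, here is a sketch that uses java.util.Iterator as a stand-in for both the UIMA reader and the sentence iterator (an assumption; the real types differ). The null check guards the call on the right of &&, so a null sentences never gets dereferenced.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

public class HasNextSketch {
    // Same shape as the hasNext() expression quoted in the issue.
    public static boolean hasNext(Iterator<String> reader, Iterator<String> sentences) {
        return reader.hasNext() || sentences != null && sentences.hasNext();
    }

    public static void main(String[] args) {
        Iterator<String> emptyReader = Collections.emptyIterator();
        Iterator<String> oneSentence = Arrays.asList("a sentence").iterator();

        System.out.println(hasNext(emptyReader, null));        // false, and no NullPointerException
        System.out.println(hasNext(emptyReader, oneSentence)); // true: a sentence is still pending
    }
}
```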

At runtime

If hasNext() has not been checked first, nextDocument() can throw an exception.

Next steps

Get DL4j NLP-UIMA source code into the local clearClinical installation to fix. Commit and push to DL4j uimafix branch if fixed.

Version Information

Deeplearning4j uimafix branch
CentOS

Contributing

If you'd like to help us fix the issue by contributing some code, but would like guidance or help in doing so, please mention it!

Can you post a stack trace of this? I'm confused about which client program is creating this error and whether the caller is unhappy getting a blank sentence "document".

I'm also puzzled as to why this error is occurring now in our update, and what weird situation yields no document to read AND an empty sentence iterator. Theoretically the caller should be checking hasNext() before calling this... My theory is that it is unhappy with a blank line being returned and that there is some null checking involved upstream, so returning null instead of a blank line may be a better bet.

You can also modify it so that there are parentheses around this.sentences == null || !this.sentences.hasNext() in nextDocument(). Also, return null instead of an empty string and change the error message.
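A sketch of what the suggested change could look like, under stated assumptions: the class below is a hypothetical stand-in, not the actual UimaDocumentIterator. The reader is modeled as an iterator over lists of sentences, and all UIMA processing and error handling is left out.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PatchedIteratorSketch {
    private final Iterator<List<String>> reader; // stand-in for the UIMA collection reader
    private Iterator<String> sentences;          // null until the first document is loaded

    public PatchedIteratorSketch(List<List<String>> documents) {
        this.reader = documents.iterator();
    }

    public boolean hasNext() {
        return reader.hasNext() || sentences != null && sentences.hasNext();
    }

    public String nextDocument() {
        // Parenthesized guard: load a new document only if the reader actually has one.
        if ((sentences == null || !sentences.hasNext()) && reader.hasNext()) {
            sentences = reader.next().iterator();
        }
        if (sentences != null && sentences.hasNext()) {
            return sentences.next();
        }
        return null; // nothing left: null rather than an empty string
    }

    public static void main(String[] args) {
        PatchedIteratorSketch it = new PatchedIteratorSketch(
                Arrays.asList(Arrays.asList("s1", "s2"), Arrays.asList("s3")));
        while (it.hasNext()) {
            System.out.println(it.nextDocument());
        }
        System.out.println(it.nextDocument()); // null once everything is consumed
    }
}
```

With this guard, calling nextDocument() without checking hasNext() first yields null instead of attempting to read a document that does not exist.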

Here is a trace with UimaDocumentIterator throwing the exception last. This is not the exception I was after when I started examining the file; I was tracing the NPE that is posted on an issue page in the clearClinical repo. Will update soon.

06 Dec 2017 13:49:02 WARN UimaDocumentIterator - Error reading input stream...this is just a warning..Going to return
org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.apache.ctakes.contexttokenizer.ae.ContextDependentTokenizerAnnotator.process(ContextDependentTokenizerAnnotator.java:105)
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:396)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:314)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:570)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:412)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:344)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:265)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:269)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:284)
    at org.deeplearning4j.text.documentiterator.UimaDocumentIterator.nextDocument(UimaDocumentIterator.java:74)
    at org.deeplearning4j.text.sentenceiterator.StreamLineIterator.nextSentence(StreamLineIterator.java:63)
    at org.deeplearning4j.text.sentenceiterator.interoperability.SentenceIteratorConverter.nextDocument(SentenceIteratorConverter.java:43)
    at org.deeplearning4j.text.documentiterator.BasicLabelAwareIterator.nextDocument(BasicLabelAwareIterator.java:42)
    at org.deeplearning4j.text.documentiterator.BasicLabelAwareIterator.next(BasicLabelAwareIterator.java:68)
    at org.deeplearning4j.text.documentiterator.BasicLabelAwareIterator.next(BasicLabelAwareIterator.java:17)
    at org.deeplearning4j.parallelism.AsyncIterator$ReaderThread.run(AsyncIterator.java:102)
Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.apache.ctakes.contexttokenizer.ae.ContextDependentTokenizerAnnotator.executeFSMs(ContextDependentTokenizerAnnotator.java:167)
    at org.apache.ctakes.contexttokenizer.ae.ContextDependentTokenizerAnnotator.process(ContextDependentTokenizerAnnotator.java:102)
    ... 16 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
Dec 06, 2017 1:49:02 PM org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl callAnalysisComponentProcess(430)

After patching UimaDocumentIterator.java by parenthesizing the condition as (sentences is null OR there is no next sentence) AND there is a document to read, the model got built for the MIMIC files in directory semeval-2015-task-14_updated/semeval-2014-unlabeled-mimic-notes.v1/ecg/17/ with 2400 files. That directory was always giving trouble, and the model had never completed with that directory and the original UimaDocumentIterator.java.

08 Dec 2017 11:18:22 INFO SequenceVectors - Epoch [1] finished; Elements processed so far: [598185]; Sequences processed: [40059]
08 Dec 2017 11:18:22 INFO SequenceVectors - Time spent on training: 19378 ms
Similarity between abdomen and pain NaN []
08 Dec 2017 11:18:24 INFO WordVectorSerializer - Word2Vec conf. JSON: {"allowParallelTokenization":true,"batchSize":1000,"elementsLearningAlgorithm":null,"epochs":1,"hugeModelExpected":false,"iterations":3,"layersSize":300,"learningRate":0.025,"learningRateDecayWords":0,"minLearningRate":0.01,"minWordFrequency":5,"modelUtils":"org.deeplearning4j.models.embeddings.reader.impl.BasicModelUtils","negative":10.0,"ngram":0,"preciseWeightInit":false,"sampling":1.0E-5,"scavengerActivationThreshold":2000000,"scavengerRetentionDelay":3,"seed":0,"sequenceLearningAlgorithm":null,"stop":"STOP","stopList":[],"tokenPreProcessor":null,"tokenizerFactory":"org.deeplearning4j.text.tokenization.tokenizerfactory.UimaTokenizerFactory","trainElementsVectors":true,"trainSequenceVectors":true,"unk":"UNK","useAdaGrad":false,"useHierarchicSoftmax":true,"useUnknown":false,"variableWindows":null,"vocabSize":996,"window":5}
08 Dec 2017 11:18:25 INFO InMemoryLookupTable - Initializing syn1...
Test /home/azotov/dl4j0.9.2.snap_mimic_word2vec_model_word2vecSerialer_ecg_17:[]

Process finished with exit code 0

It's not a good model if we can't get any similarity between abdomen and pain. Is it never doing any tokenization? I'm sure abdomen and pain should be in there; they are definitely in there for discharge notes.
