claeyzre closed this issue 2 years ago
I also encounter this with CoreNLP 3.8.0. I have no experience with older versions.
A concrete reproducible sample from zhwiki which causes an IndexOutOfBoundsException: https://gist.github.com/marcusklang/982a1009db54510f2ac072b0da1dece2
This sample also crashes corenlp.run.
The concrete backtrace within CoreNLP:
java.lang.IndexOutOfBoundsException: Index: 8565, Size: 8565
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.runSegmentation(ChineseSegmenterAnnotator.java:260)
at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.doOneSentence(ChineseSegmenterAnnotator.java:124)
at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.annotate(ChineseSegmenterAnnotator.java:118)
at edu.stanford.nlp.pipeline.TokenizerAnnotator.annotate(TokenizerAnnotator.java:309)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:599)
It seems that instead of outputting a warning when the pipeline encounters emoji or other unusual characters, it ignores them in the length calculation of the string. Later, when that length is checked against the original sentence, the IndexOutOfBoundsException occurs.
It seems that a few characters crash the system, including the corenlp.run server. The symbol 𤈦 is one such character.
A follow-up: all the characters in the CJK_Unified_Ideographs_Extension_B block seem to crash CoreNLP when the language is set to Chinese. See the list here: http://demo.icu-project.org/icu-bin/ubrowse?scr=94&b=24226
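For background (my own note, not from the CoreNLP source): characters in that block lie outside the Basic Multilingual Plane, so Java represents each of them as a surrogate pair of two char code units. Any code that counts chars where it should count code points will be off by one for every such character, which is consistent with an off-by-N index error. A minimal demonstration:

```java
public class SupplementaryDemo {
    public static void main(String[] args) {
        // U+24226, from CJK Unified Ideographs Extension B
        String s = "𤈦";
        System.out.println(s.length());                      // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```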
I'm encountering the same bug in v3.9.2. The issue is in ChineseSegmenterAnnotator/advancePos, which attempts to relate the index of a token from the segmenter to a character index from ChineseSegmenterAnnotator/splitCharacters by incrementing an offset until the words match:
// w is the word from the segmenter
// sentChars is the array from splitCharacters
private static int advancePos(List<CoreLabel> sentChars, int pos, String w) {
  StringBuilder sb = new StringBuilder();
  while ( ! w.equals(sb.toString())) {
    sb.append(sentChars.get(pos).get(CoreAnnotations.ChineseCharAnnotation.class));
    pos++;
  }
  return pos;
}
This fails on emoji because the segmenter outputs ? for unknowns, so the while condition never matches and you quickly get an IndexOutOfBoundsException.
I think it's tricky to fix because you have to look ahead in the list of tokens from the segmenter and the array from splitChars until you find the next word they agree on, which might not be straightforward.
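One defensive shape the fix could take is sketched below. This is a hypothetical, simplified stand-in (plain Strings instead of CoreLabels, and it is not the actual change that landed in CoreNLP): bounds-check pos, and treat the segmenter's ? placeholder as a single-character wildcard, so the loop can never run off the end of the list.

```java
import java.util.Arrays;
import java.util.List;

public class AdvancePosSketch {
    // Simplified stand-in for advancePos: sentChars holds one String per
    // character (a supplementary character is a single entry). The segmenter's
    // '?' placeholder matches any one character, and pos is bounds-checked,
    // so the loop cannot overrun the list.
    static int advancePos(List<String> sentChars, int pos, String w) {
        int wLen = w.codePointCount(0, w.length());
        for (int matched = 0; matched < wLen && pos < sentChars.size(); matched++) {
            int wc = w.codePointAt(w.offsetByCodePoints(0, matched));
            String expected = new String(Character.toChars(wc));
            // '?' stands in for an unknown character: accept whatever is there.
            if (wc != '?' && !expected.equals(sentChars.get(pos))) {
                break; // out of sync; stop instead of throwing
            }
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        List<String> chars =
            Arrays.asList("没", "有", "什", "么", "可", "比", "性", "😺");
        System.out.println(advancePos(chars, 0, "没有")); // 2: matched two characters
        System.out.println(advancePos(chars, 7, "?"));    // 8: '?' consumes the emoji
        System.out.println(advancePos(chars, 8, "x"));    // 8: past the end, no exception
    }
}
```

A real fix would presumably also log a warning when it detects the mismatch rather than silently stopping.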
I ran this in our current codebase:
java edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file bar.txt -outputFormat text
The input file contained a few characters from the linked page:
𤈠𤈡𤈢
It did not barf as I would expect it to from this bug report. Either this has been fixed already, or I need more information to go on. Please either send a (small!) file which reproduces the error, or confirm that the current github does not have this error. Thanks!
On Fri, Aug 9, 2019 at 8:58 AM Chris Bowdon notifications@github.com wrote:
This is still happening for me in 3.9.2, but only with the Chinese tagger. English is OK.
Error is: stanfordnlp.server.client.AnnotationException: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index 8 out of bounds for length 8
Example text follows
没有什么可比性😺
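The arithmetic in that error lines up with a code-unit/code-point mix-up (my reading, not verified against the eventual fix): the sentence has 8 code points but 9 UTF-16 code units, because the emoji takes a surrogate pair, so an index built from one count will overrun a list built from the other.

```java
public class EmojiLengthDemo {
    public static void main(String[] args) {
        String s = "没有什么可比性😺";
        System.out.println(s.codePointCount(0, s.length())); // 8 code points
        System.out.println(s.length());                      // 9 UTF-16 code units
    }
}
```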
Thank you, this is a very clear example. I should have it fixed soon
On Tue, Sep 17, 2019 at 5:56 PM Simon Blanchard notifications@github.com wrote:
Should be fixed with change 361a56cc93f479969078940cb3f49d52afebb355
On Wed, Sep 18, 2019 at 7:42 PM John Bauer horatio@gmail.com wrote:
(also, the txt file pasted above now works)
Hello all,
The segmentation pipeline seems to throw more IndexOutOfBoundsExceptions in the latest 3.8.0 version. I get an IndexOutOfBoundsException on some sentences, especially ones with unusual characters like emoji, even though those used to work in 3.7.0. Here is the stack:
java.lang.RuntimeException: java.lang.IndexOutOfBoundsException: Index: 220, Size: 220
at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:466)
at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:432)
at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73)
at org.apache.storm.daemon.executor$fn__4973$fn__4986$fn__5039.invoke(executor.clj:846)
at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:484)
at clojure.lang.AFn.run(AFn.java:22)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IndexOutOfBoundsException: Index: 220, Size: 220
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.runSegmentation(ChineseSegmenterAnnotator.java:306)
at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.doOneSentence(ChineseSegmenterAnnotator.java:124)
at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.annotate(ChineseSegmenterAnnotator.java:118)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:599)
It happens during annotation of sentences with a pipeline using this configuration:
"annotators" "segment, ssplit, ner" "customAnnotatorClass.segment" "edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator" "segment.model" "xx/stanford-chinese-corenlp-2017-06-09-models/edu/stanford/nlp/models/segmenter/chinese/ctb.gz" "segment.sighanCorporaDict" "xx/stanford-chinese-corenlp-2017-06-09-models/edu/stanford/nlp/models/segmenter/chinese" "segment.serDictionary" "xx/stanford-chinese-corenlp-2017-06-09-models/edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz" "segment.sighanPostProcessing" "true" "ssplit.boundaryTokenRegex" "[.]|[!?]+|[。]|[!?]+" "ner.model" "xx/stanford-chinese-corenlp-2017-06-09-models/edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz" "ner.applyNumericClassifiers" "false" "ner.useSUTime" "false"
In 3.7.0 I usually had warnings like this one:
ChineseUtils.normalize warning: private use area codepoint U+e310 in 每一个认真
and a few IndexOutOfBoundsExceptions (in my test dump: ~20 exceptions over 500,000 sentences); in 3.8.0 it's more like ~100,000 exceptions over the same 500,000 sentences. It might be an improvement, as these sentences definitely cannot be segmented, and an error might be more meaningful than a warning.
Any advice?
Thanks.