Chinese Segmentation Pipeline IndexOutOfBoundsException

claeyzre commented 7 years ago

Hello all,

The segmentation pipeline seems to throw IndexOutOfBoundsException much more often in the latest version, 3.8.0. I get the exception on some sentences, especially ones with unusual characters such as emojis, but those sentences used to work in 3.7.0. Here is the stack trace:

java.lang.RuntimeException: java.lang.IndexOutOfBoundsException: Index: 220, Size: 220
    at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:466)
    at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:432)
    at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73)
    at org.apache.storm.daemon.executor$fn__4973$fn__4986$fn__5039.invoke(executor.clj:846)
    at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:484)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IndexOutOfBoundsException: Index: 220, Size: 220
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.runSegmentation(ChineseSegmenterAnnotator.java:306)
    at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.doOneSentence(ChineseSegmenterAnnotator.java:124)
    at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.annotate(ChineseSegmenterAnnotator.java:118)
    at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:599)

It happens when annotating sentences with a pipeline configured as follows:

"annotators" "segment, ssplit, ner"
"customAnnotatorClass.segment" "edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator"
"segment.model" "xx/stanford-chinese-corenlp-2017-06-09-models/edu/stanford/nlp/models/segmenter/chinese/ctb.gz"
"segment.sighanCorporaDict" "xx/stanford-chinese-corenlp-2017-06-09-models/edu/stanford/nlp/models/segmenter/chinese"
"segment.serDictionary" "xx/stanford-chinese-corenlp-2017-06-09-models/edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz"
"segment.sighanPostProcessing" "true"
"ssplit.boundaryTokenRegex" "[.]|[!?]+|[。]|[！？]+"
"ner.model" "xx/stanford-chinese-corenlp-2017-06-09-models/edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz"
"ner.applyNumericClassifiers" "false"
"ner.useSUTime" "false"

In 3.7.0 I usually got warnings like this one:

ChineseUtils.normalize warning: private use area codepoint U+e310 in 每一个认真

and only a few IndexOutOfBoundsExceptions (in my test dump, about 20 exceptions over 500,000 sentences). In 3.8.0 it is more like 100,000 exceptions over the same 500,000 sentences.

This might be meant as an improvement, since these sentences definitely cannot be segmented and an error may be more meaningful than a warning.

Any advice?

Thanks.
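Failure counts like the ones quoted above can be gathered by wrapping the annotate call and counting the sentences whose exception chain contains an IndexOutOfBoundsException. The following is only a sketch: the class FailureCounter and its methods are hypothetical names, and the Consumer stands in for whatever call runs the pipeline on one sentence (for example a lambda around StanfordCoreNLP.annotate), so it runs without CoreNLP on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class FailureCounter {

    /** Walks the cause chain, since the exception may arrive wrapped (as in the Storm trace). */
    static boolean hasCause(Throwable t, Class<? extends Throwable> type) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (type.isInstance(cur)) {
                return true;
            }
        }
        return false;
    }

    /** Runs annotate on every sentence and counts those failing with IndexOutOfBoundsException. */
    static int countFailures(List<String> sentences, Consumer<String> annotate) {
        int failures = 0;
        for (String sentence : sentences) {
            try {
                annotate.accept(sentence);
            } catch (RuntimeException e) {
                if (hasCause(e, IndexOutOfBoundsException.class)) {
                    failures++;   // skip this sentence, keep processing the rest
                } else {
                    throw e;      // anything else is unexpected, so do not hide it
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Stand-in annotator (an assumption for the demo): fails on one marker sentence.
        Consumer<String> fake = s -> {
            if (s.equals("bad")) {
                throw new RuntimeException(new IndexOutOfBoundsException("Index: 1, Size: 1"));
            }
        };
        System.out.println(countFailures(Arrays.asList("ok", "bad", "ok"), fake)); // prints 1
    }
}
```

Catching per sentence keeps one bad input from killing a whole batch, at the cost of silently dropping that sentence's annotations, so it is worth logging the skipped text as well.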

marcusklang commented 7 years ago

I also encounter this with CoreNLP 3.8.0. I have no experience with older versions.

A concrete, reproducible sample from zhwiki that causes an IndexOutOfBoundsException: https://gist.github.com/marcusklang/982a1009db54510f2ac072b0da1dece2

This sample also crashes corenlp.run.

The concrete backtrace within CoreNLP:

java.lang.IndexOutOfBoundsException: Index: 8565, Size: 8565
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.runSegmentation(ChineseSegmenterAnnotator.java:260)
    at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.doOneSentence(ChineseSegmenterAnnotator.java:124)
    at edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.annotate(ChineseSegmenterAnnotator.java:118)
    at edu.stanford.nlp.pipeline.TokenizerAnnotator.annotate(TokenizerAnnotator.java:309)
    at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:599)

claeyzre commented 7 years ago

It seems that, instead of emitting a warning when the pipeline encounters emojis or other unusual characters, it now leaves them out of the string's size calculation. Later, when the length is checked against the original sentence, the IndexOutOfBoundsException occurs.

pnugues commented 7 years ago

It seems that a few characters are crashing the system, including the corenlp.run server. The symbol 𤈦 is an example of such characters.

pnugues commented 7 years ago

A follow-up: all the characters in the CJK_Unified_Ideographs_Extension_B block seem to crash CoreNLP when the language is set to Chinese. See the list here: http://demo.icu-project.org/icu-bin/ubrowse?scr=94&b=24226
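CJK Unified Ideographs Extension B (U+20000 to U+2A6DF) and most emoji, such as 😺 (U+1F63A), lie outside the Basic Multilingual Plane, so Java stores each one as a surrogate pair of two UTF-16 chars. Whether that is what trips the segmenter is not confirmed in this thread, but on an affected version one possible workaround is to detect or replace such characters before annotation. A minimal sketch, with hypothetical class and method names:

```java
class SupplementaryCheck {

    /** True if the text contains any code point outside the Basic Multilingual Plane. */
    static boolean hasSupplementary(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    /** Replaces every supplementary code point with a single BMP placeholder char. */
    static String replaceSupplementary(String text, char placeholder) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append(placeholder);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Built from the code point so the example does not depend on source-file encoding.
        String emoji = new String(Character.toChars(0x1F63A)); // 😺
        String text = "没有什么可比性" + emoji;

        System.out.println(emoji.length());                            // 2 UTF-16 chars
        System.out.println(emoji.codePointCount(0, emoji.length()));   // 1 code point
        System.out.println(hasSupplementary(text));                    // true
        System.out.println(hasSupplementary(replaceSupplementary(text, '?'))); // false
    }
}
```

Replacing a surrogate pair with one placeholder char shortens the string by one char per replacement, so character offsets in the output no longer line up with the original text. If offsets matter, appending the placeholder twice keeps the lengths equal.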

cbowdon commented 5 years ago

I'm encountering the same bug in v3.9.2. The issue is in ChineseSegmenterAnnotator/advancePos, which attempts to relate the index of a token from the segmenter to a character index from ChineseSegmenterAnnotator/splitCharacters by incrementing an offset until the words match:

  // w is the word from the segmenter
  // sentChars is the array from splitCharacters
  private static int advancePos(List<CoreLabel> sentChars, int pos, String w) {
    StringBuilder sb = new StringBuilder();
    // Note: pos is never checked against sentChars.size(), so if the
    // accumulated characters never equal w, get(pos) eventually throws.
    while ( ! w.equals(sb.toString())) {
      sb.append(sentChars.get(pos).get(CoreAnnotations.ChineseCharAnnotation.class));
      pos++;
    }
    return pos;
  }

This fails on emoji because the segmenter outputs ? for unknowns. As a result, the while condition is never satisfied, pos runs past the end of sentChars, and you quickly get an IndexOutOfBoundsException.

I think it's tricky to fix because you have to look ahead in the list of tokens from the segmenter and the array from splitChars until you find the next word they agree on, which might not be straightforward.
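To make the failure mode concrete, here is a bounds-checked variant of the loop as a minimal sketch. It is not the CoreNLP code and not necessarily how the library handles this: it uses plain strings instead of CoreLabel so it runs without CoreNLP on the classpath, and the class name and the -1 return convention are assumptions for illustration. It only reports the mismatch; the harder part described above, resynchronizing the two sequences, is left to the caller.

```java
import java.util.Arrays;
import java.util.List;

class AdvancePosSketch {

    /**
     * Returns the position just past the characters that spell out w,
     * or -1 if the characters starting at pos cannot be made to match w.
     */
    static int advancePos(List<String> sentChars, int pos, String w) {
        StringBuilder sb = new StringBuilder();
        while (!w.equals(sb.toString())) {
            // Give up once the input is exhausted, or once sb is already as
            // long as w: sb only grows, so it can no longer become equal to w.
            if (pos >= sentChars.size() || sb.length() >= w.length()) {
                return -1;
            }
            sb.append(sentChars.get(pos));
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Built from the code point so the example does not depend on source-file encoding.
        String emoji = new String(Character.toChars(0x1F63A)); // 😺
        List<String> chars = Arrays.asList("可", "比", "性", emoji);

        System.out.println(advancePos(chars, 0, "可比性")); // prints 3
        System.out.println(advancePos(chars, 3, "?"));      // prints -1 instead of throwing
    }
}
```

With the unchecked loop, the second call would read index 4 of a 4-element list, which is the same "index equals size" pattern seen in the stack traces in this thread.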

AngledLuffa commented 5 years ago

I ran this against our current codebase:

java edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file bar.txt -outputFormat text

The input file contained a few characters from the linked page:

𤈠𤈡𤈢

It did not barf as I would expect it to from this bug report. Either this has been fixed already, or I need more information to go on. Please either send a (small!) file which reproduces the error, or confirm that the current github does not have this error. Thanks!

bnomis commented 5 years ago

This is still happening for me in 3.9.2, but only with the Chinese tagger. English is OK.

Error is: stanfordnlp.server.client.AnnotationException: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index 8 out of bounds for length 8

Example text follows

没有什么可比性😺

AngledLuffa commented 5 years ago

Thank you, this is a very clear example. I should have it fixed soon.

AngledLuffa commented 5 years ago

Should be fixed with change 361a56cc93f479969078940cb3f49d52afebb355

AngledLuffa commented 2 years ago

(also, the txt file pasted above now works)

stanfordnlp / CoreNLP

Chinese Segmentation Pipeline IndexOutOfBoundsException #490