ClearNlpSegmenter chokes on character: ’

GoogleCodeExporter commented 9 years ago

Having a backtick in a string yields an exception in ClearNlpSegmenter.

I guess this should also be reported in clearnlp, but I wanted to document this 
here too so we can adapt the wrapper in case this isn't resolved at the source.

Original issue reported on code.google.com by torsten....@gmail.com on 7 Aug 2014 at 2:03

GoogleCodeExporter commented 9 years ago

Correction: This is actually not a backtick, but a special quote character.

Original comment by torsten....@gmail.com on 7 Aug 2014 at 2:05

GoogleCodeExporter commented 9 years ago

Addendum: the stack trace indicates that this might actually be a problem in 
SegmenterBase, not in ClearNLP:

Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.charAt(String.java:658)
    at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.trim(SegmenterBase.java:262)
    at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.createToken(SegmenterBase.java:239)
    at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.createToken(SegmenterBase.java:228)
    at de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter.process(ClearNlpSegmenter.java:102)
    at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.process(SegmenterBase.java:124)
    at de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter.process(ClearNlpSegmenter.java:74)
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)

Original comment by torsten....@gmail.com on 7 Aug 2014 at 2:11

GoogleCodeExporter commented 9 years ago

The problem is in line 95 of ClearNlpSegmenter:
tBegin = aText.indexOf(token, tBegin);
This yields -1 because the underlying segmenter returns a string in which 
’ has been replaced by ', so the token cannot be found in the original text.

Does anyone have an idea how we could catch that in the wrapper?

Original comment by torsten....@gmail.com on 7 Aug 2014 at 2:20
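
One way to catch this in the wrapper would be to compare characters after folding typographic quotes to their ASCII forms, instead of relying on an exact `indexOf`. The following is only a minimal sketch under stated assumptions: `TolerantTokenAligner`, `fold`, `indexOfFolded` and `align` are hypothetical names that exist in neither DKPro Core nor ClearNLP, the list of folded characters is a guess at what the upstream tokenizer normalizes, and the approach only works while every replacement is one character for one character.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of DKPro Core or ClearNLP. */
class TolerantTokenAligner {

    /** Folds a few typographic quotes to ASCII (assumed normalization list). */
    static char fold(char c) {
        switch (c) {
            case '\u2018':
            case '\u2019':
                return '\'';
            case '\u201C':
            case '\u201D':
                return '"';
            default:
                return c;
        }
    }

    /** Like text.indexOf(token, from), but compares folded characters. */
    static int indexOfFolded(String text, String token, int from) {
        for (int start = Math.max(from, 0); start + token.length() <= text.length(); start++) {
            int i = 0;
            while (i < token.length()
                    && fold(text.charAt(start + i)) == fold(token.charAt(i))) {
                i++;
            }
            if (i == token.length()) {
                return start;
            }
        }
        return -1;
    }

    /** Returns [begin, end) offsets into the original text for each token. */
    static List<int[]> align(String text, List<String> tokens) {
        List<int[]> spans = new ArrayList<>();
        int pos = 0;
        for (String token : tokens) {
            int begin = indexOfFolded(text, token, pos);
            if (begin < 0) {
                // Fail with a clear message instead of passing -1 downstream.
                throw new IllegalStateException(
                        "Token [" + token + "] not found in text after offset " + pos);
            }
            int end = begin + token.length();
            spans.add(new int[] { begin, end });
            pos = end;
        }
        return spans;
    }
}
```

With the text `It’s fine` and the assumed tokenizer output `It`, `'s`, `fine`, a plain `indexOf("'s", 2)` returns -1, while `align` still finds the second token at offsets 2 to 4 of the original text. If the tokenizer ever changes the length of a token, this alignment breaks and the original offsets would have to come from upstream.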

GoogleCodeExporter commented 9 years ago

The segmenter must not change the text ;)

Original comment by richard.eckart on 7 Aug 2014 at 2:21

GoogleCodeExporter commented 9 years ago

So I will file an issue upstream.

Original comment by torsten....@gmail.com on 7 Aug 2014 at 2:23

GoogleCodeExporter commented 9 years ago

Actually I don't understand how a segmenter can change the text. In those 
instances that I see, trim is called like this:

trim(aJCas.getDocumentText(), span);

Original comment by richard.eckart on 7 Aug 2014 at 2:30

GoogleCodeExporter commented 9 years ago

The span in this call is already [-1,1], as the output from the underlying 
segmenter could not be found in the original string (line 95).

Original comment by torsten....@gmail.com on 7 Aug 2014 at 2:33
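
To make the failure path concrete, here is a small sketch of how a span starting at -1 leads to the `charAt` exception from the stack trace. `naiveTrim` is a simplified stand-in written for this illustration and is not the actual `SegmenterBase.trim` code; `checkedTrim` and the class name `TrimGuardDemo` are likewise hypothetical. A span of [-1,1] would be consistent with a two-character token whose begin offset was -1, although that is only an inference from the numbers above.

```java
/** Hypothetical demo class; the real SegmenterBase.trim may differ. */
class TrimGuardDemo {

    /** Simplified whitespace trim on a [begin, end) span, without bounds checks. */
    static void naiveTrim(String text, int[] span) {
        int begin = span[0];
        int end = span[1];
        // With begin == -1 the first charAt call throws
        // StringIndexOutOfBoundsException.
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        span[0] = begin;
        span[1] = end;
    }

    /** Same trim, but rejects spans that cannot refer to the text. */
    static void checkedTrim(String text, int[] span) {
        if (span[0] < 0 || span[1] > text.length() || span[0] > span[1]) {
            throw new IllegalArgumentException(
                    "Invalid span [" + span[0] + "," + span[1]
                    + "] for text of length " + text.length());
        }
        naiveTrim(text, span);
    }
}
```

Such a guard would not fix the misalignment, but it would turn the opaque `String index out of range: -1` into a message that points at the span.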

GoogleCodeExporter commented 9 years ago

For the upstream tokenization code, it may be acceptable that the tokenizer 
also normalizes. But if that is the case, you could consider this:

Create a normalizer component that uses the tokenizer only to update the text 
and adds no segmentation.
Create a segmenter component that runs after the normalizer and adds the 
token boundaries, *assuming* that the tokenizer does not try to normalize 
already normalized text.
Theoretically, you could wrap both steps into one normalizer/segmenter 
component, but... maybe not so nice.
... if you care to go that way, drop me a note, because I have a nice 
CAS-multiplier-based normalization base class lying around that has not yet 
made it into DKPro Core.

Otherwise, it would be good if the upstream tokenizer allowed some way of 
getting the original offsets corresponding to the normalized tokens.

Original comment by richard.eckart on 7 Aug 2014 at 2:38
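
The two-step idea can be sketched in plain Java, without any UIMA machinery. Everything here is an assumption made for illustration: `NormalizeThenSegment`, `normalize`, `toyTokenize` and `segment` are hypothetical names, the toy tokenizer merely splits on whitespace after replacing ’ with ', and a real normalizer component would have to produce a new CAS with the updated document text rather than a new `String`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "normalize first, then segment". */
class NormalizeThenSegment {

    /** Step 1: stand-in for the normalization the tokenizer applies. */
    static String normalize(String text) {
        return text.replace('\u2019', '\'');
    }

    /** Toy stand-in for the upstream tokenizer: normalizes, then splits on whitespace. */
    static List<String> toyTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : normalize(text).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Step 2: locate each token in the document text with an exact indexOf. */
    static List<int[]> segment(String documentText) {
        List<int[]> spans = new ArrayList<>();
        int pos = 0;
        for (String token : toyTokenize(documentText)) {
            int begin = documentText.indexOf(token, pos);
            if (begin < 0) {
                throw new IllegalStateException("Token not found: " + token);
            }
            int end = begin + token.length();
            spans.add(new int[] { begin, end });
            pos = end;
        }
        return spans;
    }
}
```

Calling `segment` on the raw text `It’s fine` fails because the token `It's` does not occur in it, whereas `segment(normalize(text))` succeeds, since normalizing already normalized text changes nothing. The offsets then refer to the normalized text, which is why the normalized text has to become the document text before the segmenter runs.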

GoogleCodeExporter commented 9 years ago

I will probably not have time to go the long way with this issue in the near 
future.

I filed an issue upstream.

Original comment by torsten....@gmail.com on 7 Aug 2014 at 2:41

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 14 Aug 2014 at 10:08

Added labels: DKPro-ASL, Module-clearnlp

GoogleCodeExporter commented 9 years ago

https://github.com/clearnlp/clearnlp/issues/5

Original comment by richard.eckart on 14 Aug 2014 at 10:09

kulukimak / dkpro-core-asl

ClearNlpSegmenter chokes on character: ’ #445