Open GoogleCodeExporter opened 9 years ago
Correction: This is actually not a backtick, but a special quote character.
Original comment by torsten....@gmail.com
on 7 Aug 2014 at 2:05
Addendum: the stack trace indicates that this might actually be a problem in
SegmenterBase, not in ClearNLP itself.
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.charAt(String.java:658)
at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.trim(SegmenterBase.java:262)
at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.createToken(SegmenterBase.java:239)
at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.createToken(SegmenterBase.java:228)
at de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter.process(ClearNlpSegmenter.java:102)
at de.tudarmstadt.ukp.dkpro.core.api.segmentation.SegmenterBase.process(SegmenterBase.java:124)
at de.tudarmstadt.ukp.dkpro.core.clearnlp.ClearNlpSegmenter.process(ClearNlpSegmenter.java:74)
at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
Original comment by torsten....@gmail.com
on 7 Aug 2014 at 2:11
The problem is in line 95 of ClearNlpSegmenter:
tBegin = aText.indexOf(token, tBegin);
This yields -1 because the call to the underlying segmenter returns a string in
which ’ has been changed to ', so the token cannot be found in the original text.
Any ideas on how we could catch that in the wrapper?
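One way to at least catch this in the wrapper is to check the result of indexOf before building a span, so that the failure names the offending token instead of surfacing later as a StringIndexOutOfBoundsException in trim. The following is only a minimal sketch: the class TokenSpanGuard and the method locate are hypothetical names for illustration, not part of DKPro Core or ClearNLP, and the token list stands in for whatever the underlying segmenter returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not DKPro Core API: locates tokens in the original text
// and fails fast when a token cannot be found (e.g. because it was normalized).
final class TokenSpanGuard {

    /** Returns one [begin, end) pair per token, in document order. */
    static List<int[]> locate(String text, List<String> tokens) {
        List<int[]> spans = new ArrayList<>();
        int cursor = 0;
        for (String token : tokens) {
            int begin = text.indexOf(token, cursor);
            if (begin < 0) {
                // indexOf returned -1: the segmenter output does not occur in the text.
                throw new IllegalStateException("Token [" + token
                        + "] not found in document text at or after offset " + cursor
                        + "; the segmenter may have altered the text");
            }
            int end = begin + token.length();
            spans.add(new int[] { begin, end });
            cursor = end; // continue searching after the token just found
        }
        return spans;
    }
}
```

This does not recover the token, it only turns the -1 into an explicit error at the point where it occurs.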
Original comment by torsten....@gmail.com
on 7 Aug 2014 at 2:20
The segmenter must not change the text ;)
Original comment by richard.eckart
on 7 Aug 2014 at 2:21
So I will file an issue upstream.
Original comment by torsten....@gmail.com
on 7 Aug 2014 at 2:23
Actually I don't understand how a segmenter can change the text. In those
instances that I see, trim is called like this:
trim(aJCas.getDocumentText(), span);
Original comment by richard.eckart
on 7 Aug 2014 at 2:30
The span in this call is already [-1,1], as the output from the underlying
segmenter could not be found in the original string (line 95).
Original comment by torsten....@gmail.com
on 7 Aug 2014 at 2:33
For the upstream tokenization code, it may be acceptable that the tokenizer
also normalizes. But if that is the case, you could consider this:
1. Create a normalizer component that uses the tokenizer only to update the text
   and which adds no segmentation.
2. Create a segmenter component that runs after the normalizer and which adds the
   token boundaries *assuming* that the tokenizer does not try to normalize
   already normalized text.
Theoretically, you could wrap both steps into one normalizer/segmenter
component, but... maybe not so nice.
If you care to go that way, drop me a note, because I have a nice
CAS-multiplier-based normalization base class lying around that still hasn't
made it into DKPro Core.
Otherwise, it would be good if the upstream tokenizer allowed some way of
getting the original offsets corresponding to the normalized tokens.
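Until the upstream tokenizer exposes original offsets, a wrapper could recover them itself in the special case where the normalization is length-preserving, as with ’ becoming '. The sketch below rests on assumptions: it folds a hand-picked set of typographic quotes one-to-one (the actual set of characters ClearNLP rewrites is not known here), and the names QuoteFoldingAligner, foldQuotes and align are hypothetical. Because every replacement is one character for one character, offsets found in the folded copy are also valid in the original text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps normalized tokens back to offsets in the original
// text, assuming the normalization only swaps single characters one-to-one.
final class QuoteFoldingAligner {

    /** Replaces typographic quotes with ASCII quotes; the length never changes. */
    static String foldQuotes(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u2018': // left single quotation mark
                case '\u2019': // right single quotation mark
                    sb.append('\'');
                    break;
                case '\u201C': // left double quotation mark
                case '\u201D': // right double quotation mark
                    sb.append('"');
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Returns one [begin, end) pair per token, valid in the original text. */
    static List<int[]> align(String original, List<String> normalizedTokens) {
        String folded = foldQuotes(original); // same length as original
        List<int[]> spans = new ArrayList<>();
        int cursor = 0;
        for (String token : normalizedTokens) {
            String key = foldQuotes(token); // harmless if already folded
            int begin = folded.indexOf(key, cursor);
            if (begin < 0) {
                throw new IllegalStateException("Token [" + token
                        + "] not found at or after offset " + cursor);
            }
            int end = begin + key.length();
            spans.add(new int[] { begin, end });
            cursor = end;
        }
        return spans;
    }
}
```

The original characters can then be read back with original.substring(begin, end), so the document text stays untouched. This approach breaks down as soon as the tokenizer inserts, deletes or expands characters, in which case real offset support upstream is needed.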
Original comment by richard.eckart
on 7 Aug 2014 at 2:38
I will probably not have time to go the long way with this issue in the near
future.
I filed an issue upstream.
Original comment by torsten....@gmail.com
on 7 Aug 2014 at 2:41
Original comment by richard.eckart
on 14 Aug 2014 at 10:08
https://github.com/clearnlp/clearnlp/issues/5
Original comment by richard.eckart
on 14 Aug 2014 at 10:09
Original issue reported on code.google.com by
torsten....@gmail.com
on 7 Aug 2014 at 2:03