BreakIterator annotator

noushadali / cleartk

Automatically exported from code.google.com/p/cleartk

I am building a UIMA wrapper for the java.text.BreakIterator class. This will provide a simple way to create a tokenizer and sentence segmenter without having to load any models or depend on the OpenNLP sentence segmenter when sentence segmentation is only needed for testing purposes.

I have committed org.cleartk.token.breakit.BreakIteratorAnnotator, which works 
for the word break iterator and sentence break iterator for a user-specified 
locale and annotation type.  I was slightly lazy and did not adjust the 
indexes the sentence break iterator produces.  It splits the text up such that 
every character of the provided text falls inside some sentence, e.g. trailing 
white space is included in the preceding sentence.  
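
For anyone unfamiliar with the boundary behaviour described above, here is a minimal sketch using only java.text.BreakIterator. It is not the BreakIteratorAnnotator code itself: the class name BreakIteratorSketch and the helpers sentenceSpans and trimTrailingWhitespace are hypothetical names made up for illustration, and no UIMA types are involved. It shows that consecutive sentence boundaries cover the whole text, and what trimming the trailing white space could look like if someone wanted tighter offsets.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: hypothetical class, not part of cleartk-token.
public class BreakIteratorSketch {

    /** Returns {begin, end} offset pairs exactly as the sentence iterator reports them. */
    static List<int[]> sentenceSpans(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<int[]> spans = new ArrayList<>();
        int begin = it.first();
        // Each boundary is both the end of one span and the start of the next,
        // so the spans are contiguous and cover the whole text.
        for (int end = it.next(); end != BreakIterator.DONE; begin = end, end = it.next()) {
            spans.add(new int[] {begin, end});
        }
        return spans;
    }

    /** Optional post-processing (an assumption, not done by the annotator): drop trailing white space. */
    static int[] trimTrailingWhitespace(String text, int[] span) {
        int end = span[1];
        while (end > span[0] && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {span[0], end};
    }

    public static void main(String[] args) {
        String text = "Hello world. How are you? ";
        for (int[] span : sentenceSpans(text, Locale.US)) {
            int[] trimmed = trimTrailingWhitespace(text, span);
            System.out.println("raw=[" + span[0] + "," + span[1] + ") trimmed=["
                    + trimmed[0] + "," + trimmed[1] + ")");
        }
    }
}
```

The word iterator can be walked the same way by swapping in BreakIterator.getWordInstance(locale); note that it also reports the white space between words as spans of their own.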

I didn't bother with the Character or Line break iterators.  If you need 
support for these, then submit a new issue.  

The code is committed to the CleartkProjectReOrg branch, in the cleartk-token 
project, and will appear in trunk after this branch is merged.

Original comment by pvogren@gmail.com on 29 Dec 2010 at 12:13

Changed state: Fixed

BreakIterator annotator #174