noushadali / cleartk

Automatically exported from code.google.com/p/cleartk

BreakIterator annotator #174

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I am building a UIMA wrapper for the java.text.BreakIterator class.  This will 
provide a simple way to create a tokenizer and sentence segmenter without 
having to load any models or take a dependency on the OpenNLP sentence 
segmenter when sentence segmentation is only needed for testing purposes.  
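
For illustration only, here is a minimal sketch of model-free tokenization using nothing but the JDK's java.text.BreakIterator. The class name WordBreakSketch and the method tokenize are hypothetical and are not part of ClearTK. Skipping whitespace-only segments is an assumption about what a tokenizer would want, not a description of the annotator.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of ClearTK: tokenizes with the JDK word break iterator.
final class WordBreakSketch {

  // Returns every non-whitespace segment found between consecutive word boundaries.
  static List<String> tokenize(String text, Locale locale) {
    BreakIterator words = BreakIterator.getWordInstance(locale);
    words.setText(text);
    List<String> tokens = new ArrayList<>();
    int begin = words.first();
    for (int end = words.next(); end != BreakIterator.DONE; begin = end, end = words.next()) {
      String segment = text.substring(begin, end);
      // The iterator also reports runs of white space as segments; skip those (assumption).
      if (!segment.trim().isEmpty()) {
        tokens.add(segment);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("Hello, world!", Locale.US));
  }
}
```

No model files or third-party jars are involved. The only inputs are the text and a Locale.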

Original issue reported on code.google.com by pvogren@gmail.com on 28 Dec 2010 at 9:44

GoogleCodeExporter commented 9 years ago
I have committed org.cleartk.token.breakit.BreakIteratorAnnotator, which 
supports the word break iterator and the sentence break iterator for a 
user-specified locale and annotation type.  I was slightly lazy and didn't 
adjust the indexes the sentence break iterator produces.  It splits the text so 
that every character of the provided text falls inside some sentence, e.g. 
trailing white space is included in the preceding sentence.  
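
The index behaviour described above can be reproduced with the JDK alone. The following is a rough sketch and not the committed annotator code: SentenceSpanSketch, sentenceSpans, and trimTrailingWhitespace are hypothetical names. The trimming step is an optional adjustment that the annotator, as described, does not perform.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of ClearTK: sentence offsets from the JDK break iterator.
final class SentenceSpanSketch {

  // Returns {begin, end} offsets exactly as the sentence break iterator reports them.
  // Consecutive spans are contiguous, so trailing white space stays in the preceding span.
  static List<int[]> sentenceSpans(String text, Locale locale) {
    BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
    sentences.setText(text);
    List<int[]> spans = new ArrayList<>();
    int begin = sentences.first();
    for (int end = sentences.next(); end != BreakIterator.DONE; begin = end, end = sentences.next()) {
      spans.add(new int[] {begin, end});
    }
    return spans;
  }

  // Optional adjustment: move the end offset back past any trailing white space.
  static int[] trimTrailingWhitespace(String text, int[] span) {
    int end = span[1];
    while (end > span[0] && Character.isWhitespace(text.charAt(end - 1))) {
      end--;
    }
    return new int[] {span[0], end};
  }

  public static void main(String[] args) {
    String text = "First sentence.  Second sentence. ";
    for (int[] span : sentenceSpans(text, Locale.US)) {
      int[] trimmed = trimTrailingWhitespace(text, span);
      System.out.println("[" + text.substring(span[0], span[1]) + "] -> ["
          + text.substring(trimmed[0], trimmed[1]) + "]");
    }
  }
}
```

Callers that need sentence annotations without trailing white space could apply something like trimTrailingWhitespace to each span before creating the annotation.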

I didn't bother with the Character or Line break iterators.  If you need 
support for these, then submit a new issue.  

The code is committed to the CleartkProjectReOrg branch, in the cleartk-token 
project, and will appear in trunk after this branch is merged.  

Original comment by pvogren@gmail.com on 29 Dec 2010 at 12:13