integration of TreeTagger's tokenizer script

GoogleCodeExporter commented 9 years ago

For tt4j to be ready to use as-is, one feature is missing: the 
integration of a tokenizer.

tt4j does a great job wrapping TT in Java, though it is restricted to the 
tree-tagger executable only.

To use it on raw text, one needs a tokenizer, because the input of 
tt4j is a List<Token>.

TreeTagger's distribution includes two scripts, tokenize.perl and 
utf8-tokenize.perl, which do a decent job as tokenizers.

Moreover, the logic tt4j uses to talk to TT could probably be reused 
for the Perl scripts.

Could such a development be on the roadmap of tt4j?

If not, what do you propose as a workaround?

Thanks,

Original issue reported on code.google.com by oddsk...@gmail.com on 3 Sep 2010 at 11:05

GoogleCodeExporter commented 9 years ago

Java comes with the BreakIterator class, which works as a tokenizer or a 
sentence splitter, depending on which factory method you use to obtain the 
instance (getWordInstance or getSentenceInstance).
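
A minimal sketch of that approach, using only java.text.BreakIterator from the JDK. The class name BreakIteratorTokenizer and the tokenize method are hypothetical names chosen for illustration; they are not part of tt4j. The sketch returns plain strings and skips whitespace-only segments. Its splitting rules differ from those of TreeTagger's Perl scripts (for example in how abbreviations and clitics are handled), so tagging results may differ as well.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of tt4j: splits text into word and
// punctuation tokens with the JDK's BreakIterator.
public class BreakIteratorTokenizer {

    public static List<String> tokenize(String text, Locale locale) {
        // getWordInstance yields word boundaries; getSentenceInstance
        // would yield sentence boundaries instead.
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next();
             end != BreakIterator.DONE;
             start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            // The iterator also reports the whitespace between words; skip it.
            if (!candidate.trim().isEmpty()) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print one token per line.
        for (String token : tokenize("This is a test.", Locale.ENGLISH)) {
            System.out.println(token);
        }
    }
}
```

The resulting list of strings could then be handed to tt4j as the token list, assuming the wrapper is parameterized with String as its token type.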

Original comment by richard.eckart on 13 Sep 2010 at 8:51

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

You may find the SimpleTokenizer useful.

Original comment by richard.eckart on 13 Sep 2010 at 9:15

tema16 / tt4j

integration of TreeTagger's tokenizer script #2