integration of TreeTagger's tokenizer script

reckart commented 9 years ago

Original issue 2 created by reckart on 2010-09-03T11:05:21.000Z:

For tt4j to be ready to use as-is, one feature is missing: the integration of a tokenizer.

tt4j does a great job wrapping TT in Java, but it covers only the tree-tagger executable.

To use it on raw text, one needs a tokenizer, because tt4j takes a list of tokens (a List<Token>) as input.

TreeTagger's distribution includes two scripts, tokenize.perl and utf8-tokenize.perl, which do a decent job as tokenizers.

Moreover, the logic tt4j uses to talk to TT could probably be reused for the Perl scripts.

Could such a development be on the roadmap of tt4j?

If not, what do you propose as a workaround?

Thanks,

reckart commented 9 years ago

Comment #1 originally posted by reckart on 2010-09-13T20:51:05.000Z:

Java comes with the BreakIterator class, which works as a tokenizer or a sentence splitter, depending on which factory method you use to obtain it (BreakIterator.getWordInstance() or BreakIterator.getSentenceInstance()).
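
A minimal sketch of that workaround, using only the JDK. The class name BreakIteratorTokenizer and the tokenize helper are hypothetical names chosen for illustration, not part of tt4j. The sketch walks the word boundaries reported by BreakIterator and drops whitespace-only segments. Its output may differ from what TreeTagger's Perl scripts produce (for example around abbreviations or clitics), so treat it as a starting point rather than a drop-in replacement.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of tt4j.
public class BreakIteratorTokenizer {

    // Splits text at the word boundaries BreakIterator reports for the
    // given locale and keeps every segment that is not pure whitespace.
    public static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (!segment.trim().isEmpty()) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print one token per line.
        for (String token : tokenize("Hello world.", Locale.ENGLISH)) {
            System.out.println(token);
        }
    }
}
```

The resulting list of strings could then be handed to tt4j as its token input; swapping getWordInstance for getSentenceInstance would give sentence segments instead.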

reckart commented 9 years ago

Comment #2 originally posted by reckart on 2010-09-13T21:15:29.000Z:

You may find the SimpleTokenizer useful.
