Closed: reckart closed this issue 9 years ago
Original issue 2 created by reckart on 2010-09-03T11:05:21.000Z:
For tt4j to be ready to use as-is, one feature is missing: an integrated tokenizer.
tt4j does a great job of wrapping TT in Java, but it wraps only the tree-tagger executable.
To use it on raw text, one needs a tokenizer, because the input of tt4j is a List<Token>.
TreeTagger's distribution includes two scripts, tokenize.perl and utf8-tokenize.perl, which do a decent job as tokenizers.
Moreover, the logic tt4j uses to talk to TT could probably be reused to drive the Perl scripts.
Could such a development be on the roadmap of tt4j?
If not, what do you propose as a workaround?
Thanks,
Comment #1 originally posted by reckart on 2010-09-13T20:51:05.000Z:
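Java comes with the BreakIterator class, which works as a tokenizer or as a sentence splitter, depending on which factory method you use to obtain the instance (BreakIterator.getWordInstance() or BreakIterator.getSentenceInstance()).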
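As a rough illustration of that suggestion, here is a minimal sketch of a word tokenizer built on java.text.BreakIterator. The class name BreakIteratorTokenizer and the tokenize method are made up for this example and are not part of tt4j. It also does not try to reproduce what tokenize.perl does (abbreviations, clitics and so on), so the token boundaries may differ from the ones TreeTagger's models expect.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of tt4j: splits text into word and
// punctuation tokens using the JDK's BreakIterator.
public class BreakIteratorTokenizer {

    public static List<String> tokenize(String text, Locale locale) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        // Walk over consecutive boundaries; each [start, end) span is either
        // a word, a punctuation mark, or a run of whitespace.
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String candidate = text.substring(start, end).trim();
            // Skip spans that contain nothing but whitespace.
            if (!candidate.isEmpty()) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [This, is, a, test, .]
        System.out.println(tokenize("This is a test.", Locale.ENGLISH));
    }
}
```

The resulting list of strings could then be handed to the tagger in place of the output of the Perl scripts. Swapping getWordInstance for getSentenceInstance would give sentence spans instead of tokens.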
Comment #2 originally posted by reckart on 2010-09-13T21:15:29.000Z:
You may find the SimpleTokenizer useful.