tema16 / tt4j

Automatically exported from code.google.com/p/tt4j

integration of TreeTagger's tokenizer script #2

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
For tt4j to be usable as-is, one feature is missing: the integration of a 
tokenizer.

tt4j does a great job wrapping TT in Java, though it is restricted to the 
tree-tagger executable only.

To use it on textual data, one needs a tokenizer, since the input of tt4j is a 
List<Token>.

TreeTagger's distribution includes two scripts, tokenize.perl and 
utf8-tokenize.perl, which do a decent job as tokenizers.

Moreover, the logic tt4j uses to talk to TT could probably be reused for the 
Perl scripts.

Could such a development be on the roadmap of tt4j?

If not, what do you propose as a workaround?

Thanks,  

Original issue reported on code.google.com by oddsk...@gmail.com on 3 Sep 2010 at 11:05

GoogleCodeExporter commented 9 years ago
Java comes with the BreakIterator class, which can act as a tokenizer or a 
sentence splitter, depending on which factory method you use to obtain an 
instance (getWordInstance or getSentenceInstance).
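
A minimal sketch of the BreakIterator approach, for illustration only. The class name BreakIteratorTokenizer and the choice to drop whitespace-only segments are assumptions made here, not part of tt4j or TreeTagger, and the resulting tokens will not necessarily match what tokenize.perl produces.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of tt4j.
public class BreakIteratorTokenizer {

    // Splits text at the word boundaries reported by BreakIterator and
    // keeps every segment that is not pure whitespace.
    public static List<String> tokenize(String text, Locale locale) {
        List<String> tokens = new ArrayList<String>();
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Assumption: whitespace between words is not a token.
            if (!segment.trim().isEmpty()) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print one token per line.
        for (String token : tokenize("Hello, world!", Locale.ENGLISH)) {
            System.out.println(token);
        }
    }
}
```

The returned list of strings could then be passed on to the tagger. Swapping getWordInstance for getSentenceInstance would yield sentence-sized segments instead of words.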

Original comment by richard.eckart on 13 Sep 2010 at 8:51

GoogleCodeExporter commented 9 years ago
You may find the SimpleTokenizer useful.

Original comment by richard.eckart on 13 Sep 2010 at 9:15