reckart / tt4j

TreeTagger for Java
http://reckart.github.io/tt4j/
Apache License 2.0

Running tt4j with other language configuration #27

Open BogdanMaier opened 7 years ago

BogdanMaier commented 7 years ago

Is there something implemented to run, for instance, the Romanian configuration of TT? I understand that we pass the TT source folder as a parameter, but how do we control all these attributes from the wrapper?

#!/bin/sh

# Set these paths appropriately

BIN=/Users/bogdan/private/resources/treetagger/bin
CMD=/Users/bogdan/private/resources/treetagger/cmd
LIB=/Users/bogdan/private/resources/treetagger/lib

OPTIONS="-token -lemma -sgml"

TOKENIZER=${CMD}/utf8-tokenize.perl
TAGGER=${BIN}/tree-tagger
ABBR_LIST=${LIB}/romanian-abbreviations
PARFILE=${LIB}/romanian-utf8.par

$TOKENIZER -r -a $ABBR_LIST $* | ${CMD}/split-romanian.perl ${LIB}/romanian-tokens | $TAGGER $OPTIONS $PARFILE

Thanks, Bogdan

reckart commented 7 years ago

TT4J only covers the invocation of the actual TreeTagger binary. The wrapper does not do tokenization, so you have to do that yourself. The relevant settings from your script are:

BIN=/Users/bogdan/private/resources/treetagger/bin
OPTIONS="-token -lemma -sgml"
TAGGER=${BIN}/tree-tagger
PARFILE=${LIB}/romanian-utf8.par
$TAGGER $OPTIONS $PARFILE

I have taken the example from the TT4J website and added comments showing where your options fit in. Note that your values cannot be copied over verbatim; I am only pointing to the related settings in TT4J.

package org.annolab.tt4j;

import static java.util.Arrays.asList;

public class Example {
  public static void main(String[] args) throws Exception {
    // Point TT4J to the TreeTagger installation directory. The executable is expected
    // in the "bin" subdirectory - in this example at "/opt/treetagger/bin/tree-tagger"
    System.setProperty("treetagger.home", "/opt/treetagger"); // <== options "BIN" and "TAGGER"
    TreeTaggerWrapper<String> tt = new TreeTaggerWrapper<String>();
    try {
      tt.setModel("/opt/treetagger/models/english.par:iso8859-1"); // <== option "PARFILE"
      tt.setHandler(new TokenHandler<String>() {
        public void token(String token, String pos, String lemma) {
          System.out.println(token + "\t" + pos + "\t" + lemma);
        }
      });
      tt.process(asList(new String[] { "This", "is", "a", "test", "." }));
    }
    finally {
      tt.destroy();
    }
  }
}

The options -quiet -no-unknown -sgml -token -lemma are enabled by default. If you want to change that, you can call tt.setArguments(...). Note that TT4J expects a certain output format from TT, so the options should include at least -quiet -sgml -token -lemma.
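As a rough illustration of that constraint, the sketch below builds an argument array that always starts with the four options TT4J needs and appends any extras. The class TaggerArguments and the method withRequired are hypothetical names, not part of TT4J, and the commented-out call assumes that setArguments accepts a String[].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of TT4J.
class TaggerArguments {
    // The minimum set of options named above that TT4J relies on.
    static final String[] REQUIRED = { "-quiet", "-sgml", "-token", "-lemma" };

    // Returns the required options followed by the extras, skipping any
    // extra that merely repeats a required option.
    static String[] withRequired(String... extras) {
        List<String> required = Arrays.asList(REQUIRED);
        List<String> result = new ArrayList<>(required);
        for (String extra : extras) {
            if (!required.contains(extra)) {
                result.add(extra);
            }
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Same set as the defaults listed above, in a different order.
        String[] custom = withRequired("-no-unknown");
        System.out.println(Arrays.toString(custom));
        // With TT4J on the classpath this could then be handed to the wrapper,
        // assuming setArguments takes a String[]:
        // tt.setArguments(custom);
    }
}
```

Calling withRequired() with no extras would drop -no-unknown while keeping the options TT4J depends on.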

See also: https://reckart.github.io/tt4j/usage.html
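Since tokenization is left to the caller, here is a minimal sketch of a tokenizer whose output could be passed to tt.process(...). SimpleTokenizer is a hypothetical class, not part of TT4J, and it does not reproduce what utf8-tokenize.perl or split-romanian.perl do: it knows nothing about abbreviation lists or Romanian clitics and only separates letter/digit runs from punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of TT4J.
class SimpleTokenizer {
    // A token is either a run of Unicode letters/digits
    // or a single non-whitespace character (e.g. punctuation).
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+|\\S");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [This, is, a, test, .]
        System.out.println(tokenize("This is a test."));
    }
}
```

The resulting list can replace the hard-coded asList(...) call in the example above, e.g. tt.process(SimpleTokenizer.tokenize(text)). For tagging quality comparable to the shell script, the tokenization should probably match what the Romanian model was trained on.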

BogdanMaier commented 7 years ago

Thanks for the prompt response :)