reckart / tt4j

TreeTagger for Java
http://reckart.github.io/tt4j/
Apache License 2.0

Very, very long... How can I increase the speed? #29

Open Mikhail42 opened 6 years ago

Mikhail42 commented 6 years ago

I use TreeTaggerWrapper with the russian.par model, but I have to wait at least 3 seconds to find 100 lemmas! My PC can calculate 2.7*10^9 operations per second, but I only get 100 words per 3 seconds!

Can I change something to speed it up? Sorry for my English.

reckart commented 6 years ago

Starting up takes a while. The question is: how long does it take to process a single long text? If you process many small texts, make sure you re-use the TreeTaggerWrapper and do not create a new one every time.

Mikhail42 commented 6 years ago

@reckart I don't create a new one every time. I know that startup takes time, but for 10000 words I wait more than 6 minutes before I get the lemmas! My code (I removed the logging):

    System.setProperty("treetagger.home", Util.basePath + "/TreeTagger");
    TreeTaggerWrapper<String> tt = new TreeTaggerWrapper<>();
    try {
        StringStringMap ssm = new StringStringMap();
        tt.setModel(modelPath + ":" + encoding);
        tt.setHandler((token, pos, lemma) -> {
            if (!token.equals(lemma)) ssm.put(new LightString(token), new LightString(lemma));
        });
        tt.process(words);
        logger.info("end process. try write result");
        ssm.write(Util.basePath + "lemmtt4j.ssm");
    } finally {
        tt.destroy();
    }

Here StringStringMap is more efficient than HashMap<String, String>.

reckart commented 6 years ago

There is a large-volume unit test in DKPro Core which uses TT4J to run TreeTagger. On my machine, this test generates a document with 4249999 characters and 1250000 tokens, and it takes around 12 seconds to process it using a TreeTagger model for English.

Have you tried commenting out your ssm lines to see whether the slowness really comes from TreeTagger and not from your own code?
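
One way to run that check is to time the same workload twice: once with a handler that does nothing, and once with the handler that stores the results. The sketch below is only an illustration under stated assumptions. `timeMillis` and `feed` are hypothetical helpers, not part of TT4J, and `feed` is a synthetic stand-in for the real `tt.process(words)` call so that the example runs without TreeTagger installed. A plain `HashMap` stands in for `StringStringMap`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

public class HandlerTiming {
    // Hypothetical helper: runs the task once and returns elapsed wall-clock milliseconds.
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Synthetic stand-in for the tagger: passes n (token, lemma) pairs to the handler.
    // In the real program this call would be tt.process(words).
    public static void feed(int n, BiConsumer<String, String> handler) {
        for (int i = 0; i < n; i++) {
            handler.accept("token" + i, "lemma" + i);
        }
    }

    public static void main(String[] args) {
        int n = 10_000;

        // Run 1: the handler does nothing, so this measures the producer alone.
        long baseline = timeMillis(() -> feed(n, (token, lemma) -> { }));

        // Run 2: the handler stores pairs, mirroring the logic of the posted code.
        Map<String, String> store = new HashMap<>();
        long withStore = timeMillis(() -> feed(n, (token, lemma) -> {
            if (!token.equals(lemma)) store.put(token, lemma);
        }));

        System.out.println("no-op handler, ms:   " + baseline);
        System.out.println("storing handler, ms: " + withStore);
        System.out.println("stored pairs:        " + store.size());
    }
}
```

In the real program, if the two timings are close, the time is being spent in TreeTagger or in the communication with it. If the second run is much slower, the storing code is the place to look.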