Open Mikhail42 opened 6 years ago
Starting up takes a while. The question is: how long does it take to process a single long text? If you process many small texts, make sure you re-use the TreeTaggerWrapper and do not create a new one every time.
@reckart I don't create a new one every time. I know that startup takes time, but for 10000 words I wait more than 6 minutes before I get the lemmas! My code (I removed the logging):
`System.setProperty("treetagger.home", Util.basePath + "/TreeTagger");
TreeTaggerWrapper<String> tt = new TreeTaggerWrapper<>();
try {
    StringStringMap ssm = new StringStringMap();
    tt.setModel(modelPath + ":" + encoding);
    tt.setHandler((token, pos, lemma) -> {
        if (!token.equals(lemma)) ssm.put(new LightString(token), new LightString(lemma));
    });
    tt.process(words);
    logger.info("end process. try write result");
    ssm.write(Util.basePath + "lemmtt4j.ssm");
} finally {
    tt.destroy();
}`
Here StringStringMap is more efficient than HashMap<String, String>.
There is a large-volume unit test in DKPro Core which uses TT4J to run TreeTagger. On my machine, this test generates a document with 4249999 characters and 1250000 tokens, and it takes around 12 seconds to process it using a TreeTagger model for English.
Have you tried commenting out your `ssm` lines to see whether the slowness really comes from TreeTagger and not from your own code?
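One way to try this is to time the handler logic on its own, with TreeTagger out of the picture. The following is only a minimal sketch under stated assumptions: `Handler` and `timeHandler` are hypothetical names that mimic the three-argument lambda from the snippet, a plain `HashMap` stands in for `StringStringMap` (whose implementation is not shown in this thread), and the tokens and lemmas are synthetic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class HandlerTiming {
    /** Hypothetical stand-in with the same shape as the (token, pos, lemma) lambda. */
    interface Handler {
        void token(String token, String pos, String lemma);
    }

    /** Feeds every token to the handler and returns the elapsed wall-clock time in milliseconds. */
    static long timeHandler(List<String> tokens, Handler handler) {
        long start = System.nanoTime();
        for (String t : tokens) {
            // Synthetic "lemma": the lower-cased token. No tagger is involved here.
            handler.token(t, "X", t.toLowerCase(Locale.ROOT));
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            tokens.add("Word" + i);
        }

        // Assumption: HashMap replaces StringStringMap for illustration only.
        Map<String, String> map = new HashMap<>();
        long withMap = timeHandler(tokens, (tok, pos, lemma) -> {
            if (!tok.equals(lemma)) map.put(tok, lemma);
        });
        long noOp = timeHandler(tokens, (tok, pos, lemma) -> { });

        System.out.println("handler with map: " + withMap + " ms");
        System.out.println("no-op handler:    " + noOp + " ms");
    }
}
```

If the handler alone turns out to be fast for 10000 tokens, the time is more likely spent inside `tt.process(words)` or in writing the result, and those parts can be timed in the same way.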
I use TreeTaggerWrapper and the russian.par model, but I have to wait at least 3 seconds to find 100 lemmas! My PC can perform 2.7*10^9 operations per second, but I get only 100 words per 3 seconds!
Can I change something to speed it up? Sorry for my English.
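To put the figures quoted in this thread side by side, the small calculation below converts each one to tokens per second. It is just arithmetic on the numbers reported above, and `ThroughputCheck` and `tokensPerSecond` are illustrative names. The 10000-word run is taken as exactly 6 minutes, so its real rate is somewhat lower than computed here.

```java
public class ThroughputCheck {
    /** Tokens processed per second for a given token count and elapsed time in seconds. */
    static double tokensPerSecond(long tokens, double seconds) {
        return tokens / seconds;
    }

    public static void main(String[] args) {
        // Figures quoted in this thread.
        double reference = tokensPerSecond(1_250_000, 12.0);  // DKPro Core large-volume test
        double longRun   = tokensPerSecond(10_000, 6 * 60.0); // 10000 words, 6+ minutes
        double shortRun  = tokensPerSecond(100, 3.0);         // 100 lemmas, 3 seconds

        System.out.println("reference tokens/s: " + reference);
        System.out.println("long run tokens/s:  " + longRun);
        System.out.println("short run tokens/s: " + shortRun);
        System.out.println("reference / long run:  " + reference / longRun);
        System.out.println("reference / short run: " + reference / shortRun);
    }
}
```

The reference run works out to roughly 104000 tokens per second, against about 28 and 33 tokens per second for the two reports, a gap of more than three orders of magnitude. A difference that large suggests something other than raw tagging speed, such as repeated startup cost or the surrounding code.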