reckart / tt4j

TreeTagger for Java
http://reckart.github.io/tt4j/
Apache License 2.0

Problem with p=0.5 in parameter file #30

Closed. berndmoos closed this issue 6 years ago

berndmoos commented 6 years ago

For two texts A and B and two parameter files X and Y, my application of tt4j runs fine for A-X, A-Y and B-X, but fails for B-Y. As far as I can make out, the harmless-looking word "seid" is the culprit: if I remove or change it, B-Y runs fine, too. So something about that word seems to be special in parameter file Y.

I found that the method public void probability(String pos, String lemma, double probability) in my implementation of org.annolab.tt4j.ProbabilityHandler<String> is called twice when that form is encountered, once with {lemma=sein, pos=VAIMP, p=0.5}, and once again with {lemma=sein, pos=VAFIN, p=0.5}.

My suspicion therefore is that parameter file entries with p=0.5 cause that behaviour because for p=0.5, there will be two candidates at the top of the list. Is there some way that I can tell tt4j to just (arbitrarily) pick one of the two possibilities and call probability() just once?

Any help would be appreciated.

reckart commented 6 years ago

probability() is called for every probability that TreeTagger returns, in the order they are returned, and these calls happen in between calls to token(). So what you can do is, e.g., set a flag in your code when token() is called. When probability() is then called while that flag is set, store the values passed to it and clear the flag, so that values passed in further calls to probability() for the same token are ignored in your code.
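A minimal sketch of that flag approach might look like the following. It is only an illustration under a few assumptions: the nested Handler interface is a local stand-in that mirrors the token() and probability() signatures mentioned in this thread, so that the snippet compiles without tt4j on the classpath (in real code you would implement org.annolab.tt4j.ProbabilityHandler<String> instead). The class names, the kept list, and the values passed in main() are made up for the example, and it assumes token() is called before the probability() calls that belong to it.

```java
import java.util.ArrayList;
import java.util.List;

class FirstProbabilityOnly {

    // Hypothetical stand-in for the tt4j handler interface; only the two
    // callbacks discussed in this thread are mirrored here.
    interface Handler<O> {
        void token(O token, String pos, String lemma);
        void probability(String pos, String lemma, double probability);
    }

    static class FirstOnlyHandler implements Handler<String> {
        // Set by token(), cleared by the first probability() that follows it.
        private boolean awaitingFirst = false;

        // One entry per token: the first reading reported for it.
        final List<String> kept = new ArrayList<>();

        @Override
        public void token(String token, String pos, String lemma) {
            awaitingFirst = true;
        }

        @Override
        public void probability(String pos, String lemma, double probability) {
            if (!awaitingFirst) {
                return; // a later candidate for the same token: ignore it
            }
            awaitingFirst = false;
            kept.add(lemma + "/" + pos + "/" + probability);
        }
    }

    public static void main(String[] args) {
        FirstOnlyHandler handler = new FirstOnlyHandler();
        // Simulate the callbacks reported for "seid" (arguments to token() are invented).
        handler.token("seid", "VAIMP", "sein");
        handler.probability("VAIMP", "sein", 0.5);
        handler.probability("VAFIN", "sein", 0.5);
        System.out.println(handler.kept); // prints [sein/VAIMP/0.5]
    }
}
```

Which of the two tied readings survives depends only on the order in which they are reported, so the choice is arbitrary in the sense asked for above.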

berndmoos commented 6 years ago

Thanks so much for the explanation. Setting the flag as you suggest seems to do the trick.