reckart / tt4j

TreeTagger for Java
http://reckart.github.io/tt4j/
Apache License 2.0

TT4J hangs with newer versions of TreeTagger #35

Open berndmoos opened 2 years ago

berndmoos commented 2 years ago

With otherwise identical code, my tagging process will hang when used with the latest version of TreeTagger (this is on Windows, but there are hints that the problem occurs on Mac OS, too). This is irrespective of the parameter file and the input used.

Here is a test case:

package org.exmaralda.tagging;

import java.util.Arrays;
import java.util.List;
import org.annolab.tt4j.TokenHandler;
import org.annolab.tt4j.TreeTaggerException;
import org.annolab.tt4j.TreeTaggerWrapper;

/**
 *
 * @author thomas.schmidt
 */
public class TestTT4J {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        new TestTT4J().doit();
    }

    // 04-11-2021, issue #286 : this one works
    //public static String TTC = "D:\\Dropbox\\TreeTagger";
    // 04-11-2021, issue #286 : and this one doesn't
    public static String TTC = "c:\\TreeTagger";
    public String parameterFile = "C:\\TreeTagger\\lib\\italian.par";
    public String[] options = {"-token","-lemma","-sgml","-no-unknown"};
    String parameterFileEncoding = "UTF-8"; 

    private void doit() {
         System.out.println("Setting up tagger");
         System.setProperty("treetagger.home", TTC);
         TreeTaggerWrapper<String> tt = new TreeTaggerWrapper<String>();
         // TRACE = true makes TreeTaggerWrapper verbose (comment out the next line to silence it)
         tt.TRACE = true;
         tt.setProbabilityThreshold(0.999999);
         TokenHandler<String> tokenHandler = new TokenHandler<String>(){
            public void token(String token, String pos, String lemma) {
                // do nothing
            }
         };
         try {
             System.out.println("   Setting model");
             tt.setModel(parameterFile + ":" + parameterFileEncoding);
             System.out.println("   Setting arguments");
             tt.setArguments(options);
             System.out.println("   Setting handler");
             tt.setHandler(tokenHandler);
             System.out.println("Tagger setup complete");

             String[] tokenArray = {"uno", "due", "tre"};
             List<String> tokens = Arrays.asList(tokenArray);
             tt.process(tokens);
             System.out.println("Tagging complete.");
         } catch (TreeTaggerException ex) {
            ex.printStackTrace();
         } catch (Exception ex){
             ex.printStackTrace();
         }
    }

}

With a newer version of TreeTagger, this will give me:

Setting up tagger
   Setting model
   Setting arguments
   Setting handler
Tagger setup complete
[org.annolab.tt4j.TreeTaggerWrapper@5197848c|TRACE] Invoking TreeTagger [c:\TreeTagger\bin\tree-tagger.exe -token -lemma -sgml -no-unknown -prob -threshold 0.999999000000 C:\TreeTagger\lib\italian.par]

... and then it hangs. With an older version of TreeTagger, I get:

Setting up tagger
   Setting model
   Setting arguments
   Setting handler
Tagger setup complete
[org.annolab.tt4j.TreeTaggerWrapper@5197848c|TRACE] Invoking TreeTagger [D:\Dropbox\TreeTagger\bin\tree-tagger.exe -token -lemma -sgml -no-unknown -prob -threshold 0.999999000000 C:\TreeTagger\lib\italian.par]
[org.annolab.tt4j.TreeTaggerWrapper@5197848c|TRACE] (0) START [<This-is-the-start-of-the-text />]
[org.annolab.tt4j.TreeTaggerWrapper@5197848c|TRACE] (1) IN [uno] -- OUT: [uno] -- POS: [DET:indef] -- LEMMA: [uno] -- PROBABILITY: [0.968146]
[org.annolab.tt4j.TreeTaggerWrapper@5197848c|TRACE] (2) IN [due] -- OUT: [due] -- POS: [ADJ] -- LEMMA: [due] -- PROBABILITY: [0.999754]
[org.annolab.tt4j.TreeTaggerWrapper@5197848c|TRACE] (3) IN [tre] -- OUT: [tre] -- POS: [ADJ] -- LEMMA: [tre] -- PROBABILITY: [0.988320]
[org.annolab.tt4j.TreeTaggerWrapper@5197848c|TRACE] (3) COMPLETE [<This-is-the-end-of-the-text />]
Tagging complete.

berndmoos commented 2 years ago

Is there anything I can do in the configuration to fix this? I am using OpenJDK 14. Any help would be much appreciated. The feature relying on TT4J is used a lot in EXMARaLDA, and at the moment we can only refer people to the older TreeTagger version.

berndmoos commented 2 years ago

See also https://github.com/Exmaralda-Org/exmaralda/issues/286

reckart commented 2 years ago

What happens if you enter the command that TT4J generates directly on the command line?

Have you asked the TreeTagger maintainer about changes in the way that TT handles stdin, stdout, and stderr?

berndmoos commented 2 years ago

> What happens if you enter the command that TT4J generates directly on the command line?

With the command only, it hangs, too, because there is no input. When I pass it a text file as input, I get:

C:\Users\thomas.schmidt>D:\Dropbox\TreeTagger\bin\tree-tagger.exe -token -lemma -sgml -no-unknown -prob -threshold 0.999999000000 C:\TreeTagger\lib\italian.par C:\Users\thomas.schmidt\Desktop\123.txt
        reading parameters ...
        tagging ...
une     NOM une 0.736251
due     ADJ due 0.999291
tre     ADJ tre 0.981432
         finished.

> Have you asked the TreeTagger maintainer about changes in the way that TT handles stdin, stdout, and stderr?

No, but I will now ;-)

reckart commented 2 years ago

When you try on the command line, try adding the <This-is-the-start-of-the-text /> and <This-is-the-end-of-the-text /> pseudo-elements. After the end pseudo-element, try entering something like "This is a sentence ." (one token per line, in the language of the model). Do not close the stdin stream. Does it generate output?

TT may only generate output when the input stream is closed or after it has seen a certain amount of data. TT4J uses a model-specific "flush sequence" to convince TT to generate its output. Maybe something in relation to that has changed and TT4J is no longer able to convince TT to start generating.
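To make the manual test easier to reproduce, here is a minimal sketch of the input described above: start marker, one token per line, end marker, then a flush sequence. The marker strings and the flush sequence are the ones quoted in this thread; the class and method names are hypothetical, and the ordering is my reading of the description above rather than something copied from the TT4J source.

```java
import java.util.List;

public class FlushPayloadSketch {
    // Marker elements and default flush sequence as quoted in this thread.
    static final String START = "<This-is-the-start-of-the-text />";
    static final String END = "<This-is-the-end-of-the-text />";
    static final String FLUSH = "\n.\n.\n.\n.\n.\n(\n)\n.\n.\n.\n.\n";

    /** Builds one-token-per-line input: start marker, tokens, end marker, flush sequence. */
    static String buildPayload(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        sb.append(START).append('\n');
        for (String token : tokens) {
            sb.append(token).append('\n');
        }
        sb.append(END);
        // The flush sequence begins with '\n', which also terminates the END line.
        sb.append(FLUSH);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print the payload so it can be pasted or piped into tree-tagger by hand.
        System.out.print(buildPayload(List.of("uno", "due", "tre")));
    }
}
```

Typing or piping this text into the tree-tagger command from the trace output, while keeping stdin open, should show whether TT emits anything before the stream is closed.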

You could also experiment in the code with a longer flush sequence.

You could try to add some printf debugging into the Reader and Writer inner classes in TreeTaggerWrapper to see in more detail how much data and which data TT4J sends to / receives from TT.
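If editing the inner classes is inconvenient, a different technique that gives similar information in a standalone experiment is to wrap the stream that feeds the process and log every line as it is sent. The sketch below is not part of TT4J; `LoggingOutputStreamSketch` and `sendAll` are hypothetical names, and it assumes UTF-8 input.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LoggingOutputStreamSketch extends FilterOutputStream {
    private final ByteArrayOutputStream lineBuffer = new ByteArrayOutputStream();
    private long bytesSent = 0;

    public LoggingOutputStreamSketch(OutputStream target) {
        super(target);
    }

    // FilterOutputStream routes the array-based write methods through write(int),
    // so overriding this one method sees every byte.
    @Override
    public void write(int b) throws IOException {
        out.write(b);
        bytesSent++;
        if (b == '\n') {
            String line = new String(lineBuffer.toByteArray(), StandardCharsets.UTF_8);
            System.err.println("[sent, total " + bytesSent + " bytes] " + line);
            lineBuffer.reset();
        } else {
            lineBuffer.write(b);
        }
    }

    public long getBytesSent() {
        return bytesSent;
    }

    /** Convenience helper: sends text through a logging wrapper and returns the byte count. */
    static long sendAll(String text, OutputStream target) {
        LoggingOutputStreamSketch log = new LoggingOutputStreamSketch(target);
        try {
            log.write(text.getBytes(StandardCharsets.UTF_8));
            log.flush(); // flushes the target but deliberately does not close it
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log.getBytesSent();
    }

    public static void main(String[] args) {
        // Stand-in for a process's stdin; in an experiment this would be
        // the stream returned by Process.getOutputStream().
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        System.out.println(sendAll("uno\ndue\ntre\n", sink) + " bytes sent");
    }
}
```

The same idea applies on the reading side with a `FilterInputStream`, which would show whether TT returns anything at all before the hang.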

berndmoos commented 2 years ago

Thanks for the explanation.

Not sure I get the instructions right, but when I prepend the pseudo tags inside the input text file, I still get the expected output on the command line. Using a much longer input array in the above Java code doesn't change anything. Let's see if the TT developer has a hint...

reckart commented 2 years ago

@berndmoos Have you been able to fix your situation?

berndmoos commented 2 years ago

> @berndmoos Have you been able to fix your situation?

Haven't found the time yet, but will ... sometime ... soon ... hopefully before next year.

berndmoos commented 2 years ago

Just a (non-)update:

I'm afraid I will have to give up on this. I can provide Windows users with an older version of TreeTagger. For Mac users, no older versions seem to be available for download. :-(

berndmoos commented 2 years ago

For the record: HS suggests using a longer flush sequence. Right now, the flush sequence is defined here:

https://github.com/reckart/tt4j/blob/2a4fa8280b6fe60b72426bdac10794b5feb5d39f/src/main/java/org/annolab/tt4j/DefaultModel.java#L28

public static final String DEFAULT_FLUSH_SEQUENCE = "\n.\n.\n.\n.\n.\n(\n)\n.\n.\n.\n.\n";

Will experiment with this if and when I find the time.
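One way to get a longer sequence for such an experiment is to repeat the default one. The sketch below only builds the string; `LongerFlushSketch` and `lengthen` are hypothetical names, and how the result gets plugged into TT4J is left open here. It drops the leading newline on every repeat so that no empty line appears between copies, which is a cautious choice on my part, since I have not checked how TT treats empty lines.

```java
public class LongerFlushSketch {
    // Copied from DefaultModel.DEFAULT_FLUSH_SEQUENCE as quoted above.
    static final String DEFAULT_FLUSH_SEQUENCE = "\n.\n.\n.\n.\n.\n(\n)\n.\n.\n.\n.\n";

    /** Returns the flush sequence repeated the given number of times, without empty lines. */
    static String lengthen(String flush, int times) {
        if (times < 1) {
            throw new IllegalArgumentException("times must be >= 1");
        }
        // The sequence starts and ends with '\n'; skip the leading one on repeats
        // so that two copies are not separated by an empty line.
        String tail = flush.startsWith("\n") ? flush.substring(1) : flush;
        StringBuilder sb = new StringBuilder(flush);
        for (int i = 1; i < times; i++) {
            sb.append(tail);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String longer = lengthen(DEFAULT_FLUSH_SEQUENCE, 4);
        System.out.println("default: " + DEFAULT_FLUSH_SEQUENCE.length()
                + " chars, lengthened: " + longer.length() + " chars");
    }
}
```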

berndmoos commented 1 year ago

Found the culprit at last:

tt.setProbabilityThreshold(0.999999);

If I do not set the probability threshold, TreeTagger no longer hangs. Setting the probability threshold to a different value (tried 0.9 / 1.0 / 0.5) makes no difference: it still hangs.

reckart commented 1 year ago

Interesting. The question is whether it is a TT4J bug or a TreeTagger bug... any hunch?

berndmoos commented 1 year ago

TreeTagger hanging in the context of probabilities is also mentioned in #13. Otherwise, no hunch. The reason I am setting the probability threshold is that I want to process probabilities as described in #13. That, however, is not central to most of my applications. For the time being, I will remove those lines and set the probability threshold only where I need it and can be sure that the older version of TreeTagger is used.

reckart commented 1 year ago

Wrt. the probabilities parameter: this might be a detail worth communicating to HS. Maybe it sparks an idea about what the root problem could be.