reckart / tt4j

TreeTagger for Java
http://reckart.github.io/tt4j/
Apache License 2.0

Support text containing XML/SGML tags #24

Open simonmeoni opened 8 years ago

simonmeoni commented 8 years ago

I have a problem when I execute this code. I only removed the -sgml argument, and the absence of that argument alone causes the problem. The process never terminates: the program enters an infinite loop when it executes the process function. The infinite loop is at line 591 of the TreeTaggerWrapper.class file. I tried to debug it, but without success. Do you have any idea where the problem is? Thanks in advance, Simon

reckart commented 8 years ago

TT4J wraps the actual text in tags like "<This-is-the-start-of-the-text />" and "<This-is-the-end-of-the-text />". If it doesn't see these tags again, unchanged, in the output, it will hang. Cf. line 969 in TreeTaggerWrapper 1.2.1.
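To illustrate the mechanism described above, here is a minimal sketch (not the TT4J source) of a reader that looks for the markers by exact comparison. The class name `MarkerScanDemo`, the method `countTokens`, and the sample output lines (including the `UH`, `SYM` and `<unknown>` columns) are hypothetical; it assumes that without -sgml the marker comes back with extra tab-separated columns attached.

```java
import java.util.Arrays;
import java.util.List;

public class MarkerScanDemo {
    // Marker strings as quoted later in this thread.
    public static final String STARTOFTEXT = "<This-is-the-start-of-the-text />";
    public static final String ENDOFTEXT = "<This-is-the-end-of-the-text />";

    /**
     * Counts the lines between the two markers using exact comparison.
     * Returns -1 if the end marker never appears verbatim. A real reader
     * blocked on a process stream would wait forever in that case.
     */
    public static int countTokens(List<String> outputLines) {
        boolean inText = false;
        int tokens = 0;
        for (String line : outputLines) {
            if (line.equals(STARTOFTEXT)) {
                inText = true;
                continue;
            }
            if (line.equals(ENDOFTEXT)) {
                return tokens;
            }
            if (inText) {
                tokens++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical output where the markers are echoed back unchanged.
        List<String> echoed = Arrays.asList(
                STARTOFTEXT, "Hello\tUH\thello", ENDOFTEXT);
        // Hypothetical output where the markers were tagged like normal tokens.
        List<String> tagged = Arrays.asList(
                STARTOFTEXT + "\tSYM\t<unknown>",
                "Hello\tUH\thello",
                ENDOFTEXT + "\tSYM\t<unknown>");

        System.out.println(countTokens(echoed)); // prints 1
        System.out.println(countTokens(tagged)); // prints -1
    }
}
```

In the second case the exact comparison never matches, which is the situation in which a blocking reader would hang.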

simonmeoni commented 8 years ago

The problem is due to these two variables:

    private static final String STARTOFTEXT = "<This-is-the-start-of-the-text />";
    private static final String ENDOFTEXT = "<This-is-the-end-of-the-text />";

TreeTagger needs to ignore these SGML tags to work correctly with the wrapper. Is it possible not to send these two strings? I think the problem comes from here (line 1120 of TreeTaggerWrapper.class):

    void run()
    {
        try {
            final OutputStream os = _proc.getOutputStream();

            _pw = new PrintWriter(new BufferedWriter(
                new OutputStreamWriter(os, _model.getEncoding())));

            send(STARTOFTEXT);

            while (tokenIterator.hasNext()) {
                O token = tokenIterator.next();
                _lastTokenWritten = token;
                _tokensWritten++;
                send(getText(token));
            }

            send(ENDOFTEXT);
            send(_model.getFlushSequence());
        }
        catch (final Throwable e) {
            _exception = e;
        }
    }

Thanks in advance, Simon

simonmeoni commented 8 years ago

I have found a solution. I replaced line 969 of TreeTaggerWrapper.class with this:

                if (outRecord.contains(STARTOFTEXT)) {
                    inText = true;
                    if (TRACE) {
                        System.err.println("["+TreeTaggerWrapper.this+
                                "|TRACE] ("+_tokensRead+") START ["+outRecord+"]");
                    }
                    continue;
                }

                if (outRecord.contains(ENDOFTEXT)) {
                    if (TRACE) {
                        System.err.println("["+TreeTaggerWrapper.this+
                                "|TRACE] ("+_tokensRead+") COMPLETE ["+outRecord+"]");
                    }
                    break;
                }

and it works when I don't pass the -sgml option :).

reckart commented 8 years ago

Thanks for testing this. I'll implement a different solution though that doesn't change existing behavior. What I will do is: check if the "-sgml" flag is present (the default). If the flag is present, continue with the present code. If the flag is not present, try checking specifically if the token text is the start/end marker, probably using "startsWith" instead of "contains".
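The approach described above could look roughly like the following sketch. This is not the actual TT4J change: the class `MarkerMatcher` and its method `isMarker` are hypothetical, the existing check is assumed to be an exact comparison, and the sample tagged line (with `SYM` and `<unknown>` columns) is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class MarkerMatcher {
    // Marker strings as quoted earlier in this thread.
    public static final String STARTOFTEXT = "<This-is-the-start-of-the-text />";
    public static final String ENDOFTEXT = "<This-is-the-end-of-the-text />";

    private final boolean sgmlMode;

    /** Remembers whether "-sgml" is among the arguments passed to TreeTagger. */
    public MarkerMatcher(List<String> treeTaggerArgs) {
        this.sgmlMode = treeTaggerArgs.contains("-sgml");
    }

    /**
     * With -sgml, keep the strict behavior (assumed here to be an exact match).
     * Without -sgml, accept any output record that starts with the marker,
     * since extra columns may follow it.
     */
    public boolean isMarker(String outRecord, String marker) {
        return sgmlMode ? outRecord.equals(marker) : outRecord.startsWith(marker);
    }

    public static void main(String[] args) {
        MarkerMatcher withSgml = new MarkerMatcher(
                Arrays.asList("-token", "-lemma", "-sgml"));
        MarkerMatcher withoutSgml = new MarkerMatcher(
                Arrays.asList("-token", "-lemma"));

        // Hypothetical output line where the marker was tagged like a token.
        String tagged = ENDOFTEXT + "\tSYM\t<unknown>";

        System.out.println(withSgml.isMarker(ENDOFTEXT, ENDOFTEXT)); // prints true
        System.out.println(withSgml.isMarker(tagged, ENDOFTEXT));    // prints false
        System.out.println(withoutSgml.isMarker(tagged, ENDOFTEXT)); // prints true
    }
}
```

Using `startsWith` rather than `contains` is stricter: a record only counts as a marker when the marker is the first thing on the line, so a marker string appearing later in a record is not mistaken for the start or end of the text.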

reckart commented 8 years ago

@Alpha34587 could you please check if the changes I made work for you as well?

simonmeoni commented 8 years ago

Yes, the change looks good to me :) Thanks!