reckart / tt4j

TreeTagger for Java
http://reckart.github.io/tt4j/
Apache License 2.0

Could some last "special" characters lead to lost tree-tagger processes? #6

Closed reckart closed 9 years ago

reckart commented 9 years ago

Original issue 6 created by reckart on 2011-10-21T15:06:10.000Z:

Hello,

When I apply tree-tagger with the Chinese parameter file provided by S. Sharoff to a file tokenized by the Chinese Segmenter provided by E. Peterson (mandarintools.com), using the command

tree-tagger -quiet -no-unknown -sgml -token -lemma /usr/local/share/tree-tagger/lib/zh.par zh-test.txt

it produces the output in the attached file zh-test.res.

However, I think I have the following problem with tt4j: the tree-tagger process doesn't end even though every token has been processed successfully!

Could it be due to a jump to skipToken in the method removeProblematicTokens of the class TreeTaggerWrapper?

Thanks in advance, Jérôme Rocheteau

PS I use tt4j within the following UIMA wrapper:

http://code.google.com/p/ttc-project/source/browse/trunk/modules/uima-tree-tagger-wrapper/sources/fr/univnantes/lina/uima/engines/TreeTaggerWrapper.java

It could have some bugs.

reckart commented 9 years ago

Comment #1 originally posted by reckart on 2011-10-21T15:11:52.000Z:

The file attached to this comment provides the logs of the previous process. It is missing the last line: « INFO: Stop Treetagger »

That's the bug I would like to fix!

Thanks in advance, Jérôme R

reckart commented 9 years ago

Comment #2 originally posted by reckart on 2011-10-21T15:24:49.000Z:

Hi Jérôme,

I am not sure I understand your problem. I gather that you get the expected output, but you notice that the tree-tagger process is still running at the end. If this is your problem, then it's a feature in TT4J and a bug in your wrapper. Override the "destroy()" method in your UIMA wrapper and invoke TreeTaggerWrapper.destroy() there to stop the background process.
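A minimal sketch of that suggestion follows. It uses stand-in types rather than the real UIMA and TT4J classes so that it compiles on its own: `Tagger`, `TaggerComponent` and `DestroyDemo` are hypothetical names invented for illustration. Only the idea of calling destroy() on the wrapped tagger from the component's own destroy() comes from the comment above.

```java
// Hypothetical stand-in for the tagger wrapper (TT4J's TreeTaggerWrapper
// plays this role in the real setup; it is not used here).
interface Tagger {
    void destroy(); // stops the background tree-tagger process
}

// Hypothetical stand-in for a UIMA component that owns a tagger.
class TaggerComponent {
    private final Tagger tagger;

    TaggerComponent(Tagger tagger) {
        this.tagger = tagger;
    }

    // Lifecycle hook, called once when the component is shut down.
    // Without this delegation the external process would keep running.
    public void destroy() {
        tagger.destroy();
    }
}

class DestroyDemo {
    // Returns true if shutting down the component also stopped the tagger.
    static boolean run() {
        final boolean[] stopped = {false};
        // Fake tagger that only records that destroy() was called.
        TaggerComponent component = new TaggerComponent(() -> stopped[0] = true);
        component.destroy();
        return stopped[0];
    }

    public static void main(String[] args) {
        System.out.println("tagger stopped: " + run());
    }
}
```

In a real wrapper the same delegation would go into the destroy() override of the UIMA component, with the actual TreeTaggerWrapper instance in place of the fake tagger.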

Also, a comprehensive implementation of a UIMA integration for TreeTagger with TT4J can be found here:

http://code.google.com/p/dkpro-core-asl/source/browse/de.tudarmstadt.ukp.dkpro.core-asl/trunk/de.tudarmstadt.ukp.dkpro.core.treetagger

Maybe you want to use that instead of writing the whole thing again from scratch.

-- Richard

reckart commented 9 years ago

Comment #3 originally posted by reckart on 2011-10-24T15:41:34.000Z:

Hi Richard,

It's not a bug in the UIMA wrapper. Attached you'll find a CLI for tt4j. The problem remains the same. I turned on trace mode (see zh-test.dbg attached).

The fact is that the tt4j reader doesn't receive the ENDOFTEXT tag "" although it has been sent by the tt4j writer!

Thanks in advance, Jérôme

PS: I wouldn't have written another UIMA wrapper for tree-tagger if I had known about yours before :) It looks great.

reckart commented 9 years ago

Comment #4 originally posted by reckart on 2011-10-24T17:23:29.000Z:

Thank you for investigating the issue. I'll have a look at it as soon as possible. Meanwhile, if you are inclined to continue investigating, I suggest you try adding more ".\n" to the flush sequence in http://code.google.com/p/tt4j/source/browse/tt4j/trunk/org.annolab.tt4j/src/main/java/org/annolab/tt4j/DefaultModel.java. Since the data in your zh-test.dbg shows that input and output remain in sync until the end, increasing the length of the flush sequence is a good candidate for fixing the problem. Or maybe a different flush sequence is required for Chinese.

reckart commented 9 years ago

Comment #5 originally posted by reckart on 2011-10-24T17:37:23.000Z:

Ok. Setting up a test was faster than I thought ;) The problem is the flush sequence. It seems that tree-tagger ignores the "." which I usually use to flush the output. When I change the flush sequence to ".\n.\n.\n.\n.\n.\n.\n(\n)\n" it works fine. That also works for the other languages that I have tests for so far, so I think I'll just change the default flush sequence.
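To illustrate what the flush sequence does, here is a rough sketch of how the text written to the tagger's input could be assembled. It is not the TT4J code: `FlushSequenceDemo`, `buildInput` and `END_MARKER` are hypothetical names, the marker value is a placeholder because the real ENDOFTEXT tag is not shown in this thread, and placing the flush sequence after the marker is an assumption. Only the flush sequence string itself is taken from the comment above.

```java
import java.util.Arrays;
import java.util.List;

class FlushSequenceDemo {
    // Flush sequence quoted in the comment above.
    static final String FLUSH_SEQUENCE = ".\n.\n.\n.\n.\n.\n.\n(\n)\n";

    // Placeholder only: the real end-of-text marker is not shown in this thread.
    static final String END_MARKER = "<hypothetical-end-marker/>";

    // Builds the text sent to the tagger: one token per line, then the end
    // marker, then the flush sequence (assumed order), whose extra tokens are
    // meant to push the marker through the tagger's output.
    static String buildInput(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token).append('\n');
        }
        sb.append(END_MARKER).append('\n');
        sb.append(FLUSH_SEQUENCE);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildInput(Arrays.asList("token1", "token2")));
    }
}
```

The reader side then waits until it sees the end marker in the tagger's output. If the tagger drops the flush tokens, as it apparently did with "." for the Chinese parameter file, the marker never comes out and the reader keeps waiting.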

reckart commented 9 years ago

Comment #6 originally posted by reckart on 2011-10-24T18:43:38.000Z:

The changed flush sequence is in release 1.0.16 which should arrive in an hour or so on Maven Central. It worked for me in a test case that I set up with the DKPro Core TreeTagger wrapper. It should work for you as well.

reckart commented 9 years ago

Comment #7 originally posted by reckart on 2011-10-25T07:54:14.000Z:

Thank you very much, Richard. It works fine for me too :)
