nytud / hunlp-GATE

Lang_Hungarian - a GATE plugin containing Hungarian NLP tools as GATE processing resources
GNU General Public License v3.0

IOException in HFST-Wrapper #20

Open DavidNemeskey opened 7 years ago

DavidNemeskey commented 7 years ago

For some input to the HFST Analyzer module I get a java.io.IOException with a stack trace or, more often, just the message "IO Exception: null". I guess which of the two appears depends on where the error occurs, i.e. on how many words have already been written to the stdin of the dead process.

Example output:

IO Exception: null
IO Exception: null
IO Exception: null
IO Exception: null
IO Exception: null
IO Exception: null
IO Exception: null
java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:297)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
    at hu.nytud.hfst.Analyzer$WorkerProcess.write(Analyzer.java:218)
    at hu.nytud.hfst.Analyzer$WorkerProcess.run(Analyzer.java:206)
    at java.lang.Thread.run(Thread.java:745)
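
The trace above ends in a flush on the stream that feeds the child process. As a minimal sketch of that class of failure (the `DeadPipeProbe` class and its `probe` method are hypothetical and not part of hunlp-GATE), the following starts a child, waits until it has exited, and only then writes to its stdin. It assumes a Unix-like system where `sh` is on the PATH. The exact exception message can differ between platforms and JDK versions, so the sketch only reports it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of hunlp-GATE: shows how a write to the
// stdin of an already-dead child process surfaces as an IOException.
final class DeadPipeProbe {

    /** Starts cmd, waits for it to exit, then tries to write text to its stdin. */
    static String probe(String[] cmd, String text) {
        Process p;
        int exitCode;
        try {
            p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            exitCode = p.waitFor();              // the child is gone from here on
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
        try (OutputStream stdin = p.getOutputStream()) {
            stdin.write(text.getBytes(StandardCharsets.UTF_8));
            stdin.flush();                       // buffered data reaches the pipe here
            return "child exited with code " + exitCode + ", write succeeded";
        } catch (IOException e) {
            return "child exited with code " + exitCode
                    + ", write failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumption: a Unix-like system with `sh` available.
        System.out.println(probe(new String[] {"sh", "-c", "exit 3"}, "token\n"));
    }
}
```

If the wrapper hits this situation, the child's exit code and anything it printed to stderr are probably the most useful things to log next to the exception.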

Example input from the Hungarian Webcorpus: ioexception.input.txt

The culprit is the very long token Pécs-Nagykanizsa-Graz-Aussee-Ischl-Salzburg-Zürich-Luzern-Rigire-Zürich-München-Linz-Bécs-Győr-Mohács-Pécs, but presumably other inputs could trigger the error as well. What is strange is that if I run hfst-lookup from the command line with the same parameters GATE uses, the input is processed without a hitch:

cat ioexception.input.txt | ../linux/hfst-lookup.sh --cascade=composition --xfst=print-pairs --xfst=print-space --pipe-mode -t 2 ../hu.hfstol

DavidNemeskey commented 7 years ago

The full example: I first ran the text through quntoken (quntoken qterror.txt) and extracted the non-whitespace tokens from its output; the resulting file is qterror.tokens.txt. Running hfst-lookup on that file, as described above, produced no errors. Running the same text through GATE produced the problems described above. I also printed all tokens sent to HFST-Wrapper, and they are exactly the same as those in qterror.tokens.txt, so the error must be somewhere in the wrapper.

qterror.txt

qterror.tokens.txt
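
One way to narrow this down further might be to drive the same command from plain Java, outside GATE, feeding it the tokens one per line. The sketch below is a stand-alone harness under that assumption. `LookupHarness` and `feed` are hypothetical names, and it does not reproduce the logic of the plugin's `Analyzer` class. It drains the child's output on a separate thread and checks that the child is still alive before each token.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-alone harness, not the plugin's Analyzer class.
final class LookupHarness {

    /** Sends one token per line to cmd and returns every line the child prints. */
    static List<String> feed(List<String> cmd, List<String> tokens) {
        try {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            List<String> out = Collections.synchronizedList(new ArrayList<>());

            // Drain the child's output concurrently so it never blocks on a full pipe.
            Thread reader = new Thread(() -> {
                try (BufferedReader r = new BufferedReader(
                        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        out.add(line);
                    }
                } catch (IOException e) {
                    out.add("reader failed: " + e);
                }
            });
            reader.start();

            try (Writer w = new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8)) {
                for (String token : tokens) {
                    if (!p.isAlive()) {
                        throw new IOException("child exited with code " + p.exitValue()
                                + " before token: " + token);
                    }
                    w.write(token);
                    w.write('\n');
                    w.flush();   // flush per token so a failure surfaces near the offending token
                }
            }
            reader.join();
            p.waitFor();
            return new ArrayList<>(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }

    // Usage: java LookupHarness.java <token-file> <command> [args...]
    public static void main(String[] args) throws IOException {
        List<String> tokens = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> cmd = Arrays.asList(args).subList(1, args.length);
        feed(cmd, tokens).forEach(System.out::println);
    }
}
```

Called with qterror.tokens.txt and the hfst-lookup.sh command line quoted earlier in this issue, a clean run here would point at the wrapper's own process handling, while a failure here would point at how the child reacts to being fed from Java.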