ixa-ehu / ixa-pipe-nerc

IXA pipes Named Entity Tagger (http://ixa2.si.ehu.es/ixa-pipes).
Apache License 2.0

unicode characters are inserted into comments that java can't parse #10

Closed vanatteveldt closed 8 years ago

vanatteveldt commented 8 years ago

(cc @rubenIzquierdo, @piekvossen)

When the NAF file contains a smiley or another unusual character, ixa-pipe-nerc writes XML comments that later break the Java XML parser (which apparently can't handle such Unicode characters in comments; see https://bugs.openjdk.java.net/browse/JDK-8072081).

The test.naf file (available at i.amcat.nl/test.naf) was generated with Alpino and the ixa-pipe-tok / morphosyntactic_parser_nl modules; the latter was recently fixed to escape Unicode in comments (https://github.com/cltl/morphosyntactic_parser_nl/issues/9).
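
For illustration, escaping comment text before it is written could look like the sketch below. This is a minimal Python sketch, not the actual change made in morphosyntactic_parser_nl: the function name `escape_comment_text` is hypothetical, and the use of backslash escapes is an assumption (any ASCII-only representation would serve the same purpose).

```python
def escape_comment_text(text: str) -> str:
    """Return an ASCII-only version of `text` for use inside an XML comment.

    Assumption: non-ASCII characters are replaced with Python-style
    backslash escapes (\\xNN, \\uXXXX or \\UXXXXXXXX). ASCII is left as-is.
    """
    return text.encode("ascii", "backslashreplace").decode("ascii")


if __name__ == "__main__":
    # The emoji from test.naf (U+1F62F) becomes a plain ASCII escape.
    print(escape_comment_text("dp/dp(Was, \U0001F62F)"))
```

Character references such as `&#x1f62f;` are not expanded inside XML comments, so they would also stay literal text there; the escape only needs to keep the raw character out of the comment.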

Running this through ixa-pipe-nerc causes the comments to be recreated (probably by parsing and then re-serializing the NAF file), resulting in test2.naf (i.amcat.nl/test.naf):

$ java -jar modules/ixa-pipe-nerc/target/ixa-pipe-nerc-1.6.0-exec.jar tag -m tools/nerc-models-1.5.4/nl/nl-6-class-clusters-sonar.bin < /tmp/test.naf > /tmp/test2.naf
CLI options: Namespace(lexer=off, model=tools/nerc-models-1.5.4/nl/nl-6-class-clusters-sonar.bin, dictPath=off, outputFormat=naf, dictTag=off, language=null, clearFeatures=no)
ixa-pipe-nerc model loaded in: 5136 miliseconds ... [DONE]

$ grep -C1 "t_5" /tmp/test{,2}.naf
/tmp/test.naf-    </term>
/tmp/test.naf:    <term id="t_5" lemma="😯" pos="noun" morphofeat="N(soort,ev,basis,zijd,stan)" type="open">
/tmp/test.naf-      <span>
--
/tmp/test.naf-        <span>
/tmp/test.naf:          <target id="t_5"/>
/tmp/test.naf-        </span>
--
/tmp/test.naf-    </dep>
/tmp/test.naf:    <dep from="t_0" to="t_5" rfunc="dp/dp">
/tmp/test.naf-      <!-- dp/dp(verb:ben,noun:) -->
--
/tmp/test2.naf-    <!--😯-->
/tmp/test2.naf:    <term id="t_5" type="open" lemma="&#x1f62f;" pos="noun" morphofeat="N(soort,ev,basis,zijd,stan)">
/tmp/test2.naf-      <span>
--
/tmp/test2.naf-    <!--dp/dp(Was, 😯)-->
/tmp/test2.naf:    <dep from="t_0" to="t_5" rfunc="dp/dp" />
/tmp/test2.naf-    <!--- / -(Was, ?)-->
--
/tmp/test2.naf-        <span>
/tmp/test2.naf:          <target id="t_5" />
/tmp/test2.naf-        </span>
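
To check a NAF file for such comments without relying on a Java parser, a small script can scan the raw text. The sketch below is an assumption-laden diagnostic, not part of any ixa-pipes module: the function name `comments_with_astral_chars` is hypothetical, and it uses a regular expression rather than a real XML parser.

```python
import re
from typing import List

# XML comments cannot contain "--", so a non-greedy match up to "-->" is enough.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def comments_with_astral_chars(xml_text: str) -> List[str]:
    """Return the text of every comment containing a character above U+FFFF."""
    return [
        body
        for body in COMMENT_RE.findall(xml_text)
        if any(ord(ch) > 0xFFFF for ch in body)
    ]


if __name__ == "__main__":
    sample = (
        "<!--\U0001F62F-->\n"
        '<term id="t_5" lemma="&#x1f62f;"/>\n'
        "<!--dp/dp(Was, \U0001F62F)-->\n"
        "<!--plain ASCII comment-->\n"
    )
    for comment in comments_with_astral_chars(sample):
        print(repr(comment))
```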

This file then chokes the Java XML parser, both in this module and in others:

$ java -jar modules/ixa-pipe-nerc/target/ixa-pipe-nerc-1.6.0-exec.jar tag -m tools/nerc-models-1.5.4/nl/nl-6-class-clusters-sonar.bin < /tmp/test2.naf 
CLI options: Namespace(lexer=off, model=tools/nerc-models-1.5.4/nl/nl-6-class-clusters-sonar.bin, dictPath=off, outputFormat=naf, dictTag=off, language=null, clearFeatures=no)
Exception in thread "main" org.jdom2.input.JDOMParseException: Error on line 60: An invalid XML character (Unicode: 0xd83d) was found in the comment.
[...]
$ bash modules/OntoTagger/scripts/predicate-matrix-tagger.sh  < /tmp/test2.naf
org.xml.sax.SAXParseException; lineNumber: 60; columnNumber: 11; An invalid XML character (Unicode: 0xd83d) was found in the comment.
[...]
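
The value 0xd83d in both error messages is consistent with the emoji being seen as a UTF-16 surrogate pair rather than as one code point: U+1F62F lies above U+FFFF, so in UTF-16 it is encoded as two code units, the first of which is 0xD83D. A short Python sketch of the standard UTF-16 arithmetic (the helper name `utf16_surrogates` is made up for this example):

```python
from typing import Tuple


def utf16_surrogates(ch: str) -> Tuple[int, int]:
    """Return the (high, low) UTF-16 surrogates for a character above U+FFFF."""
    code_point = ord(ch)
    if code_point < 0x10000:
        raise ValueError("character is in the BMP and has no surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


if __name__ == "__main__":
    high, low = utf16_surrogates("\U0001F62F")
    print(hex(high), hex(low))  # prints: 0xd83d 0xde2f
```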

All modules were freshly pulled from github and installed using this script: https://github.com/vanatteveldt/newsreader_pipe_nl/blob/master/install.sh

Some system info:

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.3 LTS"
$ uname -a
Linux study-linux 3.13.0-71-generic #114-Ubuntu SMP Tue Dec 1 02:34:22 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
wva@study-linux: {master} ~/newsreader_pipe_nl
$ java -version
java version "1.7.0_101"
OpenJDK Runtime Environment (IcedTea 2.6.6) (7u101-2.6.6-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)

I also tested this on a machine with Sun (Oracle) Java, with the same result:

$  java -jar modules/ixa-pipe-nerc/target/ixa-pipe-nerc-1.6.0-exec.jar tag -m tools/nerc-models-1.5.4/nl/nl-6-class-clusters-sonar.bin < /tmp/test.naf > /tmp/test2b.naf
CLI options: Namespace(dictTag=off, clearFeatures=no, dictPath=off, model=tools/nerc-models-1.5.4/nl/nl-6-class-clusters-sonar.bin, language=null, outputFormat=naf, lexer=off)
ixa-pipe-nerc model loaded in: 7421 miliseconds ... [DONE]
$ java -jar modules/ixa-pipe-nerc/target/ixa-pipe-nerc-1.6.0-exec.jar tag -m tools/nerc-models-1.5.4/nl/nl-6-class-clusters-sonar.bin < /tmp/test2b.naf 
CLI options: Namespace(dictTag=off, clearFeatures=no, dictPath=off, model=tools/nerc-models-1.5.4/nl/nl-6-class-clusters-sonar.bin, language=null, outputFormat=naf, lexer=off)
Exception in thread "main" org.jdom2.input.JDOMParseException: Error on line 60: An invalid XML character (Unicode: 0xd83d) was found in the comment.
[...]
wva@AHV-ID-3523:~/newsreader_pipe_nl$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.4 LTS
Release:    14.04
Codename:   trusty
wva@AHV-ID-3523:~/newsreader_pipe_nl$ uname -a
Linux AHV-ID-3523 3.16.0-60-generic #80~14.04.1-Ubuntu SMP Wed Jan 20 13:37:48 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
wva@AHV-ID-3523:~/newsreader_pipe_nl$ java -version
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)

ragerri commented 8 years ago

Hi Wouter,

Zuhaitz updated the kaflib parser and I updated the dependencies for the tok, pos and nerc pipes.

Rodrigo