Running ixa-pipe-nerc on a NAF file that contains a smiley or other unusual character causes comments to be written that later break the Java XML parser (which apparently can't handle Unicode in comments; see https://bugs.openjdk.java.net/browse/JDK-8072081).
Running test.naf through ixa-pipe-nerc causes the comments to be recreated (probably because the module parses and then re-serializes the NAF file), resulting in test2.naf (i.amcat.nl/test.naf).
The resulting file then chokes the Java parser in this and other modules:
$ java -jar modules/ixa-pipe-nerc/target/ixa-pipe-nerc-1.6.0-exec.jar tag -m tools/nerc-models-1.5.4/nl/nl-6-class-clusters-sonar.bin < /tmp/test2.naf
CLI options: Namespace(lexer=off, model=tools/nerc-models-1.5.4/nl/nl-6-class-clusters-sonar.bin, dictPath=off, outputFormat=naf, dictTag=off, language=null, clearFeatures=no)
Exception in thread "main" org.jdom2.input.JDOMParseException: Error on line 60: An invalid XML character (Unicode: 0xd83d) was found in the comment.
[...]
$ bash modules/OntoTagger/scripts/predicate-matrix-tagger.sh < /tmp/test2.naf
org.xml.sax.SAXParseException; lineNumber: 60; columnNumber: 11; An invalid XML character (Unicode: 0xd83d) was found in the comment.
[...]
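Both parsers point at a comment on line 60. As a rough aid for locating such comments in other files, the following sketch scans the raw text for XML comments that contain any non-ASCII character. The function name `find_suspect_comments` is hypothetical, and the regex-based matching is an assumption that holds only for simple files (it does not account for CDATA sections, for example).

```python
import re

# Matches an XML comment, including comments that span several lines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def find_suspect_comments(xml_text):
    """Return (line_number, comment_text) for each comment with a non-ASCII character."""
    hits = []
    for match in COMMENT_RE.finditer(xml_text):
        body = match.group(1)
        if any(ord(ch) > 127 for ch in body):
            # Line numbers are 1-based, counted from the start of the text.
            line_number = xml_text.count("\n", 0, match.start()) + 1
            hits.append((line_number, body))
    return hits
```

It could be run on the contents of /tmp/test2.naf, read as UTF-8 text, to list the comments that are likely to trigger the exception.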
(cc @rubenIzquierdo, @piekvossen)
The test.naf file (available at i.amcat.nl/test.naf) was generated with Alpino and the ixa-pipe-tok / morphosyntactic_parser_nl modules; the latter was recently fixed to escape Unicode in comments (https://github.com/cltl/morphosyntactic_parser_nl/issues/9).
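As an illustration of what escaping Unicode in comments can look like, here is a minimal sketch that rewrites only the comment bodies of a NAF file and leaves all other content untouched. This is not necessarily how morphosyntactic_parser_nl does it; the function names are hypothetical, and the regex-based comment matching is an assumption that ignores corner cases such as CDATA sections.

```python
import re

# Matches an XML comment, including comments that span several lines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def escape_comment(match):
    """Rebuild one comment with every non-ASCII character replaced by an ASCII escape."""
    body = match.group(1)
    # "backslashreplace" emits escapes such as \xe9 or \U0001f600, which are plain ASCII.
    safe = body.encode("ascii", "backslashreplace").decode("ascii")
    return "<!--" + safe + "-->"


def escape_unicode_in_comments(xml_text):
    """Escape non-ASCII characters inside comments only; elements and text are unchanged."""
    return COMMENT_RE.sub(escape_comment, xml_text)
```

Applied to the output of ixa-pipe-nerc before it is passed to the next module, something along these lines might serve as a stopgap until the comments are escaped at the source.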
All modules were freshly pulled from GitHub and installed using this script: https://github.com/vanatteveldt/newsreader_pipe_nl/blob/master/install.sh
Some system info:
I also tested this on a machine with sun java, same result: