nilsreiter / CorefAnnotator

Annotation tool for coreference
Apache License 2.0

Non-BMP Unicode code points break XMI files #338

Open benadelm opened 4 years ago

benadelm commented 4 years ago

When the attached UTF-8 text file (Unicode-Test.txt) is imported into CorefAnnotator and then saved, the attached XMI file is generated (Unicode-Test-xmi.txt, originally Unicode-Test.xmi, but GitHub does not allow me to upload .xmi files), which in turn cannot be opened again:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 2672; Character reference "&#55357" is an invalid XML character.

(The same error occurs when trying to load that file in a different program with Java’s SAX parser for XML.)
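The parser behaviour can be checked in isolation with a few lines of Java. This is only a minimal sketch: the class name `SurrogateReferenceCheck`, the helper `parseError`, and the one-element test documents are made up for illustration and are not part of CorefAnnotator or UIMA. It feeds the JDK's default SAX parser one attribute with the two separate references and one with the single combined reference.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class SurrogateReferenceCheck {

    /** Returns the parser's error message, or null if the XML is well-formed. */
    static String parseError(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, so malformed input ends up here.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Two references, one per UTF-16 surrogate (the form found in the XMI file).
        System.out.println(parseError("<sofa sofaString=\"&#55357;&#56834;\"/>"));
        // One reference for the whole code point U+1F602.
        System.out.println(parseError("<sofa sofaString=\"&#128514;\"/>"));
    }
}
```

The first call should report an error and the second should return null, which would confirm that the file itself is malformed rather than the loader being at fault.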

The text file contains exactly one character: 😂 U+1F602 FACE WITH TEARS OF JOY

This character is displayed correctly in the editor window after importing the text file; just saving it does not seem to work. Judging from the column number given in the error message, the problem lies in the sofaString of the following sofa:

<cas:Sofa xmi:id="12" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="&#55357;&#56834;"/>

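Since U+1F602 is a code point outside the Basic Multilingual Plane (BMP), Java’s internal String representation (UTF-16) needs two chars, a surrogate pair, to represent it. It looks like those two chars are escaped individually. That is invalid in XML: the surrogate code points U+D800 to U+DFFF are excluded from the characters XML 1.0 permits, so a character reference to one of them is not well-formed.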
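The relation between the two numbers in the broken file and the code point can be seen directly in Java. A small sketch (the class name `SurrogatePairDemo` and the helper `describe` are illustrative only):

```java
import java.util.Arrays;

final class SurrogatePairDemo {

    /** Lists the UTF-16 code units and the Unicode code points of a string. */
    static String describe(String s) {
        int[] units = s.chars().toArray();            // one entry per Java char
        int[] codePoints = s.codePoints().toArray();  // one entry per Unicode code point
        return "units=" + Arrays.toString(units)
                + " codePoints=" + Arrays.toString(codePoints);
    }

    public static void main(String[] args) {
        // U+1F602 written as its surrogate pair
        System.out.println(describe("\uD83D\uDE02"));
        // prints: units=[55357, 56834] codePoints=[128514]
    }
}
```

The two code units are exactly the values in `&#55357;&#56834;`, and the single code point is the value in `&#128514;`.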

When using Java’s javax.xml.transform.Transformer to create an XML file for a org.w3c.dom.Document where the value of an attribute is set to U+1F602 (that is, to "\uD83D\uDE02"), that attribute value becomes "&#128514;", so I think the above sofa should look like this:

<cas:Sofa xmi:id="12" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="&#128514;"/>
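Because the difference between the generated and the expected sofa is mechanical, an already broken file could probably be repaired by merging each pair of surrogate references into one. The following is a rough sketch of that idea, not something CorefAnnotator or UIMA provides: `XmiSurrogateRepair` and `repair` are hypothetical names, the method works on the raw file text, and it assumes decimal references as in the file above (hexadecimal `&#x...;` references are not handled).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmiSurrogateRepair {

    // Two adjacent five-digit decimal character references.
    private static final Pattern PAIR = Pattern.compile("&#(\\d{5});&#(\\d{5});");

    /** Replaces each high/low surrogate reference pair by one code point reference. */
    static String repair(String xml) {
        StringBuilder out = new StringBuilder();
        Matcher m = PAIR.matcher(xml);
        int copied = 0; // text before this index is already in out
        int from = 0;   // index at which the next search starts
        while (m.find(from)) {
            int high = Integer.parseInt(m.group(1));
            int low = Integer.parseInt(m.group(2));
            boolean isPair = high >= 0xD800 && high <= 0xDBFF
                    && low >= 0xDC00 && low <= 0xDFFF;
            if (isPair) {
                out.append(xml, copied, m.start());
                out.append("&#")
                   .append(Character.toCodePoint((char) high, (char) low))
                   .append(';');
                copied = m.end();
                from = m.end();
            } else {
                // Not a surrogate pair: retry one character further on.
                from = m.start() + 1;
            }
        }
        return out.append(xml, copied, xml.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(repair("sofaString=\"&#55357;&#56834;\""));
        // prints: sofaString="&#128514;"
    }
}
```

This only works around the symptom in existing files; the serializer would still write new files in the broken form.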

Occurred in CorefAnnotator 1.14.3 (the version shown in the stack trace below) with Java 13; the javax.xml.transform.Transformer test program produced the output above under both Java 13 and Java 8.
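The test program itself is not attached, so the following is only a reconstruction of what such a test might look like. The class name `TransformerSurrogateTest`, the element name `sofa`, and both helper methods are assumptions. It serializes a DOM document whose attribute holds U+1F602 and then parses the result again to confirm that the value survives the round trip.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class TransformerSurrogateTest {

    /** Builds <sofa sofaString="..."/> and serializes it with the default Transformer. */
    static String serialize(String attributeValue) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("sofa");
        root.setAttribute("sofaString", attributeValue);
        doc.appendChild(root);
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    /** Parses the XML again and returns the attribute value. */
    static String readBack(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("sofaString");
    }

    public static void main(String[] args) throws Exception {
        String original = "\uD83D\uDE02"; // U+1F602 as a surrogate pair
        String xml = serialize(original);
        System.out.println(xml);
        System.out.println("round trip ok: " + original.equals(readBack(xml)));
    }
}
```

Which Transformer implementation `TransformerFactory.newInstance()` returns depends on the classpath, so running this with the CorefAnnotator jar on the classpath may give a different result than running it against the bare JDK.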

Full stack trace of the exception:

java.io.IOException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 2672; Character reference "&#55357" is an invalid XML character.
        at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
        at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
        at javax.swing.SwingWorker.get(SwingWorker.java:613) ~[?:?]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.done(JCasLoader.java:147) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at javax.swing.SwingWorker$5.run(SwingWorker.java:750) ~[?:?]
        at javax.swing.SwingWorker$DoSubmitAccumulativeRunnable.run(SwingWorker.java:847) ~[?:?]
        at sun.swing.AccumulativeRunnable.run(AccumulativeRunnable.java:112) ~[?:?]
        at javax.swing.SwingWorker$DoSubmitAccumulativeRunnable.actionPerformed(SwingWorker.java:857) ~[?:?]
        at javax.swing.Timer.fireActionPerformed(Timer.java:317) ~[?:?]
        at javax.swing.Timer$DoPostEvent.run(Timer.java:249) ~[?:?]
        at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:313) ~[?:?]
        at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:770) ~[?:?]
        at java.awt.EventQueue$4.run(EventQueue.java:721) ~[?:?]
        at java.awt.EventQueue$4.run(EventQueue.java:715) ~[?:?]
        at java.security.AccessController.doPrivileged(AccessController.java:391) [?:?]
        at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85) [?:?]
        at java.awt.EventQueue.dispatchEvent(EventQueue.java:740) [?:?]
        at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203) [?:?]
        at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124) [?:?]
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113) [?:?]
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109) [?:?]
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101) [?:?]
        at java.awt.EventDispatchThread.run(EventDispatchThread.java:90) [?:?]
Caused by: java.io.IOException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 2672; Character reference "&#55357" is an invalid XML character.
        at de.unistuttgart.ims.coref.annotator.plugins.DefaultImportPlugin.getJCas(DefaultImportPlugin.java:87) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.readFile(JCasLoader.java:104) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.doInBackground(JCasLoader.java:139) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.doInBackground(JCasLoader.java:33) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at javax.swing.SwingWorker$1.call(SwingWorker.java:304) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at javax.swing.SwingWorker.run(SwingWorker.java:343) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:830) ~[?:?]
Caused by: org.xml.sax.SAXParseException: Character reference "&#55357" is an invalid XML character.
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at org.apache.uima.cas.impl.XmiCasDeserializer.deserialize(XmiCasDeserializer.java:2066) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at org.apache.uima.cas.impl.XmiCasDeserializer.deserialize(XmiCasDeserializer.java:1983) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.plugins.DefaultImportPlugin.getJCas(DefaultImportPlugin.java:84) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.readFile(JCasLoader.java:104) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.doInBackground(JCasLoader.java:139) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at de.unistuttgart.ims.coref.annotator.worker.JCasLoader.doInBackground(JCasLoader.java:33) ~[CorefAnnotator-1.14.3-full.jar:1.14.3]
        at javax.swing.SwingWorker$1.call(SwingWorker.java:304) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at javax.swing.SwingWorker.run(SwingWorker.java:343) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:830) ~[?:?]

Unicode-Test.txt Unicode-Test-xmi.txt

benadelm commented 4 years ago

Some googling suggests that many people experience similar problems due to this bug in Xalan. Does your code use Xalan (maybe indirectly through UIMA)?
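One way to look into this is to ask JAXP which implementations it actually picks up. The stack trace above already shows org.apache.xerces classes being loaded from the CorefAnnotator-1.14.3-full.jar, so a bundled XML stack seems plausible. The sketch below is only a diagnostic under assumptions: `XmlImplementationProbe` is a made-up name, the two Xalan class names are the ones used by the Xalan-J distribution, and the result is only meaningful when run with the same classpath as CorefAnnotator. It also does not show whether UIMA's XMI writer goes through JAXP at all.

```java
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;

final class XmlImplementationProbe {

    /** Class name of the TransformerFactory that JAXP selects. */
    static String transformerFactoryClass() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    /** Class name of the SAXParserFactory that JAXP selects. */
    static String saxParserFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    /** True if the named class can be found by this class's class loader. */
    static boolean onClasspath(String className) {
        try {
            Class.forName(className, false, XmlImplementationProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("TransformerFactory: " + transformerFactoryClass());
        System.out.println("SAXParserFactory:   " + saxParserFactoryClass());
        // Class names from the stand-alone Xalan-J distribution.
        System.out.println("Xalan processor:    "
                + onClasspath("org.apache.xalan.processor.TransformerFactoryImpl"));
        System.out.println("Xalan serializer:   "
                + onClasspath("org.apache.xml.serializer.ToStream"));
    }
}
```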

nilsreiter commented 3 years ago

Could be. Can you check if the problem is also in the current beta version of 2.0.0? I've updated the UIMA dependencies.

benadelm commented 3 years ago

No, still the same exception.