Open benadelm opened 4 years ago
Some googling suggests that many people experience similar problems due to this bug in Xalan. Does your code use Xalan (maybe indirectly through UIMA)?
Could be. Can you check if the problem is also in the current beta version of 2.0.0? I've updated the UIMA dependencies.
No, still the same exception.
When the attached UTF-8 text file (Unicode-Test.txt) is imported into CorefAnnotator and then saved, the attached XMI file is generated (Unicode-Test-xmi.txt, originally Unicode-Test.xmi, but GitHub does not allow me to upload .xmi files), which in turn cannot be opened again:
(The same error occurs when trying to load that file in a different program with Java’s SAX parser for XML.)
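A minimal sketch of how the SAX failure could be checked in isolation. It assumes the saved file contains the two surrogate halves of U+1F602 as separate numeric character references (the suspicion raised later in this report); the element and attribute names are simplified placeholders, not the real XMI structure, and the class name `SurrogateReferenceCheck` is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class SurrogateReferenceCheck {

    /** Returns true if the JDK SAX parser accepts the given XML string. */
    static boolean parses(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, e.g. an invalid character reference.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, simplified stand-in for the sofa element:
        // each surrogate half written as its own character reference.
        System.out.println(parses("<sofa sofaString=\"&#55357;&#56834;\"/>"));
        // The same character written as a single code point reference.
        System.out.println(parses("<sofa sofaString=\"&#128514;\"/>"));
    }
}
```

XML 1.0 excludes the surrogate range U+D800 to U+DFFF from its set of legal characters, so a conforming parser should reject the first document and accept the second.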
There is only one Unicode character in the text file: 😂 U+1F602 FACE WITH TEARS OF JOY
This character is displayed correctly in the editor window after importing the text file; just saving it does not seem to work. Judging from the column number given in the error message, the problem lies in the sofaString of the following sofa:
Since U+1F602 is a code point outside the Basic Multilingual Plane (BMP), Java’s internal String representation (UTF-16) needs two chars to represent it. It looks like those two chars are escaped individually, which seems to be invalid in XML.
When using Java’s javax.xml.transform.Transformer to create an XML file for an org.w3c.dom.Document where the value of an attribute is set to U+1F602 (that is, to "\uD83D\uDE02"), that attribute value becomes "😂", so I think the above sofa should look like this:
Occurred in this release of CorefAnnotator with Java 13; the javax.xml.transform.Transformer test program delivered the above-mentioned output both when run with Java 13 and when run with Java 8.
Full stack trace of the exception:
Unicode-Test.txt Unicode-Test-xmi.txt
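To illustrate the surrogate-pair reasoning in the report, here is a hedged sketch that contrasts escaping per UTF-16 char with escaping per code point, and then serializes a small DOM document with javax.xml.transform.Transformer in the way the test program is described. The class and method names are hypothetical, the escaping helpers are simplified (they do not handle &, < or quotes), and nothing here is taken from the CorefAnnotator, UIMA or Xalan sources. The Transformer output depends on which implementation is on the classpath, so no particular result is claimed for it.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SupplementaryCharDemo {

    /** Suspected faulty behaviour: every non-ASCII UTF-16 char is escaped on its own. */
    static String escapePerChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Expected behaviour: a surrogate pair is combined into one code point first. */
    static String escapePerCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    /** Serializes a one-element document whose attribute holds the given value. */
    static String serializeWithTransformer(String value) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("sofa"); // placeholder element name
        root.setAttribute("sofaString", value);
        doc.appendChild(root);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String joy = "\uD83D\uDE02"; // U+1F602 as a surrogate pair
        System.out.println(joy.length());                        // 2
        System.out.println(joy.codePointCount(0, joy.length())); // 1
        System.out.println(escapePerChar(joy));                  // &#55357;&#56834;
        System.out.println(escapePerCodePoint(joy));             // &#128514;
        System.out.println(serializeWithTransformer(joy));       // implementation-dependent
    }
}
```

The per-char output is the form an XML parser is expected to reject, while the per-code-point form is a legal character reference for U+1F602.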