Open TobiasNx opened 2 months ago
The behaviour regarding unicode characters seems to have changed in 844aa74. Previously, StringEscapeUtils.escapeXml()
used to escape unicode characters to their numerical \u
equivalent.
I had a similar issue with Catmandu some years ago.
Two things are mixed up here, i think:
&
using XML entitiesThe issue pointed out by @TobiasNx belongs to 2.
See also https://stackoverflow.com/questions/730133/what-are-invalid-characters-in-xml
Two things are mixed up here, i think:
What makes you think that? Syntax markers (1) are still escaped. AFAIUI, we're only talking about escaping invalid characters (2) here.
What makes you think that?
The docs of StringEscapeUtils.escapeXml()
say it does 1. and besides of that:
Note that unicode characters greater than 0x7f are currently escaped to their numerical \u equivalent. This may change in future releases.
That's why i think it does not 2. (remove or replace invalid characters).
As a side note, from the logs of QA catalogue:
Apr 30, 2024 8:55:35 AM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processFile
INFO: processing: oersiMarcDump.xml.tar.gz
[main] INFO org.reflections.Reflections - Reflections took 300 ms to scan 1 urls, producing 3 keys and 453 values
[Fatal Error] :3319174:57: An invalid XML character (Unicode: 0xb) was found in the element content of the document.
0xb
is below 0x7f
according to https://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?number=1024&names=-&utf8=0x so this control character won't be handled by escapeXml()
at all, i think.
I think the title of this issue is misleading, too.
Now I see what you mean, thanks. I haven't actually looked at the character values in question ;)
This implies that the behaviour in Metafacture hasn't changed, but rather that it's always been this way (with regard to invalid XML characters below 0x7f
).
This also means that simply passing escapeUnicode=true
to XmlUtil.escape()
won't affect all invalid characters (namely, those below 0x7f
).
Yep. See https://www.w3.org/TR/xml/#charsets for all invalid and "discouraged" characters in XML.
The behaviour regarding unicode characters seems to have changed in https://github.com/metafacture/metafacture-core/commit/844aa74334d672a7a19b3a98ce8b979ef58d5a1a.
Looking at SimpleXmlWriter.java
it seems to me that XML is being written "manually" (eg. builder.append("<");
) and not by a XML library which would capture those invalid characters.
I would suggest the following:
escapeUnicode
on SimpleXmlEncoder
and MarcXmlEncoder
(if there is a use case).SimpleXmlEncoder
and MarcXmlEncoder
(as well as XmlElementSplitter
?) may produce invalid XML. (Could probably be fixed in XmlUtil.escapeCodePoint()
, regardless of escapeUnicode
.)
https://metafacture.org/playground/?flux=inputFile%0A%7Copen-file%0A%7Cas-lines%0A%7Cdecode-json%0A%7Cencode-xml%28rootTag%3D%22collection%22%29%0A%7Cprint%0A%3B&data=%7B%22name%22%3A+%22Open+Educational+Resources+-+\u000Beine+kritische+Einf%C3%BChrung%22%7D
When special unicode characters are part of the transformed metadata encode-xml and encode-marcXml might transform the metadata to invalid xml.
There should be the possibility to create valid xml at least by adding an option for escping unicode characters.