metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0
69 stars 34 forks source link

Add option for encode-marcXml and encode-xml for escaping unicode characters #529

Open TobiasNx opened 2 months ago

TobiasNx commented 2 months ago

https://metafacture.org/playground/?flux=inputFile%0A%7Copen-file%0A%7Cas-lines%0A%7Cdecode-json%0A%7Cencode-xml%28rootTag%3D%22collection%22%29%0A%7Cprint%0A%3B&data=%7B%22name%22%3A+%22Open+Educational+Resources+-+\u000Beine+kritische+Einf%C3%BChrung%22%7D

When special unicode characters are part of the transformed metadata encode-xml and encode-marcXml might transform the metadata to invalid xml.

There should be the possibility to create valid xml at least by adding an option for escping unicode characters.

blackwinter commented 2 months ago

The behaviour regarding unicode characters seems to have changed in 844aa74. Previously, StringEscapeUtils.escapeXml() used to escape unicode characters to their numerical \u equivalent.

Phu2 commented 2 months ago

I had a similar issue with Catmandu some years ago.

Phu2 commented 2 months ago

Two things are mixed up here, i think:

  1. escaping characters like & using XML entities
  2. characters that are not allowed at all in XML

The issue pointed out by @TobiasNx belongs to 2.

See also https://stackoverflow.com/questions/730133/what-are-invalid-characters-in-xml

blackwinter commented 2 months ago

Two things are mixed up here, i think:

What makes you think that? Syntax markers (1) are still escaped. AFAIUI, we're only talking about escaping invalid characters (2) here.

Phu2 commented 2 months ago

What makes you think that?

The docs of StringEscapeUtils.escapeXml() say it does 1. and besides of that:

Note that unicode characters greater than 0x7f are currently escaped to their numerical \u equivalent. This may change in future releases.

That's why i think it does not 2. (remove or replace invalid characters).

As a side note, from the logs of QA catalogue:

Apr 30, 2024 8:55:35 AM de.gwdg.metadataqa.marc.cli.utils.RecordIterator processFile
INFO: processing: oersiMarcDump.xml.tar.gz
[main] INFO org.reflections.Reflections - Reflections took 300 ms to scan 1 urls, producing 3 keys and 453 values
[Fatal Error] :3319174:57: An invalid XML character (Unicode: 0xb) was found in the element content of the document.

0xb is below 0x7f according to https://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?number=1024&names=-&utf8=0x so this control character won't be handled by escapeXml() at all, i think.

Phu2 commented 2 months ago

I think the title of this issue is misleading, too.

blackwinter commented 2 months ago

Now I see what you mean, thanks. I haven't actually looked at the character values in question ;)

This implies that the behaviour in Metafacture hasn't changed, but rather that it's always been this way (with regard to invalid XML characters below 0x7f).

This also means that simply passing escapeUnicode=true to XmlUtil.escape() won't affect all invalid characters (namely, those below 0x7f).

Phu2 commented 2 months ago

Yep. See https://www.w3.org/TR/xml/#charsets for all invalid and "discouraged" characters in XML.

The behaviour regarding unicode characters seems to have changed in https://github.com/metafacture/metafacture-core/commit/844aa74334d672a7a19b3a98ce8b979ef58d5a1a.

Looking at SimpleXmlWriter.java it seems to me that XML is being written "manually" (eg. builder.append("<");) and not by a XML library which would capture those invalid characters.

blackwinter commented 2 months ago

I would suggest the following:

  1. Keep this issue as feature request for exposing an option to escapeUnicode on SimpleXmlEncoder and MarcXmlEncoder (if there is a use case).
  2. Open a new issue for the bug that SimpleXmlEncoder and MarcXmlEncoder (as well as XmlElementSplitter?) may produce invalid XML. (Could probably be fixed in XmlUtil.escapeCodePoint(), regardless of escapeUnicode.)