tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International
201 stars 70 forks source link

Darwin Core extension XML file building script. #487

Closed tucotuco closed 12 months ago

tucotuco commented 1 year ago

Seems to produce viable extension files. Seeking review.

baskaufs commented 1 year ago

Ran the output file through a validator and realized that there's a problem with unescaped quotes. In the description attribute for dcterms:bibliographicCitation there are quotes, so they prematurely end the attribute. The remainder of the description then isn't part of the value of an attribute, so the XML ends up malformed.

baskaufs commented 1 year ago

Possible fix:

from xml.sax.saxutils import escape

xml += f'    dc:description="{excape(extension.get("dc:description"))}">\n'
tucotuco commented 1 year ago

Thanks for pointing out the XML validity problem. I opted for a simple solution that doesn't invoke any additional libraries. Please review and merge if you agree.

baskaufs commented 1 year ago

That's not working -- it encoded all of the quotes everywhere in the XML: image

Also, if you are going to write your own escape function, I think the minimum characters you'll have to escape will also include < and & . They are fatal and at least & would be likely to occur in text. (There are other characters that are typically escaped like > and ' but I think you can get away with skipping them if you are careful.) That would make your function look like this:

def encoded_quotes(s):
    return s.replace("&", "&amp;").replace('"', '&quot;').replace("<", "&lt;")

The order of the methods would be important since you'd want to replace the ampersands first. Otherwise the ampersands in the escape sequences would also get escaped.

I can't see any way of getting around passing into the encoded_quotes function every extension.get(key) that could possibly have any of the banned characters, rather than passing in the entire xml string at the end. The whole point of having to escape those characters is that they can occur in the XML markup, so once you insert the variables into the string, the banned characters are going to be all over the place in the xml string.

I suppose you could just think hard about what fields would be likely to include the banned characters and only pass those fields into your encoded_quotes function (well, maybe change the name to encoded_characters, but honestly, to make it bulletproof, I'd just use the escape function for every data field. xml.sax.saxutils is in the standard library, so it's not like people who run the script are going to have to install it or anything.

tucotuco commented 1 year ago

Fair enough. Thanks. Will update.

tucotuco commented 12 months ago

Update uses escape, but with extra machinations to invoke escapes on double quotes.