Closed tucotuco closed 12 months ago
Ran the output file through a validator and realized that there's a problem with unescaped quotes. In the description attribute for dcterms:bibliographicCitation there are quotes, so they prematurely end the attribute. The remainder of the description then isn't part of the value of an attribute, so the XML ends up malformed.
Possible fix:
from xml.sax.saxutils import escape
xml += f' dc:description="{excape(extension.get("dc:description"))}">\n'
Thanks for pointing out the XML validity problem. I opted for a simple solution that doesn't invoke any additional libraries. Please review and merge if you agree.
That's not working -- it encoded all of the quotes everywhere in the XML:
Also, if you are going to write your own escape function, I think the minimum characters you'll have to escape will also include < and & . They are fatal and at least & would be likely to occur in text. (There are other characters that are typically escaped like > and ' but I think you can get away with skipping them if you are careful.) That would make your function look like this:
def encoded_quotes(s):
return s.replace("&", "&").replace('"', '"').replace("<", "<")
The order of the methods would be important since you'd want to replace the ampersands first. Otherwise the ampersands in the escape sequences would also get escaped.
I can't see any way of getting around passing into the encoded_quotes
function every extension.get(key)
that could possibly have any of the banned characters, rather than passing in the entire xml
string at the end. The whole point of having to escape those characters is that they can occur in the XML markup, so once you insert the variables into the string, the banned characters are going to be all over the place in the xml
string.
I suppose you could just think hard about what fields would be likely to include the banned characters and only pass those fields into your encoded_quotes
function (well, maybe change the name to encoded_characters
, but honestly, to make it bulletproof, I'd just use the escape
function for every data field. xml.sax.saxutils
is in the standard library, so it's not like people who run the script are going to have to install it or anything.
Fair enough. Thanks. Will update.
Update uses escape, but with extra machinations to invoke escapes on double quotes.
Seems to produce viable extension files. Seeking review.