r-lib / xml2

Bindings to libxml2
https://xml2.r-lib.org/

Unicode control codes not supported #383

Closed: jmendys closed this issue 1 year ago

jmendys commented 1 year ago

Many ASCII characters supplied in an XML file as character references (&...) cause an "xmlParseCharRef: invalid xmlChar" error.

Steps to reproduce:

  1. Create a sample file like: <?xml version="1.0" encoding="UTF-8"?>
  2. Open the file with xml2::read_xml().

Expected behavior:

  1. The file is opened without errors.
  2. The parsed document contains the encoded character.

Actual behavior:

  1. The load fails with an error "xmlParseCharRef: invalid xmlChar"
  2. The input XML file remains locked (I can't delete or modify it) until the R session is terminated or restarted
  3. After multiple attempts without restarting the R session, RStudio eventually crashes.

Note: Some (very few) characters entered like this work fine. For example: < or   or

hadley commented 1 year ago

The sample file you suggested didn't link, so I can't reproduce this. Would you mind including the XML inline?

jmendys commented 1 year ago

Hi, it is not a link but inline XML. Let me try again:

<?xml version="1.0" encoding="UTF-8"?><body>&#x1B;</body>

hadley commented 1 year ago

I'm pretty sure that's a control code and isn't valid XML: https://www.w3.org/International/questions/qa-controls

jmendys commented 1 year ago

Hi, yes, it is a control code. I have been using it in XSLT stylesheets to control text output. I don't have enough knowledge to judge from the linked article whether control codes can be used in XML (or in XSLT in particular). I will rely on your judgment.

hadley commented 1 year ago

They're not supported in XML.
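
For readers who land here with the same error, the rule behind it can be sketched in a few lines. This is an illustration only, not part of xml2: it is written in Python rather than R, and the helper names `is_xml10_char` and `parses` are made up for this example. It encodes the `Char` production from the XML 1.0 specification, which is what a character reference such as `&#x1B;` has to satisfy, and then checks the sample from this thread against Python's standard-library parser (expat, not libxml2, so the error message differs from the one reported above).

```python
import xml.etree.ElementTree as ET


def is_xml10_char(code_point: int) -> bool:
    """Return True if code_point matches the XML 1.0 `Char` production.

    Allowed: tab, line feed, carriage return, and the ranges
    U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
    """
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def parses(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts xml_text."""
    try:
        # Encode to bytes so the parser honours the encoding declaration.
        ET.fromstring(xml_text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


# The sample document posted earlier in this thread (ESC, U+001B).
SAMPLE = '<?xml version="1.0" encoding="UTF-8"?><body>&#x1B;</body>'

if __name__ == "__main__":
    for cp in (0x09, 0x1B, 0x3C):
        print(f"U+{cp:04X} allowed in XML 1.0: {is_xml10_char(cp)}")
    print("sample parses:", parses(SAMPLE))
```

Under this rule, the only C0 control characters that can appear in an XML 1.0 document, even as character references, are tab, line feed, and carriage return. That is consistent with the note above that very few references of this kind work.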