martinblech / xmltodict

Python module that makes working with XML feel like you are working with JSON
MIT License
5.49k stars 462 forks

something is wrong with html escaping #317

Open ignis32 opened 1 year ago

ignis32 commented 1 year ago

Code:

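import xmltodict

# XML whose text contains a numeric character reference for an en dash (U+2013).
test_xml = """<?xml version="1.0" encoding="utf-8"?>
<a> afafafa &#8211; </a>
"""
print(test_xml)
# Round trip: parse the XML into a dict, then serialize it back to XML.
print(xmltodict.unparse(xmltodict.parse(test_xml)))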

Output:

<?xml version="1.0" encoding="utf-8"?>
<a> afafafa &#8211; </a>

<?xml version="1.0" encoding="utf-8"?>
<a>afafafa –</a>

Basically, the numeric character reference for the en dash (&#8211;) is unescaped when parsed but is not escaped again when unparsed, so the result differs from what I expected. (I expected the output of unparse to be the same as the original test_xml string.)

P.S. Thank you for the lib, I found it very handy for my needs.

bfontaine commented 1 year ago

See https://github.com/martinblech/xmltodict/issues/178.
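
For readers hitting the same behavior, here is a minimal sketch of why the two outputs differ in spelling but not in content, and one possible way to get the escaped form back. It uses only the standard library so it runs without xmltodict. The `serialized` string is a stand-in for the `unparse()` output shown above, and the re-escaping step is an assumed workaround, not something the library is confirmed to do or recommend.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same element content: a numeric character
# reference and the literal en dash (U+2013).
with_reference = "<a> afafafa &#8211; </a>"
with_literal = "<a> afafafa \u2013 </a>"

# An XML parser resolves the reference, so both parse to identical text.
text_from_reference = ET.fromstring(with_reference).text
text_from_literal = ET.fromstring(with_literal).text
same_text = text_from_reference == text_from_literal

# Assumed workaround: re-escape non-ASCII characters after serializing.
# The "xmlcharrefreplace" error handler turns each one into &#NNNN;.
serialized = "<a>afafafa \u2013</a>"  # stand-in for the unparse() output
ascii_only = serialized.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(same_text)
print(ascii_only)
```

Because the parser has already replaced the reference with the character itself, the parsed data carries no record of how the character was originally written. That is why a serializer cannot reproduce the original spelling on its own. Applying the same `.encode("ascii", "xmlcharrefreplace")` call to the string returned by `xmltodict.unparse(...)` should give ASCII-only output, under the assumption that `unparse` returns a `str` when no output stream is passed.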