martinblech / xmltodict

Python module that makes working with XML feel like you are working with JSON
MIT License

trailing space is stripped from CDATA #338

Closed: ghost closed this issue 2 months ago

ghost commented 11 months ago

For example, take this XML:

<parent>
        <element><![CDATA[data    ]]></element>  
</parent>

Let's parse it:

import xmltodict

xml = """
<parent>
        <element><![CDATA[data    ]]></element>
</parent>
"""
parsed_xml = xmltodict.parse(xml)
print(repr(parsed_xml['parent']['element']))

Result: 'data'

Expected result: 'data    '

The untangle library parses it correctly: https://pypi.org/project/untangle/

import untangle

obj = untangle.parse(xml)
print(repr(obj.parent.element.cdata))

Result: 'data    '
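To check where the spaces get lost, the sketch below reads the same document with the standard-library expat parser, which xmltodict uses under the hood. The helper name element_text is hypothetical and only serves this illustration; it collects the raw character data of one element without any post-processing.

```python
from xml.parsers import expat

xml_text = """
<parent>
        <element><![CDATA[data    ]]></element>
</parent>
"""


def element_text(document, tag):
    """Return the unstripped character data of the first <tag> element.

    Hypothetical helper for illustration; it does not handle nested
    elements that share the same tag name.
    """
    chunks = []
    state = {"inside": False, "done": False}

    def start(name, attrs):
        if name == tag and not state["done"]:
            state["inside"] = True

    def end(name):
        if name == tag and state["inside"]:
            state["inside"] = False
            state["done"] = True

    def chars(data):
        # expat may deliver text in several pieces, so collect them all.
        if state["inside"]:
            chunks.append(data)

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(document, True)
    return "".join(chunks)


raw_text = element_text(xml_text, "element")
print(repr(raw_text))
```

If the printed value still carries the trailing spaces, the parser itself is delivering them, and the stripping seen with xmltodict happens in a later step on top of the parser.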

ibrahelsheikh commented 11 months ago

Yes, I have the same problem.

afbwilliam commented 2 months ago

I encountered the same problem. The solution is to pass strip_whitespace=False as an optional keyword argument to xmltodict.parse(). For the example above, this should do the trick:

import xmltodict

xml = """
<parent>
        <element><![CDATA[data    ]]></element>
</parent>
"""
parsed_xml = xmltodict.parse(xml, strip_whitespace=False)
print(repr(parsed_xml['parent']['element']))

I discovered this after turning on debugging mode and stepping through the code. It would be nice if xmltodict's user documentation were more thorough, so users don't have to dig into the code to find this option in the first place.
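For a side-by-side view, here is a minimal sketch that parses the same document with the default settings and with strip_whitespace=False. The variable names are illustrative. One thing to watch for, as far as I can tell: with stripping disabled, the whitespace between elements is no longer discarded either, so the parent dictionary may gain an extra text entry alongside 'element'.

```python
import xmltodict

xml_text = """
<parent>
        <element><![CDATA[data    ]]></element>
</parent>
"""

# Default behaviour: text content is stripped.
stripped = xmltodict.parse(xml_text)

# Stripping disabled: trailing spaces inside the CDATA section are kept.
preserved = xmltodict.parse(xml_text, strip_whitespace=False)

print(repr(stripped["parent"]["element"]))
print(repr(preserved["parent"]["element"]))

# Inspect the keys to see whether inter-element whitespace was kept as text.
print(sorted(preserved["parent"]))
```

If the extra whitespace entry gets in the way, accessing only the keys you need (here 'element') avoids it without giving up the preserved CDATA content.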