tefra / xsdata

Naive XML & JSON Bindings for python
https://xsdata.readthedocs.io
MIT License
336 stars 61 forks

XMLParser does not create objects as expected #437

Closed ansFourtyTwo closed 3 years ago

ansFourtyTwo commented 3 years ago

Hi @tefra ,

I am having trouble parsing XML strings into objects of the model classes generated by xsdata. The generated models are attached.

To give an example, an XML string could look like this:

<div>
   <p>Title: My title</p>
   <p>Name: ansFourtyTwo</p>
</div>

I parse the XML string into the XhtmlDivType class from the module w3_org_1999_xhtml.py with:

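from xsdata.formats.dataclass.context import XmlContext
from xsdata.formats.dataclass.parsers import XmlParser

context = XmlContext()
parser = XmlParser(context=context)

xhtml_div = parser.from_string(xhtml_source, XhtmlDivType)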

I do get an object of type XhtmlDivType, but its contents are not what I expected.

I would expect xhtml_div.p to be a non-empty list, i.e.:

xhtml_div.p 
# <class 'list'>: [XhtmlPType( ... ), XhtmlPType( ...)] 

What I actually find is an empty list for xhtml_div.p and a list of AnyElements within xhtml_div.content:

xhtml_div.p
# <class 'list'>: []

xhtml_div.content
# <class 'list'>: [AnyElement(qname='p', text='Title: My title', tail=None, children=[], attributes={}), AnyElement(qname='p', text='Name: ansFourtyTwo', tail=None, children=[], attributes={})]

Is it somehow possible for the parser to automatically bind everything to the right data class? Am I missing something here? Could it be a namespace problem?

All the best, ans = 42

reqif.zip

tefra commented 3 years ago

First of all, yes, the namespace is wrong. For XhtmlPType to be used during binding, your input should look like this:

<div xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:p>Title: My title</xhtml:p>
   <xhtml:p>Name: ansFourtyTwo</xhtml:p>
</div>

But that won't do exactly what you want. Instead, you will get this structure with the latest xsdata:

        from reqif.models import ReqIf, XhtmlDivType, XhtmlPType
        from xsdata.formats.dataclass.models.generics import DerivedElement

        XhtmlDivType(
            content=[
                DerivedElement(
                    qname="{http://www.w3.org/1999/xhtml}p",
                    value=XhtmlPType(content=["Title: My title"]),
                    substituted=False,
                ),
                DerivedElement(
                    qname="{http://www.w3.org/1999/xhtml}p",
                    value=XhtmlPType(content=["Name: ansFourtyTwo"]),
                    substituted=False,
                ),
            ]
        )
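
The namespace point can be checked with the standard library alone. This is a minimal sketch using xml.etree.ElementTree rather than xsdata, and the helper name child_tags is made up for illustration. It shows only that an unprefixed p and a prefixed xhtml:p resolve to different qualified names, which is why a binding that expects the XHTML namespace does not match the plain p elements.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Input as in the original question: no namespace on the children.
unqualified = "<div><p>Title: My title</p></div>"

# Input with the children bound to the XHTML namespace.
qualified = (
    f'<div xmlns:xhtml="{XHTML_NS}">'
    "<xhtml:p>Title: My title</xhtml:p>"
    "</div>"
)


def child_tags(xml_source: str) -> list[str]:
    """Return the qualified names of the root element's direct children."""
    return [child.tag for child in ET.fromstring(xml_source)]


print(child_tags(unqualified))  # ['p']
print(child_tags(qualified))    # ['{http://www.w3.org/1999/xhtml}p']
```

The second form is the same qname notation that appears in the DerivedElement entries above.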

The complexType xhtml.div.type supports mixed content, which basically means the element supports tail content and elements in any possible order. To handle these cases, xsdata adds the content wildcard field, which in a sense absorbs all child elements so that roundtrip binding operations match, e.g. xml -> python -> xml.

Example:

<div xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:p>Title: My title</xhtml:p>Tail
   <xhtml:p><xhtml:small>Name:</xhtml:small> ansFourtyTwo</xhtml:p>
</div>

XhtmlDivType(
    content=[
        DerivedElement(
            qname="{http://www.w3.org/1999/xhtml}p",
            value=XhtmlPType(content=["Title: My title", "Tail\n   "]),
            substituted=False,
        ),
        DerivedElement(
            qname="{http://www.w3.org/1999/xhtml}p",
            value=XhtmlPType(
                content=[
                    DerivedElement(
                        qname="{http://www.w3.org/1999/xhtml}small",
                        value=XhtmlInlPresType(content=["Name:", " ansFourtyTwo"]),
                        substituted=False,
                    )
                ],
            ),
            substituted=False,
        ),
    ],
)
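
To see why mixed content needs a single ordered list, here is a small sketch with the standard library's xml.etree.ElementTree (not xsdata, and the function name mixed_content is hypothetical). It walks an element and records text fragments and child elements in document order. It illustrates the idea only: the way xsdata itself distributes text and tail values across content lists is the one shown in the output above.

```python
import xml.etree.ElementTree as ET

source = "<div><p>Title</p>Tail<p><small>Name:</small> ansFourtyTwo</p></div>"


def mixed_content(element):
    """Return text fragments and (tag, children) tuples in document order."""
    items = []
    if element.text:
        items.append(element.text)  # text before the first child
    for child in element:
        items.append((child.tag, mixed_content(child)))
        if child.tail:
            items.append(child.tail)  # text that follows this child
    return items


result = mixed_content(ET.fromstring(source))
print(result)
# [('p', ['Title']), 'Tail', ('p', [('small', ['Name:']), ' ansFourtyTwo'])]
```

Separate typed fields such as p could not record that "Tail" sits between the two paragraphs, which is the information a roundtrip has to preserve.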

The DerivedElement is an internal generic object used to handle type substitution cases. The whole behavior is a bit awkward; xsdata's handling was modeled after Java and JAXB because I honestly couldn't come up with a better solution.

I've been thinking for some time now that, for mixed content models, the generator should remove all other fields in order to avoid some of the confusion. Maybe it's time to actually do it, and to write better documentation for mixed content in general.
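
If what you need is the list of paragraphs the question asked for, one option is to filter the content list yourself. The sketch below is self-contained and uses simplified stand-in dataclasses that mirror only the fields visible in the output above. They are not the real xsdata or generated classes, and children_of_type is a hypothetical helper, not part of xsdata.

```python
from dataclasses import dataclass, field
from typing import Any, List, Type, TypeVar

T = TypeVar("T")


# Stand-ins for illustration only; the real classes come from xsdata
# and from the generated models.
@dataclass
class DerivedElement:
    qname: str
    value: Any
    substituted: bool = False


@dataclass
class XhtmlPType:
    content: List[Any] = field(default_factory=list)


@dataclass
class XhtmlDivType:
    content: List[Any] = field(default_factory=list)


def children_of_type(content: List[Any], cls: Type[T]) -> List[T]:
    """Collect items of type cls, unwrapping objects that carry a .value."""
    found = []
    for item in content:
        value = getattr(item, "value", item)  # plain strings have no .value
        if isinstance(value, cls):
            found.append(value)
    return found


NS = "{http://www.w3.org/1999/xhtml}"
div = XhtmlDivType(
    content=[
        DerivedElement(qname=NS + "p", value=XhtmlPType(content=["Title: My title"])),
        "Tail",
        DerivedElement(qname=NS + "p", value=XhtmlPType(content=["Name: ansFourtyTwo"])),
    ]
)

paragraphs = children_of_type(div.content, XhtmlPType)
print([p.content for p in paragraphs])
# [['Title: My title'], ['Name: ansFourtyTwo']]
```

Filtering by type keeps the original content list untouched, so serializing the object back to XML is unaffected.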

ansFourtyTwo commented 3 years ago

Hi @tefra ,

thank you for pointing this out. It is a lot clearer to me now.

All the best, ans = 42

tefra commented 3 years ago

Thank you for the reminder, I need some examples in the documentation!

Let me know if you have any ideas on how to improve mixed content handling.