bug: XMLSource doesn't handle mixed content model well

marc-portier commented 2 years ago

This is caused by our use of xmltodict, which does strange things with mixed-content models:

import xmltodict
mix = xmltodict.parse('<mix>before <nested>inside</nested> after</mix>')
xmltodict.unparse(mix, full_document=False)

'<mix><nested>inside</nested>before  after</mix>'

It would be nicest if parse/unparse in https://github.com/martinblech/xmltodict itself could be made to round-trip mixed content cleanly.

marc-portier commented 2 years ago

left suggestion at https://github.com/martinblech/xmltodict/issues/282

laurianvm commented 2 years ago

https://appdividend.com/2020/11/20/how-to-convert-python-string-to-dictionary/ --> tried out ast.literal_eval() - it raises a SyntaxError on line 1. --> maybe generator expressions are an option to build the type of dict suggested? Seems a bit tedious, but then again, the EML does follow a hierarchical structure that the string could be split on?

marc-portier commented 2 years ago

> https://appdividend.com/2020/11/20/how-to-convert-python-string-to-dictionary/ --> tried out ast.literal_eval() - it raises a SyntaxError on line 1. --> maybe generator expressions are an option to build the type of dict suggested? Seems a bit tedious, but then again, the EML does follow a hierarchical structure that the string could be split on?

?? did you try out ast.literal_eval() on the XML string? It is not intended for that imho --> ast.literal_eval() parses strings containing Python literals (such as dicts), not XML content
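To make the distinction concrete, here is a minimal sketch using only the standard library. The dict literal is an illustrative stand-in, not output taken from xmltodict.

```python
import ast

# A string holding a Python dict literal: this is what literal_eval is for.
as_dict = ast.literal_eval("{'mix': {'nested': 'inside'}}")
print(as_dict["mix"]["nested"])

# An XML string is not Python syntax, so parsing it fails before any
# evaluation happens.
xml = "<mix>before <nested>inside</nested> after</mix>"
outcome = None
try:
    ast.literal_eval(xml)
except SyntaxError as err:
    outcome = "SyntaxError"
    print("not a Python literal:", err.msg)
```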

so I think we either need to fix xmltodict (so it keeps the order of text and nested nodes in its OrderedDict construct) or find a better XML-parsing lib that doesn't have this problem

laurianvm commented 2 years ago

lxml elements have a .tail attribute that provides a way to handle mixed content (https://www.py4u.net/discuss/261738) -> might prove to be a better XML-parsing library?
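A small sketch of the text/tail model follows. It uses the standard library's xml.etree.ElementTree rather than lxml, on the assumption that the two expose the same .text and .tail attributes on elements, so the idea carries over.

```python
import xml.etree.ElementTree as ET

xml = "<mix>this <n>1st</n> or <n>2nd</n> that</mix>"
root = ET.fromstring(xml)

# Text before the first child is stored on the parent's .text ...
parts = [root.text]
for child in root:
    parts.append((child.tag, child.text))
    # ... and text that follows a child is stored on that child's .tail.
    parts.append(child.tail)

print(parts)

# Serialising the tree again keeps text and elements interleaved.
roundtrip = ET.tostring(root, encoding="unicode")
print(roundtrip == xml)
```

Because every piece of text is attached to a specific position in the tree, repeated elements do not disturb the order.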

marc-portier commented 2 years ago

coded a fix for the mixed content-model support of xmltodict in branch called bug/282

created pull-request for integration --> martinblech/xmltodict#286

quite unsure how fast this can land and be published; in the meantime we could try pinning the xmltodict dependency to the GitHub branch in requirements.txt:

# xmltodict
-e git+https://github.com/marc-portier/xmltodict.git@bug/282#egg=xmltodict

marc-portier commented 2 years ago

think I have a stopgap, though with one very important known issue

while it now nicely splits up the different #text parts so they are reproduced in place, it still lumps repeated nested elements together

see description at https://github.com/martinblech/xmltodict/issues/282#issuecomment-1006563634

as things are, the intended roundtrip on mixed content will still fail as soon as any nested element is repeated

e.g.

>>> from xmltodict import parse, unparse
>>> xml="<mix>this <n>1st</n> works</mix>"
>>> rt=unparse(parse(xml, mixed_content=True), full_document=False)
>>> print(xml, rt, sep="\n") 
<mix>this <n>1st</n> works</mix>
<mix>this <n>1st</n> works</mix>
>>> xml="<mix>this <n>1st</n> or <n>2nd</n> that</mix>"
>>> rt=unparse(parse(xml, mixed_content=True), full_document=False)
>>> print(xml, rt, sep="\n") 
<mix>this <n>1st</n> or <n>2nd</n> that</mix>
<mix>this <n>1st</n><n>2nd</n> or  that</mix>
>>>
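The underlying limitation can be sketched without xmltodict at all. The helper `to_tag_keyed` below is hypothetical and only a simplified model of a tag-keyed mapping (it is not xmltodict's actual internal structure): once repeated elements are grouped under one key, two different documents can collapse to the same mapping, so no unparse step could tell them apart.

```python
import xml.etree.ElementTree as ET

def to_tag_keyed(xml):
    """Hypothetical model: children grouped per tag, text parts kept in order."""
    root = ET.fromstring(xml)
    mapping = {"#text": [root.text or ""]}
    for child in root:
        # Repeated tags end up in one list; their position among siblings is dropped.
        mapping.setdefault(child.tag, []).append(child.text)
        mapping["#text"].append(child.tail or "")
    return mapping

a = to_tag_keyed("<mix>a <x>1</x> b <y>2</y> c <x>3</x> d</mix>")
b = to_tag_keyed("<mix>a <x>1</x> b <x>3</x> c <y>2</y> d</mix>")

# Different documents, identical mapping: the element order is not recoverable.
print(a == b)
```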

a more in-depth solution is in order

looking at the intended use and behaviour of xmltodict, however, we should probably abandon that route and either:

use it as inspiration to build a custom solution, or
leave SAX and streaming behind and re-address this via a DOM-like, maybe even XPath-based, approach?
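As a rough illustration of the DOM-like direction, the sketch below walks child nodes in document order with the standard library's xml.dom.minidom. The helpers `ordered_children` and `rebuild` are hypothetical names, handle only one level of nesting without attributes, and say nothing about how xmlasdict is actually implemented.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

def ordered_children(xml):
    """Return (name, value) pairs for the root's children, in document order."""
    doc = minidom.parseString(xml)
    items = []
    for node in doc.documentElement.childNodes:
        if node.nodeType == node.TEXT_NODE:
            items.append(("#text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            # Only the direct text of the nested element is kept in this sketch.
            inner = "".join(
                c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE
            )
            items.append((node.tagName, inner))
    return items

def rebuild(tag, items):
    """Serialise the ordered pairs back into a single element."""
    body = "".join(
        escape(value) if name == "#text" else f"<{name}>{escape(value)}</{name}>"
        for name, value in items
    )
    return f"<{tag}>{body}</{tag}>"

xml = "<mix>this <n>1st</n> or <n>2nd</n> that</mix>"
items = ordered_children(xml)
print(items)
print(rebuild("mix", items) == xml)
```

Keeping an ordered sequence of nodes, instead of a mapping keyed by tag name, is what lets repeated elements stay in place.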

marc-portier commented 2 years ago

working towards an alternative via https://github.com/vliz-be-opsci/py-xmlasdict

marc-portier commented 2 years ago

introducing xmlasdict to replace xmltodict through changes 987d731bfc54

marc-portier commented 2 years ago

we have had some time to test now; safe to close. We can open other issues if we need additional features or functionality

vliz-be-opsci / pysubyt

bug: XMLSource doesn't handle mixed content model well #30