Closed marc-portier closed 2 years ago
left suggestion at https://github.com/martinblech/xmltodict/issues/282
https://appdividend.com/2020/11/20/how-to-convert-python-string-to-dictionary/ --> tried out ast.literal_eval() - raises a SyntaxError at line 1 --> maybe generator expressions are an option to build the type of dict suggested? Seems a bit tedious, but then again, the EML does follow a hierarchical structure that could be used to split the string on?
?? did you try out ast.literal_eval() on the xml string? it is not intended for that imho --> ast.literal_eval() is there to parse strings holding Python literals (such as dicts written as Python code), not xml content
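To illustrate the point about ast.literal_eval(), here is a minimal sketch. The sample strings are made up for this example; only the standard library is used.

```python
import ast

# A string holding a Python literal: this is what ast.literal_eval is meant for.
py_literal = "{'mix': {'n': '1st'}}"
parsed = ast.literal_eval(py_literal)

# An XML string is not Python syntax, so parsing it fails right away.
xml = "<mix>this <n>1st</n> works</mix>"
try:
    ast.literal_eval(xml)
    xml_error = None
except SyntaxError as exc:
    xml_error = exc

print(parsed)
print(type(xml_error).__name__)
```

So the SyntaxError reported above is the expected behaviour, not a sign of broken input.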
so I think we either need to fix xmltodict (to keep the order of text and nested nodes in its OrderedDict construct) or find a better xml-parsing lib that doesn't have this problem
lxml has a .tail attribute on elements that provides a way to handle mixed content (https://www.py4u.net/discuss/261738) -> might prove to be a better xml-parsing library?
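A rough sketch of how the text/tail model keeps mixed content in document order. It uses the standard library's xml.etree.ElementTree, which exposes the same .text and .tail attributes as lxml, so nothing extra needs installing. The helper name mixed_parts and the "#text" marker are just choices made for this example.

```python
import xml.etree.ElementTree as ET


def mixed_parts(elem):
    """Return the direct content of elem as ordered (kind, value) pairs.

    kind is "#text" for text chunks, or the tag name for child elements.
    Only one level of nesting is handled, to keep the sketch short.
    """
    parts = []
    if elem.text:
        parts.append(("#text", elem.text))
    for child in elem:
        parts.append((child.tag, child.text))
        # .tail is the text that follows the child's closing tag
        if child.tail:
            parts.append(("#text", child.tail))
    return parts


root = ET.fromstring("<mix>this <n>1st</n> or <n>2nd</n> that</mix>")
parts = mixed_parts(root)
for part in parts:
    print(part)
```

Because each text chunk is attached to the element it follows, the interleaving of text and repeated child elements is preserved.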
coded a fix for the mixed content-model support of xmltodict
in branch called bug/282
created pull-request for integration --> martinblech/xmltodict#286
quite unsure how fast this can land and be published; in the meantime we could try pinning the xmltodict dependency to the github branch --> in the requirements.txt
# xmltodict
-e git+https://github.com/marc-portier/xmltodict.git@bug/282#egg=xmltodict
think I have a stopgap, with one very important known issue however:
while it now nicely splits up the different #text parts and reproduces them in place, it still globs repeated nested elements together
see description at https://github.com/martinblech/xmltodict/issues/282#issuecomment-1006563634
as things are, the intended roundtrip on a mixed-content model will still fail as soon as any nested element gets repeated
e.g.
>>> from xmltodict import parse, unparse
>>> xml="<mix>this <n>1st</n> works</mix>"
>>> rt=unparse(parse(xml, mixed_content=True), full_document=False)
>>> print(xml, rt, sep="\n")
<mix>this <n>1st</n> works</mix>
<mix>this <n>1st</n> works</mix>
>>> xml="<mix>this <n>1st</n> or <n>2nd</n> that</mix>"
>>> rt=unparse(parse(xml, mixed_content=True), full_document=False)
>>> print(xml, rt, sep="\n")
<mix>this <n>1st</n> or <n>2nd</n> that</mix>
<mix>this <n>1st</n><n>2nd</n> or that</mix>
>>>
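A small sketch of why a structure keyed by element name cannot roundtrip repeated elements in mixed content. This is NOT the xmltodict implementation and does not reproduce its exact output; the pair list, group_by_key and serialize are hypothetical helpers for this example, and serialize does no XML escaping.

```python
# Ordered content of <mix>this <n>1st</n> or <n>2nd</n> that</mix>
parts = [("#text", "this "), ("n", "1st"), ("#text", " or "),
         ("n", "2nd"), ("#text", " that")]


def group_by_key(pairs):
    """Collect values per key, the way a dict keyed by tag name would."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


def serialize(pairs):
    """Write (kind, value) pairs back out as an XML fragment (no escaping)."""
    return "".join(
        value if key == "#text" else f"<{key}>{value}</{key}>"
        for key, value in pairs
    )


grouped = group_by_key(parts)
# Flattening the grouped form loses the original interleaving.
flattened = [(key, v) for key, values in grouped.items() for v in values]

print(serialize(parts))
print(serialize(flattened))
```

Once both n elements sit under a single key, their position relative to the text chunks is gone, so keeping the order needs a sequence-like representation instead of a plain mapping.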
a more in-depth solution is in order,
looking at the intended use and behaviour of xmltodict however, we should probably abandon that route
working towards an alternative via https://github.com/vliz-be-opsci/py-xmlasdict
introducing xmlasdict to replace xmltodict through changes 987d731bfc54
we had some time now to test, so this is safe to close; we can open other issues if we need additional features / functionality
This is due to us using xmltodict, which does strange things with the mixed-content model
Would be nicest if the parse/unparse on the level of https://github.com/martinblech/xmltodict could be made to gently roundtrip on mixed content model.