vliz-be-opsci / pysubyt

python module for Linked Data production (aka semantic uplifting) through Templating
MIT License

bug: XMLSource doesn't handle mixed content model well #30

Closed marc-portier closed 2 years ago

marc-portier commented 2 years ago

This is due to our use of xmltodict, which does strange things with a mixed content model:

import xmltodict
mix = xmltodict.parse('<mix>before <nested>inside</nested> after</mix>')
xmltodict.unparse(mix, full_document=False)

'<mix><nested>inside</nested>before  after</mix>'

It would be nicest if parse/unparse in https://github.com/martinblech/xmltodict could be made to round-trip a mixed content model gracefully.

marc-portier commented 2 years ago

left suggestion at https://github.com/martinblech/xmltodict/issues/282

laurianvm commented 2 years ago

tried out ast.literal_eval() as described at https://appdividend.com/2020/11/20/how-to-convert-python-string-to-dictionary/ but it raises a SyntaxError on line 1. Maybe generator expressions are an option to build the type of dict suggested? That seems a bit tedious, but then again, the EML does follow a hierarchical structure that the string could be split on.

marc-portier commented 2 years ago

?? did you try out ast.literal_eval() on the xml string? It is not intended for that imho: ast.literal_eval() parses strings that contain Python literals (such as a dict written out in Python code), not xml content

so I think we either need to fix xmltodict (so that it keeps the order of text and nested nodes in its ordered dict construct) or find a better xml-parsing lib that doesn't have this problem
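
To illustrate the ast.literal_eval() point, here is a minimal sketch using only the standard library. The sample strings are made up for illustration: a string holding a Python dict literal parses fine, while the xml string from this issue is rejected with a SyntaxError.

```python
import ast

# A string that holds a Python literal is what literal_eval is meant for.
parsed = ast.literal_eval("{'mix': {'nested': 'inside'}}")
print(parsed)

# An xml string is not valid Python syntax, so parsing it fails.
raised = False
try:
    ast.literal_eval("<mix>before <nested>inside</nested> after</mix>")
except SyntaxError as err:
    raised = True
    print("SyntaxError:", err.msg)
```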

laurianvm commented 2 years ago

lxml elements have a .tail attribute that provides a way to handle mixed content (https://www.py4u.net/discuss/261738), so lxml might prove to be a better xml-parsing library?
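
A small sketch of the .text / .tail model, written against the standard library's xml.etree.ElementTree rather than lxml (an assumption made to keep the example dependency-free; lxml elements expose .text and .tail in the same way). The text before a nested element lives in the parent's .text, and the text after it lives in the nested element's .tail, so the order is not lost.

```python
import xml.etree.ElementTree as ET

xml = "<mix>before <nested>inside</nested> after</mix>"
root = ET.fromstring(xml)
nested = root[0]

print(repr(root.text))    # text before the nested element
print(repr(nested.text))  # text inside the nested element
print(repr(nested.tail))  # text that follows the nested element

# Serialising again keeps text and elements in their original order.
roundtrip = ET.tostring(root, encoding="unicode")
print(roundtrip)
```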

marc-portier commented 2 years ago

coded a fix for the mixed content-model support of xmltodict in branch called bug/282

created pull-request for integration --> martinblech/xmltodict#286

quite unsure how fast this can land and be published; in the meantime we could try pinning the xmltodict dependency to the github branch in the requirements.txt

# xmltodict
-e git+https://github.com/marc-portier/xmltodict.git@bug/282#egg=xmltodict

marc-portier commented 2 years ago

think I have a stopgap, though with a very important known issue

while it now nicely splits up the different #text parts and reproduces them in place, it still fails by globbing together repeated nested elements

see description at https://github.com/martinblech/xmltodict/issues/282#issuecomment-1006563634

as things are, the intended roundtrip on a mixed content model will still fail as soon as any nested element is repeated

e.g.

>>> from xmltodict import parse, unparse
>>> xml="<mix>this <n>1st</n> works</mix>"
>>> rt=unparse(parse(xml, mixed_content=True), full_document=False)
>>> print(xml, rt, sep="\n") 
<mix>this <n>1st</n> works</mix>
<mix>this <n>1st</n> works</mix>
>>> xml="<mix>this <n>1st</n> or <n>2nd</n> that</mix>"
>>> rt=unparse(parse(xml, mixed_content=True), full_document=False)
>>> print(xml, rt, sep="\n") 
<mix>this <n>1st</n> or <n>2nd</n> that</mix>
<mix>this <n>1st</n><n>2nd</n> or  that</mix>
>>> 
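
For comparison, a rough sketch of a structure that would survive this case: instead of a dict keyed by tag name, keep the children as a list in document order. This is only an illustration built on the standard library's xml.etree.ElementTree; the helper names to_ordered and to_xml are hypothetical, attributes are ignored, and it is not how xmltodict works internally.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ordered(elem):
    """Return (tag, children); children is a list, in document order,
    of text chunks (str) and nested (tag, children) tuples."""
    children = []
    if elem.text:
        children.append(elem.text)
    for child in elem:
        children.append(to_ordered(child))
        if child.tail:
            children.append(child.tail)
    return (elem.tag, children)


def to_xml(node):
    """Serialise a (tag, children) tuple back to xml (attributes ignored)."""
    tag, children = node
    body = "".join(
        escape(c) if isinstance(c, str) else to_xml(c) for c in children
    )
    return f"<{tag}>{body}</{tag}>"


xml = "<mix>this <n>1st</n> or <n>2nd</n> that</mix>"
tree = to_ordered(ET.fromstring(xml))
print(tree)
print(to_xml(tree))
```

Because the repeated n elements stay separate list entries, the text between them keeps its position.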

a more in-depth solution is in order,

looking at the intended use and behaviour of xmltodict however we should probably abandon that route and either

marc-portier commented 2 years ago

working towards an alternative via https://github.com/vliz-be-opsci/py-xmlasdict

marc-portier commented 2 years ago

introducing xmlasdict to replace xmltodict through changes 987d731bfc54

marc-portier commented 2 years ago

we have had some time to test now, so this is safe to close; we can open other issues if we need additional features / functionality