martinblech / xmltodict

Python module that makes working with XML feel like you are working with JSON
MIT License
5.49k stars 462 forks

parse-unparse does not roundtrip on mixed content model #282

Closed marc-portier closed 2 years ago

marc-portier commented 2 years ago

to simply reproduce:

import xmltodict
mix = xmltodict.parse('<mix>before <nested>inside</nested> after</mix>')
xmltodict.unparse(mix, full_document=False)

'<mix><nested>inside</nested>before  after</mix>'

the text before and after the nested element somehow gets joined into one '#text' node, and it ends up after the nested element on unparse
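A toy model of the behaviour described above: an event-based parser that keeps a single text buffer per element has no way to remember where the nested element sat between the text chunks. This is a simplified sketch on top of the standard library's expat bindings, not xmltodict's actual code; `parse_joined` is a hypothetical name used only for illustration.

```python
import xml.parsers.expat


def parse_joined(xml_text):
    """Toy event-based parser: one text buffer per open element (hypothetical)."""
    # Each stack entry is (tag name, children dict, list of text chunks).
    stack = [("", {}, [])]

    def start(name, attrs):
        stack.append((name, {}, []))

    def chars(data):
        # Every chunk lands in the same buffer, whatever its position.
        stack[-1][2].append(data)

    def end(name):
        _, children, chunks = stack.pop()
        text = "".join(chunks).strip()
        if children:
            value = dict(children)
            if text:
                value["#text"] = text
        else:
            value = text
        stack[-1][1][name] = value

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return stack[0][1]


result = parse_joined('<mix>before <nested>inside</nested> after</mix>')
print(result)  # {'mix': {'nested': 'inside', '#text': 'before  after'}}
```

The two text chunks are concatenated into one value (hence the double space), so nothing in the resulting dict records that `nested` originally sat between them.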

javadev commented 2 years ago

It may be converted to this JSON:

{
  "mix": {
    "#text": "before ",
    "nested": "inside",
    "#text1": " after"
  },
  "#omit-xml-declaration": "yes"
}

marc-portier commented 2 years ago

@javadev thx for the suggestion, it made a great starting point for attempting a fix in #286

however, after dealing with pesky whitespace issues I've now come to realize that roundtripping has more challenges, as this "aggregating values into lists" is rather native to how xmltodict works.

considering this next step case:

<mix>before <nested>1st</nested> between <nested>2nd</nested> after</mix>

we will be hard pressed to either 1/ produce this structure:

{
  "mix": {
    "#text": "before ",
    "nested": "1st",
    "#text1": " between ",
    "nested": "2nd",
    "#text2": " after"
  }
}

(note that duplicate keys are in fact allowed in json spec to support streaming use cases «a silly fun fact few people realise» but they will surely not fly in python-dict-land)

or to 2/ guarantee the round-trip from what this currently produces:

{
  "mix": {
    "#text": "before ",
    "nested": ["1st", "2nd"],
    "#text1": " between ",
    "#text2": " after"
  }
}
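To illustrate why the round-trip in 2/ cannot be guaranteed, here is a minimal sketch of a converter that numbers text chunks and aggregates repeated tags into lists. `aggregate` is a hypothetical helper written against the standard library's ElementTree, and its numbering scheme (skip empty chunks, count the rest) is an assumption; it is not the code from #286 or from xmltodict.

```python
import xml.etree.ElementTree as ET


def aggregate(elem):
    """Hypothetical converter: numbered '#text' keys, repeated tags become lists."""
    if len(elem) == 0:
        return elem.text or ""
    out = {}
    for child in elem:
        value = aggregate(child)
        if child.tag in out:
            # A repeated tag is folded into a list, losing its position.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    # Text chunks: the text before the first child, then each child's tail.
    chunks = [elem.text] + [child.tail for child in elem]
    n = 0
    for chunk in chunks:
        if chunk:
            out["#text" if n == 0 else f"#text{n}"] = chunk
            n += 1
    return out


doc_a = "<mix>before <nested>1st</nested> between <b>x</b><nested>2nd</nested> after</mix>"
doc_b = "<mix>before <nested>1st</nested><b>x</b> between <nested>2nd</nested> after</mix>"

dict_a = aggregate(ET.fromstring(doc_a))
dict_b = aggregate(ET.fromstring(doc_b))
print(dict_a == dict_b)  # True
```

Two different documents collapse into the same dict, so no unparse step can know which of them to restore.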

all in all this starts to look like more than xmltodict was designed for...
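On the duplicate-key remark for structure 1/: a small check with Python's standard `json` module shows what happens in practice. By default the last value for a repeated key wins; the order and the duplicates only survive when the pairs are kept as a list via `object_pairs_hook`. The JSON string is the structure from 1/ written on one line.

```python
import json

doc = (
    '{"mix": {"#text": "before ", "nested": "1st", "#text1": " between ", '
    '"nested": "2nd", "#text2": " after"}}'
)

# Default decoding into dicts: the first "nested" value is silently dropped.
as_dict = json.loads(doc)
print(as_dict["mix"]["nested"])  # 2nd

# Keeping every object as a list of (key, value) pairs preserves order and duplicates.
as_pairs = json.loads(doc, object_pairs_hook=list)
print(as_pairs[0][1])
```

So structure 1/ can be carried as JSON text, but it would have to live in something other than a plain dict on the Python side.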

javadev commented 2 years ago

<mix>before <nested>1st</nested> between <nested>2nd</nested> after</mix>

may be converted to this JSON:

{
  "mix": {
    "#text": "before ",
    "nested": [
      "1st",
      {
        "#item": {
          "#text1": " between "
        }
      },
      "2nd"
    ],
    "#text2": " after"
  },
  "#omit-xml-declaration": "yes"
}

marc-portier commented 2 years ago

possibly. yet somehow this reads as more proof of the claim "more than xmltodict was designed for"?

javadev commented 2 years ago

I wonder why xmltodict is so popular. We have an alternative in Java.

mpf82 commented 2 years ago

I wonder why xmltodict is so popular. We have an alternative in Java.

An "alternative in Java" is pretty useless for a Python developer.

marc-portier commented 2 years ago

switched gears on this and created https://github.com/vliz-be-opsci/py-xmlasdict. the different approach there is to work with the native ElementTree (not SAX) and keep the element tree in memory, wrapped so as to make it (somewhat) behave like a dict()
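As a rough illustration of why keeping the element tree in memory helps, the standard library's `xml.etree.ElementTree` already stores mixed content in order through each element's `text` and `tail` attributes. The sketch below uses only the standard library and does not show the py-xmlasdict API; `mixed_content` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

source = "<mix>before <nested>1st</nested> between <nested>2nd</nested> after</mix>"
root = ET.fromstring(source)


def mixed_content(elem):
    """Hypothetical helper: text chunks and child elements in document order."""
    parts = [elem.text]
    for child in elem:
        parts.append((child.tag, child.text))
        parts.append(child.tail)  # text that follows this child
    return parts


print(mixed_content(root))
# ['before ', ('nested', '1st'), ' between ', ('nested', '2nd'), ' after']

# Serializing the untouched tree gives back the original document.
roundtrip = ET.tostring(root, encoding="unicode")
print(roundtrip == source)  # True
```

Because the position of every text chunk stays attached to the tree, a dict-like wrapper around it can offer convenient lookups without giving up the round-trip.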

closing this issue with "wontfix" as