scrapinghub / js2xml

Convert Javascript code to an XML document
MIT License

add support for parsing JSON #15

Closed pawelmhm closed 7 years ago

pawelmhm commented 7 years ago

Sometimes I'm dealing with deeply nested JSON and I'd like to query it with XPath. I know I can use JMESPath, but I don't like its syntax; I'm used to XPath. It also seems that some XPath features are not supported by JMESPath, e.g. I can't do recursive traversal, so I can't write //mynode/text(); in JMESPath I have to spell out root.foo.bar[].foo2.bar2.product.mynode.
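For comparison, the recursive traversal described above can be approximated in plain Python without any query language. This is only a minimal sketch: the helper name `find_key` and the sample document are hypothetical, made up for illustration, and it covers just the `//mynode` style lookup, not XPath in general.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key`, at any depth of a parsed JSON value."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: a match may itself contain further matches.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Hypothetical sample document, not taken from the issue.
doc = json.loads('{"foo": {"bar": [{"mynode": "a"}, {"deep": {"mynode": "b"}}]}}')
matches = list(find_key(doc, "mynode"))
print(matches)
```

The generator walks dicts and lists depth-first, so results come back in document order.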

It seems like JSON is currently not supported by js2xml, e.g. something like this:

import js2xml
import json

alfa = json.dumps({"aa": "bb"})
print(alfa)
parsed = js2xml.parse(alfa)
print(js2xml.pretty_print(parsed))

fails.

Is there some way to add JSON parsing support to js2xml? This would probably require converting a Python dictionary to XML. There seem to be packages that already do this, e.g. https://github.com/delfick/python-dict2xml, so maybe we could learn something from them.

Granitosaurus commented 7 years ago

Actually, you can parse JSON with js2xml, but it's really ugly and you need to wrap it in a variable assignment:

from lxml import etree
import js2xml

data = """{
    "one": {
        "two": [{
            "four": {
                "name": "four1_name"
            }
        }, {
            "four": {
                "name": "four2_name"
            }
        }]
    }
}"""
print(etree.tostring(js2xml.parse('var foo = ' + data), pretty_print=True, encoding="unicode"))

Will give this:

<program>
  <var name="foo">
    <object>
      <property name="one">
        <object>
          <property name="two">
            <array>
              <object>
                <property name="four">
                  <object>
                    <property name="name">
                      <string>four1_name</string>
                    </property>
                  </object>
                </property>
              </object>
              <object>
                <property name="four">
                  <object>
                    <property name="name">
                      <string>four2_name</string>
                    </property>
                  </object>
                </property>
              </object>
            </array>
          </property>
        </object>
      </property>
    </object>
  </var>
</program>

which works but is really ugly, because it uses property and object as node names instead of the original names from the JSON. In other words, it parses what is clearly JSON as JavaScript. As Pawel mentioned, there is already a package called dict2xml that does this really well and is pretty simple (about 200 lines of code):

import json
from dict2xml import dict2xml
print(dict2xml(json.loads(data)))

result:

<one>
  <two>
    <four>
      <name>four1_name</name>
    </four>
  </two>
  <two>
    <four>
      <name>four2_name</name>
    </four>
  </two>
</one>

It seems to be quite simple and maybe we could adapt it into js2xml?
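To give a rough idea of what such an adaptation might involve, here is a minimal sketch that builds an XML tree from parsed JSON, using the JSON keys as tag names. It is not dict2xml's or js2xml's code: the names `build` and `json_to_xml` are hypothetical, it uses the standard library's `xml.etree.ElementTree` rather than lxml, it assumes the top-level JSON value is an object, and it assumes every key is a valid XML tag name.

```python
import json
import xml.etree.ElementTree as ET


def build(parent, key, value):
    """Append `value` under `parent`, using the JSON key `key` as the tag name."""
    if isinstance(value, list):
        # Mirror the dict2xml output shown above: repeat the tag once per list item.
        for item in value:
            build(parent, key, item)
    elif isinstance(value, dict):
        elem = ET.SubElement(parent, key)
        for k, v in value.items():
            build(elem, k, v)
    else:
        # Scalars become text nodes; null becomes an empty element.
        ET.SubElement(parent, key).text = "" if value is None else str(value)


def json_to_xml(text, root_tag="root"):
    """Parse a JSON object and return an ElementTree root (hypothetical helper)."""
    root = ET.Element(root_tag)
    for k, v in json.loads(text).items():  # assumes a top-level JSON object
        build(root, k, v)
    return root


sample = '{"one": {"two": [{"four": {"name": "four1_name"}}, {"four": {"name": "four2_name"}}]}}'
tree = json_to_xml(sample)
# ElementTree supports a limited XPath subset; ".//name" is the recursive lookup.
names = [e.text for e in tree.findall(".//name")]
print(names)
```

With lxml instead of the standard library, the same tree could be queried with full XPath expressions such as //name/text().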

Granitosaurus commented 7 years ago

There's also this: https://github.com/quandyfactory/dicttoxml, which seems to be a bit more popular, but it's essentially the same as dict2xml; it just adds types to nodes.

import json
from dicttoxml import dicttoxml
print(etree.tostring(etree.fromstring(dicttoxml(json.loads(data))), pretty_print=True, encoding="unicode"))

<root>
  <one type="dict">
    <two type="list">
      <item type="dict">
        <four type="dict">
          <name type="str">four1_name</name>
        </four>
      </item>
      <item type="dict">
        <four type="dict">
          <name type="str">four2_name</name>
        </four>
      </item>
    </two>
  </one>
</root>

Personally, I don't like that it wraps every list element in an <item> tag, and it seems to have a few more unnecessary quirks like that, but overall it might be a bit more robust than dict2xml. Both packages are worth looking into for this, in my opinion.
Edit: actually, the recent dicttoxml version lets you customize some of this, so you can get rid of the wrapping and the types on nodes.

Granitosaurus commented 7 years ago

Related JMESPath issue for non-rooted expressions: https://github.com/jmespath/jmespath.py/issues/110

redapple commented 7 years ago

@pawelmhm, I think using js2xml to query data from JSON is outside the scope of what this library tries to do.

js2xml is very handy when extracting strings, numbers, JavaScript objects and arrays from assignments and function arguments (when writing regexes for them is tedious).

"XPath for JSON" is kind of a different (yet interesting) use case. You mention JMESPath, but there's also JSONPath and JSONiq.

As @Granitas mentions, the AST-like tree that js2xml outputs for a JSON dict is not that easy to work with, which is why js2xml has methods to convert JavaScript objects to dicts.
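To illustrate what querying that AST-like tree looks like, here is a small sketch. The XML is a trimmed copy of the output shown earlier in this thread, pasted in as a string so the example runs without js2xml; using the standard library's `xml.etree.ElementTree` instead of lxml is an assumption made to keep it self-contained.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the js2xml output for 'var foo = {...}' shown earlier.
ast_xml = """
<program>
  <var name="foo">
    <object>
      <property name="two">
        <array>
          <object>
            <property name="name"><string>four1_name</string></property>
          </object>
          <object>
            <property name="name"><string>four2_name</string></property>
          </object>
        </array>
      </property>
    </object>
  </var>
</program>
"""

root = ET.fromstring(ast_xml)
# The JSON key lives in an attribute and the value in a typed child element,
# so the query has to name both instead of simply asking for //name.
names = [s.text for s in root.findall(".//property[@name='name']/string")]
print(names)
```

The query works, but every lookup has to go through property/@name and a type element such as string, which is the awkwardness being described.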

I would personally leave the querying of data inside dicts out of js2xml.

As for why parsing a JSON object directly, without an assignment, does not work: it has to do with how slimit interprets snippets of code. It may be worth fixing.

redapple commented 7 years ago

Closing this issue, as traversing JSON using XPath is not the purpose of js2xml.