pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: to_xml, read_xml #27554

Closed codeman101 closed 3 years ago

codeman101 commented 5 years ago

I'm sure this has been asked before but I can't find it. Why doesn't pandas have a method called read_xml?

WillAyd commented 5 years ago

This was discussed in #4734 but never got anywhere. Are there any standards for representing / translating tabular data to/from XML?

codeman101 commented 5 years ago

@WillAyd

I don't know if there are any standards per se. I've used pandas to read CSV and JSON and have wondered ever since why XML isn't supported, since it's also a very popular format. I always figured it was because parsing XML requires knowing certain things about the data in advance: every example I've seen of using Python's standard XML parser to load data into a pandas DataFrame seems to require knowing the names of certain elements (or at least more than just the filename). But a friend of mine recently told me that parsing XML is no harder than parsing JSON, which is why I decided to ask.

WillAyd commented 5 years ago

I'm not aware of any major objections (others can certainly chime in) but would just need research and a PR from the community

codeman101 commented 5 years ago

@WillAyd

That would be great. I use pandas for work and would have benefited from this feature; if it can be added, I'd gladly use it in my code.

lithomas1 commented 4 years ago

I can look into this.

minchulkim87 commented 4 years ago

Hi, I have to deal with XML data for work, and I love pandas. A pd.read_xml() would be amazing. I needed something quickly, so I made this: https://github.com/minchulkim87/pandas_read_xml. I'm not advanced enough to contribute to something as awesome as pandas, but if it is of any use, then the work has been worth it.

ParfaitG commented 3 years ago

Has there been any progress on this ENH? While there is no single standard for tabular XML, many public and private data dumps, repositories, and APIs that pandas users interact with use shallower structures that can migrate to two-dimensional (row by column) data frames. These flatter, one-level-nested XML files should be able to migrate to data frames much like HTML tables do with the existing I/O solution, read_html. Additionally, JSON also has no standardized tabular format, yet pandas implements read_json, json_normalize, and even to_json. Moreover, Python ships with xml.etree.ElementTree (like json) in its standard library, with C-accelerated versions, so built-in support for handling XML files already exists.

Other database and data science solutions already provide functionality for migrating XML documents to tables or datasets.

I have answered many Stack Overflow questions where OPs regularly need a tool to migrate XML to data frames, albeit with varying levels of nesting in the documents. Surely by pandas 2.0 we can extend its rich IO tools to XML files.


While I would endeavor to take on this PR myself, the pandas source code, object-oriented formatting and signatures, building from source in a virtualenv, and testing are quite intricate, and my time at the moment is limited. Is someone available (maybe the authors of pandas/io/html.py) to help with this ENH PR?

Consider the generalized versions of read_xml and write_xml below, which handle default namespaces and migrate both node and attribute data using Python's standard library. The example docs vary in nesting level. (I show an LXML version of write_xml for pretty-printed output; I understand read_html interfaces with lxml if installed.) The url check is likely unnecessary given pandas' own URL detection. Docs can emphasize that this IO method works best with shallower XML files, with instructions on how to parse documents with default namespaces (i.e., define a prefix).

import pandas as pd
import xml.etree.ElementTree as et    # PYTHON STANDARD LIBRARY
from urllib import request            # PYTHON STANDARD LIBRARY

import lxml.etree as lxet             # THIRD-PARTY LIBRARY

def read_xml(io, nodes, namespace=None, url=False):
    # URL
    if url:
        rq = request.urlopen(io)
        xtree = et.fromstring(rq.read())
    else:
        # FILE
        xtree = et.parse(io)

    # LIST OF MERGED DICTIONARIES WITH TERNARY OPERATORS FOR NAMESPACE AND EMPTY-TEXT HANDLING
    data = [{ **{(r.tag.split('}')[1] if namespace else r.tag):
                 (r.text.strip() if r.text and r.text.strip() else None)
                    for r in row.findall('*') }, **row.attrib
            } for row in xtree.findall(nodes, namespace)]

    # PASS PARSED DATA TO DataFrame CONSTRUCTOR
    return pd.DataFrame(data)

def write_xml(df, io):
    df_dict = df.to_dict(orient='index')

    # INITIALIZING XML FILE
    root = et.Element('data')

    # WRITING TO XML NODES 
    for i, d in df_dict.items():
        childRoot = et.SubElement(root, "row")

        for k, v in d.items():
            et.SubElement(childRoot, k).text = str(v)

    # SAVING XML FILE
    with open(io, 'wb') as f:
        f.write(et.tostring(root,
                            xml_declaration=True, 
                            encoding="UTF-8"))

def LXML_write_xml(df, io):
    df_dict = df.to_dict(orient='index')

    # INITIALIZING XML FILE
    root = lxet.Element('data')

    # WRITING TO XML NODES 
    for i, d in df_dict.items():
        childRoot = lxet.SubElement(root, "row")

        for k, v in d.items():
            lxet.SubElement(childRoot, k).text = str(v)

    tree = lxet.ElementTree(root)
    tree.write(io,
               pretty_print=True, 
               xml_declaration=True, 
               encoding="UTF-8")

    return None
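As a quick self-contained sanity check of the parsing approach above, here is a minimal sketch using an inline XML string instead of a URL or file (the data is made up for illustration):

```python
import xml.etree.ElementTree as et
import pandas as pd

# Hypothetical shallow document: child elements become columns,
# row attributes are merged in alongside them
xml_str = """<data>
  <row id="1"><year>2002</year><month>JAN</month></row>
  <row id="2"><year>2002</year><month>FEB</month></row>
</data>"""

root = et.fromstring(xml_str)

# Merge child-element text and attributes into one dict per row,
# guarding against empty (None) text nodes
data = [{**{r.tag: (r.text.strip() if r.text and r.text.strip() else None)
            for r in row.findall('*')},
         **row.attrib}
        for row in root.findall('.//row')]

df = pd.DataFrame(data)
print(df)
```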

''' 
EXAMPLES of read_xml()
''' 
xml_df = read_xml("https://opendata.maryland.gov/api/views/un65-7ipd/rows.xml?accessType=DOWNLOAD", 
                  nodes = ".//row/row", 
                  url=True)
print(xml_df)
#       year month  ... _position                                           _address
#  0    2002   JAN  ...         0  https://opendata.maryland.gov/resource/un65-7i...
#  1    2002   FEB  ...         0  https://opendata.maryland.gov/resource/un65-7i...
#  2    2002   MAR  ...         0  https://opendata.maryland.gov/resource/un65-7i...
#  3    2002   APR  ...         0  https://opendata.maryland.gov/resource/un65-7i...
#  4    2002   MAY  ...         0  https://opendata.maryland.gov/resource/un65-7i...
#  ..    ...   ...  ...       ...                                                ...
#  221  2020   JUN  ...         0  https://opendata.maryland.gov/resource/un65-7i...
#  222  2020   JUL  ...         0  https://opendata.maryland.gov/resource/un65-7i...
#  223  2020   AUG  ...         0  https://opendata.maryland.gov/resource/un65-7i...
#  224  2020   SEP  ...         0  https://opendata.maryland.gov/resource/un65-7i...
#  225  2020   OCT  ...         0  https://opendata.maryland.gov/resource/un65-7i...

xml_df = read_xml("http://ergast.com/api/f1/1954/results/1.xml",
                  nodes = ".//doc:Race", 
                  namespace = {'doc': "http://ergast.com/mrd/1.4"},
                  url=True)
print(xml_df)
#                RaceName  ...                                                url
# 0  Argentine Grand Prix  ...  http://en.wikipedia.org/wiki/1954_Argentine_Gr...
# 1      Indianapolis 500  ...  http://en.wikipedia.org/wiki/1954_Indianapolis...
# 2    Belgian Grand Prix  ...  http://en.wikipedia.org/wiki/1954_Belgian_Gran...
# 3     French Grand Prix  ...  http://en.wikipedia.org/wiki/1954_French_Grand...
# 4    British Grand Prix  ...  http://en.wikipedia.org/wiki/1954_British_Gran...
# 5     German Grand Prix  ...  http://en.wikipedia.org/wiki/1954_German_Grand...
# 6      Swiss Grand Prix  ...  http://en.wikipedia.org/wiki/1954_Swiss_Grand_...
# 7    Italian Grand Prix  ...  http://en.wikipedia.org/wiki/1954_Italian_Gran...
# 8    Spanish Grand Prix  ...  http://en.wikipedia.org/wiki/1954_Spanish_Gran...

xml_df = read_xml("https://www.sec.gov/info/edgar/edgartaxonomies.xml",
                  nodes = ".//Loc",
                  url=True)
print(xml_df)
#       Family Version  ...                             Namespace Prefix
# 0    US GAAP    2020  ...   http://xbrl.sec.gov/stpr/2018-01-31   stpr
# 1    US GAAP    2020  ...    http://xbrl.sec.gov/sic/2020-01-31    sic
# 2    US GAAP    2020  ...    http://xbrl.sec.gov/sic/2011-01-31    sic
# 3    US GAAP    2020  ...  http://xbrl.sec.gov/naics/2017-01-31  naics
# 4    US GAAP    2020  ...   http://xbrl.sec.gov/exch/2020-01-31   exch
# ..       ...     ...  ...                                   ...    ...
# 356     BASE    2010  ...          http://www.xbrl.org/2004/ref    ref
# 357     BASE    2010  ...          http://www.w3.org/1999/xlink  xlink
# 358     BASE    2010  ...        http://www.xbrl.org/2003/XLink     xl
# 359     BASE    2010  ...     http://www.xbrl.org/2003/linkbase   link
# 360     BASE    2010  ...     http://www.xbrl.org/2003/instance  xbrli

''' 
EXAMPLES OF write_xml() 
'''
write_xml(xml_df, 'Output.xml')
#<?xml version='1.0' encoding='UTF-8'?>
#<data><row><Family>US GAAP</Family><Version>2020</Version><Href>https://xbrl.sec.gov/...

LXML_write_xml(xml_df, 'Output.xml')
#<?xml version='1.0' encoding='UTF-8'?>
#<data>
#  <row>
#    <Family>US GAAP</Family>
#    <Version>2020</Version>
#    <Href>https://xbrl.sec.gov/stpr/2018/stpr-2018-01-31.xsd</Href>
#    <AttType>SCH</AttType>
#    <FileTypeName>Schema</FileTypeName>
#    <Elements>1</Elements>
#    <Namespace>http://xbrl.sec.gov/stpr/2018-01-31</Namespace>
#    <Prefix>stpr</Prefix>
#  </row>
#...
#</data>

jreback commented 3 years ago

@ParfaitG you are welcome to submit a PR - enhancements happen by volunteers

ParfaitG commented 3 years ago

Understood @jreback . I can get started with 4 PRs (via 4 different branches), following the style and signatures of the HTML handling and the contribution guidelines:

  1. new xml.py script under 'pandas/io' for read_xml()
  2. new xml.py script under 'pandas/io/formats' for to_xml()
  3. new test_xml.py script under 'pandas/tests/io' for read_xml() tests
  4. new test_to_xml.py script under 'pandas/tests/io/formats' for to_xml() tests

I will emphasize in the notes that read_xml() will best handle shallower XML documents. I plan to integrate XSLT 1.0 support via lxml so users can flatten complex documents with .xsl stylesheets for optimal data frame migration in read_xml, and to redesign the raw output of to_xml.
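The XSLT flattening idea can be sketched with lxml as follows. This is only an illustration of the concept, not the eventual pandas implementation; the document, tag names, and stylesheet are all made up:

```python
import lxml.etree as lxet
import pandas as pd

# A hypothetical nested document to be flattened before migration
doc = lxet.XML(b"""<races>
  <race><name>Belgian Grand Prix</name>
    <winner><driver>Fangio</driver></winner>
  </race>
  <race><name>French Grand Prix</name>
    <winner><driver>Fangio</driver></winner>
  </race>
</races>""")

# XSLT 1.0 stylesheet that hoists the nested <driver> up to row level
xsl = lxet.XML(b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/races">
    <data>
      <xsl:for-each select="race">
        <row>
          <name><xsl:value-of select="name"/></name>
          <driver><xsl:value-of select="winner/driver"/></driver>
        </row>
      </xsl:for-each>
    </data>
  </xsl:template>
</xsl:stylesheet>""")

flat = lxet.XSLT(xsl)(doc)          # apply the transform
rows = [{c.tag: c.text for c in row} for row in flat.getroot()]
df = pd.DataFrame(rows)
```

With the document flattened to one level, the same shallow-XML parsing shown earlier applies directly.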

After the initial tested PRs, I can work with others more familiar with optimal object-oriented pandas style. Let me know your thoughts; if this works for the team, I can reply with t a k e.

jreback commented 3 years ago

tests should be in the same PR as the code change

ParfaitG commented 3 years ago

Got it. Will plan for 2 PRs (1 and 3) for read_xml() and (2 and 4) for to_xml().

Edit: I will also have to add a small to_xml() method to the DataFrameRenderer class in pandas/io/formats/format.py.

ParfaitG commented 3 years ago

take

minchulkim87 commented 3 years ago

@ParfaitG you sir are a legend! So many people want/need a pandas way of reading XML.

For whatever it is worth (I completely understand if none of this is useful), I have been using xmltodict as an intermediary for turning XML into a pandas DataFrame, leveraging existing tools. I thought I had made it intuitive enough, but perhaps not. Here root_key_list is the sequence of tag names to traverse to reach the tag to treat as the "top" level. I think I got the idea from pyspark.

from typing import Optional, List, OrderedDict
import pandas as pd
import xmltodict

def get_to_root_in_dict(the_dict: OrderedDict, root_key_list: Optional[List[str]]=None) -> OrderedDict:
    if not root_key_list:
        return the_dict
    elif len(root_key_list) > 1:
        return get_to_root_in_dict(the_dict[root_key_list[0]], root_key_list[1:])
    else:
        return the_dict[root_key_list[0]]

def read_xml_as_dataframe(xml: str, root_key_list: Optional[List[str]]=None, transpose: bool=False) -> pd.DataFrame:
    if transpose:
        return pd.DataFrame(get_to_root_in_dict(xmltodict.parse(xml), root_key_list)).T
    else:
        return pd.DataFrame(get_to_root_in_dict(xmltodict.parse(xml), root_key_list))

I've also noticed that since XML is just text, the files sometimes come inside zips, or even zips within zips. So my makeshift solution (the pandas-read-xml package) makes the "wrapper" read_xml() a generic reader for XML strings, URLs, zips, or double zips. I'm sure the pandas io people are much better at such things and at designing a more consistent, considered API, so my project remains a temporary solution for me.

Another issue I've had with XML data is that, unlike with JSON, pandas' json_normalize didn't always work well with the intended XML schema. In JSON a single element can still live inside a list, but in XML, if only one tag exists, the xml-to-dict conversion does not wrap that element in a list. The resulting dataframe ends up with a single column containing a mix of lists, strings, and NaNs. So my package had to include a flatten() method that made json_normalize() and explode() more suitable for XML-derived dataframes.
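The mixed-type-column pitfall can be sketched in pure pandas (the frame below simulates xmltodict output; the column names are made up):

```python
import pandas as pd

# Simulated xml-to-dict output: one record had a single <tag> (scalar),
# another had repeated <tag>s (list) -- the mixed-type column that
# trips up explode()/json_normalize for XML-derived frames
df = pd.DataFrame({
    "id": [1, 2],
    "tag": ["solo", ["a", "b"]],
})

# Wrap bare scalars in one-element lists so every cell is a list
df["tag"] = df["tag"].apply(lambda v: v if isinstance(v, list) else [v])

# Now explode() behaves uniformly across rows
flat = df.explode("tag", ignore_index=True)
print(flat)
```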

Anyways, thanks again for picking this up!

ParfaitG commented 3 years ago

@minchulkim87 - thanks! Your work is interesting and a convenient handler for those unfamiliar with the XML document model. I aim to focus on the XML standard, with DOM support for elements, attributes, namespaces, and even XPath and XSLT, using the XML handlers in Python's standard library plus the few core external libraries pandas already supports, such as numpy and lxml. I believe pandas intends to rely less on third-party packages, to avoid dependencies that can break APIs. Please stand by for updates!