codeman101 closed this issue 3 years ago
This was discussed in #4734 but never got anywhere. Are there any standards for representing / translating tabular data to/from XML?
@WillAyd
I don't know if there are any standards per se. I used pandas to read CSV and JSON and have wondered ever since why XML wasn't supported by pandas, since it's also a very popular format. I always figured it was because parsing XML requires knowing certain things about the data in advance: every example I've seen of using the regular Python XML parser to load data into a pandas dataframe seems to require knowing the names of certain elements (or at least more than just the filename). But today a friend of mine told me that parsing XML is just as hard as parsing JSON, which is why I decided to ask.
I'm not aware of any major objections (others can certainly chime in) but would just need research and a PR from the community
@WillAyd
That would be great. I used pandas for work, and this would have been very useful to have; if it can be added, I can use it in my code.
I can look into this.
Hi, I have to deal with XML data for work, and I love pandas. If a `pd.read_xml()` were implemented, it would be amazing. I had to develop something quick, so I made this: https://github.com/minchulkim87/pandas_read_xml. I am not advanced enough to contribute to something as awesome as pandas, but if it is of any use, then this work has been worth it.
Has there been any progress on this ENH? While there is no standard XML, many public and private data dumps, repositories, and APIs that `pandas` users interact with are shallower types that can migrate to two-dimensional (row by column) data frames. These flatter, one-level-nested XML files should be able to migrate to data frames much like HTML tables do with the existing I/O solution, `read_html`. Additionally, `pandas` has much support for JSON types, which also have no standardized format, yet `pandas` authors have implemented `read_json`, `json_normalize`, and even `to_json`. Even more, Python ships with `xml.etree.ElementTree`, like `json`, in its standard library, with C-accelerated versions. So there is built-in support to handle XML files.
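As a minimal sketch of that built-in support (the XML snippet here is invented for illustration), `ElementTree` can turn a shallow, one-level-nested document into the list-of-dicts shape that the `DataFrame` constructor already accepts:

```python
import xml.etree.ElementTree as ET

xml = """<data>
  <row><year>2002</year><month>JAN</month><value>1.5</value></row>
  <row><year>2002</year><month>FEB</month><value>1.9</value></row>
</data>"""

root = ET.fromstring(xml)

# One dict per <row>, keyed by child tag name
records = [{child.tag: child.text for child in row} for row in root.findall("row")]
# records == [{'year': '2002', 'month': 'JAN', 'value': '1.5'},
#             {'year': '2002', 'month': 'FEB', 'value': '1.9'}]
```

`pd.DataFrame(records)` would then give the two-dimensional result directly.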
Other database and data science solutions provide such functionality for migrating XML documents to tables or datasets, including:

- **R**: the `XML` package (extended from the S language with `libxml` sources) maintains the convenience handler `xmlToDataFrame` (and even an undocumented `xmlAttrsToDataFrame`) for the simple, shallow structure that does appear reasonably common;
- **SAS**: the `libname xml` engine to read and write XML documents from SAS datasets;
- **Excel**: the `Workbooks.OpenXML` method with stylesheet and data type rendering;
- **MySQL**: `LOAD XML` to parse XML data into database tables (with specific XML setup beforehand), plus a client supporting `--xml` for data output to XML;
- **MS Access**: `Application.ImportXML` and `Application.ExportXML` to import/export XML documents to/from database tables.

I have answered many Stack Overflow questions where OPs regularly need a tool to migrate XML to data frames, albeit with varying levels of nesting. Surely, by `pandas` 2.0 maybe we can extend its rich IO tools to XML files.
While I would endeavor this PR, the `pandas` source code, object-oriented formatting and signatures, building from source in a virtualenv, and testing are very intricate, and my time at the moment is limited. Is there someone available (maybe the authors of `pandas/io/html.py`) to help with this ENH PR?
Consider the generalized versions of `read_xml` and `write_xml` below, handling default namespaces and migrating both node and attribute data using Python's standard library. The example docs vary in nesting levels. (I show an LXML version of `write_xml` for pretty-printed output, as I understand `read_html` interfaces with `lxml` if installed.) Likely, the `url` check is not needed given pandas' URL detector. Docs can emphasize that this IO method works best with shallower XML files, with instructions on how to parse with default namespaces (i.e., define a prefix).
```python
import pandas as pd
import xml.etree.ElementTree as et     # PYTHON STANDARD LIBRARY
from urllib import request             # PYTHON STANDARD LIBRARY
import lxml.etree as lxet              # THIRD-PARTY LIBRARY


def read_xml(io, nodes, namespace=None, url=False):
    if url:
        # URL
        rq = request.urlopen(io)
        xtree = et.fromstring(rq.read())
    else:
        # FILE
        xtree = et.parse(io)

    # LIST OF MERGED DICTIONARIES WITH TERNARY OPERATOR FOR NAMESPACE TYPES
    # (guards against empty elements, where r.text is None)
    data = [{**{r.tag.split('}')[1] if namespace else r.tag:
                r.text.strip() if r.text and r.text.strip() else None
                for r in row.findall('*')},
             **row.attrib}
            for row in xtree.findall(nodes, namespace)]

    # PASS PARSED DATA TO DataFrame CONSTRUCTOR
    return pd.DataFrame(data)
```
```python
def write_xml(df, io):
    df_dict = df.to_dict(orient='index')

    # INITIALIZE XML TREE
    root = et.Element('data')

    # WRITE TO XML NODES
    for i, d in df_dict.items():
        childRoot = et.SubElement(root, "row")
        for k, v in d.items():
            et.SubElement(childRoot, k).text = str(v)

    # SAVE XML FILE
    with open(io, 'wb') as f:
        f.write(et.tostring(root,
                            xml_declaration=True,   # requires Python 3.8+
                            encoding="UTF-8"))
```
```python
def LXML_write_xml(df, io):
    df_dict = df.to_dict(orient='index')

    # INITIALIZE XML TREE
    root = lxet.Element('data')

    # WRITE TO XML NODES
    for i, d in df_dict.items():
        childRoot = lxet.SubElement(root, "row")
        for k, v in d.items():
            lxet.SubElement(childRoot, k).text = str(v)

    # SAVE XML FILE WITH PRETTY PRINT
    tree = lxet.ElementTree(root)
    tree.write(io,
               pretty_print=True,
               xml_declaration=True,
               encoding="UTF-8")
    return None
```
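A side note not in the original comment: since Python 3.9 the standard library can pretty-print on its own via `xml.etree.ElementTree.indent`, which would make the `lxml` variant optional for formatting purposes:

```python
import xml.etree.ElementTree as et

# Build a tiny tree the same way write_xml() does
root = et.Element('data')
row = et.SubElement(root, 'row')
et.SubElement(row, 'Family').text = 'US GAAP'

et.indent(root)  # in-place indentation, Python 3.9+
out = et.tostring(root, encoding='unicode')
# out now contains newlines and indentation around each <row> element
```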
```python
'''
EXAMPLES of read_xml()
'''
xml_df = read_xml("https://opendata.maryland.gov/api/views/un65-7ipd/rows.xml?accessType=DOWNLOAD",
                  nodes=".//row/row",
                  url=True)
print(xml_df)
#     year month ... _position  _address
# 0   2002   JAN ...         0  https://opendata.maryland.gov/resource/un65-7i...
# 1   2002   FEB ...         0  https://opendata.maryland.gov/resource/un65-7i...
# 2   2002   MAR ...         0  https://opendata.maryland.gov/resource/un65-7i...
# 3   2002   APR ...         0  https://opendata.maryland.gov/resource/un65-7i...
# 4   2002   MAY ...         0  https://opendata.maryland.gov/resource/un65-7i...
# ..   ...   ... ...       ...  ...
# 221 2020   JUN ...         0  https://opendata.maryland.gov/resource/un65-7i...
# 222 2020   JUL ...         0  https://opendata.maryland.gov/resource/un65-7i...
# 223 2020   AUG ...         0  https://opendata.maryland.gov/resource/un65-7i...
# 224 2020   SEP ...         0  https://opendata.maryland.gov/resource/un65-7i...
# 225 2020   OCT ...         0  https://opendata.maryland.gov/resource/un65-7i...

xml_df = read_xml("http://ergast.com/api/f1/1954/results/1.xml",
                  nodes=".//doc:Race",
                  namespace={'doc': "http://ergast.com/mrd/1.4"},
                  url=True)
print(xml_df)
#                RaceName ... url
# 0  Argentine Grand Prix ... http://en.wikipedia.org/wiki/1954_Argentine_Gr...
# 1      Indianapolis 500 ... http://en.wikipedia.org/wiki/1954_Indianapolis...
# 2    Belgian Grand Prix ... http://en.wikipedia.org/wiki/1954_Belgian_Gran...
# 3     French Grand Prix ... http://en.wikipedia.org/wiki/1954_French_Grand...
# 4    British Grand Prix ... http://en.wikipedia.org/wiki/1954_British_Gran...
# 5     German Grand Prix ... http://en.wikipedia.org/wiki/1954_German_Grand...
# 6      Swiss Grand Prix ... http://en.wikipedia.org/wiki/1954_Swiss_Grand_...
# 7    Italian Grand Prix ... http://en.wikipedia.org/wiki/1954_Italian_Gran...
# 8    Spanish Grand Prix ... http://en.wikipedia.org/wiki/1954_Spanish_Gran...

xml_df = read_xml("https://www.sec.gov/info/edgar/edgartaxonomies.xml",
                  nodes=".//Loc",
                  url=True)
print(xml_df)
#      Family  Version ... Namespace                             Prefix
# 0   US GAAP     2020 ... http://xbrl.sec.gov/stpr/2018-01-31   stpr
# 1   US GAAP     2020 ... http://xbrl.sec.gov/sic/2020-01-31    sic
# 2   US GAAP     2020 ... http://xbrl.sec.gov/sic/2011-01-31    sic
# 3   US GAAP     2020 ... http://xbrl.sec.gov/naics/2017-01-31  naics
# 4   US GAAP     2020 ... http://xbrl.sec.gov/exch/2020-01-31   exch
# ..      ...      ... ... ...                                   ...
# 356    BASE     2010 ... http://www.xbrl.org/2004/ref          ref
# 357    BASE     2010 ... http://www.w3.org/1999/xlink          xlink
# 358    BASE     2010 ... http://www.xbrl.org/2003/XLink        xl
# 359    BASE     2010 ... http://www.xbrl.org/2003/linkbase     link
# 360    BASE     2010 ... http://www.xbrl.org/2003/instance     xbrli
```
```python
'''
EXAMPLES OF write_xml()
'''
write_xml(xml_df, 'Output.xml')
# <?xml version='1.0' encoding='UTF-8'?>
# <data><row><Family>US GAAP</Family><Version>2020</Version><Href>https://xbrl.sec.gov/...

LXML_write_xml(xml_df, 'Output.xml')
# <?xml version='1.0' encoding='UTF-8'?>
# <data>
#   <row>
#     <Family>US GAAP</Family>
#     <Version>2020</Version>
#     <Href>https://xbrl.sec.gov/stpr/2018/stpr-2018-01-31.xsd</Href>
#     <AttType>SCH</AttType>
#     <FileTypeName>Schema</FileTypeName>
#     <Elements>1</Elements>
#     <Namespace>http://xbrl.sec.gov/stpr/2018-01-31</Namespace>
#     <Prefix>stpr</Prefix>
#   </row>
# ...
# </data>
```
@ParfaitG you are welcome to submit a PR - enhancements happen by volunteers
Understood @jreback. I can get started with 4 PRs (via 4 different branches), following the style and signature of the `html` handling and the contribution guidelines:

1. `xml.py` script under `pandas/io` for `read_xml()`
2. `xml.py` script under `pandas/io/formats` for `to_xml()`
3. `test_xml.py` script under `pandas/tests/io` for `read_xml()` tests
4. `test_to_xml.py` script under `pandas/tests/io/formats` for `to_xml()` tests

I will emphasize in notes that `read_xml()` will best handle shallower XML documents. I plan to integrate XSLT 1.0 support in `lxml` to allow users, via `.xsl` stylesheets, to flatten complex documents for optimal data frame migration in `read_xml`, and to redesign the raw output of `to_xml`.
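For illustration only, here is a rough sketch of what XSLT 1.0 flattening via `lxml` could look like; the document, stylesheet, and element names are invented for this example and are not from the eventual pandas implementation:

```python
import lxml.etree as lxet

# Hypothetical nested document: each <shipment> holds an <address> child
# whose fields we want promoted to top-level row columns.
xml = b"""<shipments>
  <shipment><id>1</id><address><city>Baltimore</city><state>MD</state></address></shipment>
  <shipment><id>2</id><address><city>Annapolis</city><state>MD</state></address></shipment>
</shipments>"""

# XSLT 1.0 stylesheet that flattens the nested address into the row level
xsl = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/shipments">
    <data>
      <xsl:for-each select="shipment">
        <row>
          <id><xsl:value-of select="id"/></id>
          <city><xsl:value-of select="address/city"/></city>
          <state><xsl:value-of select="address/state"/></state>
        </row>
      </xsl:for-each>
    </data>
  </xsl:template>
</xsl:stylesheet>"""

transform = lxet.XSLT(lxet.fromstring(xsl))
flat = transform(lxet.fromstring(xml))
print(lxet.tostring(flat, pretty_print=True).decode())
```

The flattened `<data>/<row>` result is then the shallow shape `read_xml` handles well.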
After initial tested PRs, I can work with others more familiar with optimal object-oriented `pandas` style. Let me know your thoughts, and if this works for the team I can reply with t a k e.
tests should be in the same PR as the code change
Got it. Will plan for 2 PRs: (1 and 3) for `read_xml()` and (2 and 4) for `to_xml()`.

Edit: Will also have to make a small method addition of `to_xml()` in the `DataFrameRenderer` class of `pandas/io/formats/format.py`.
take
@ParfaitG you sir are a legend! So many people want/need a pandas way of reading XML.
For whatever it is worth (I completely understand if none of this is useful), I have been using `xmltodict` as an intermediary for turning XML into a pandas dataframe, leveraging existing tools. I thought I had made it intuitive enough, but perhaps not. Here `root_key_list` is the sequence of tag names to reach the tag to treat as the "top" level. I think I got the idea from `pyspark`:
```python
from typing import Optional, List, OrderedDict

import pandas as pd
import xmltodict


def get_to_root_in_dict(the_dict: OrderedDict, root_key_list: Optional[List[str]] = None) -> OrderedDict:
    # Recursively descend through the parsed dict following root_key_list
    if not root_key_list:
        return the_dict
    elif len(root_key_list) > 1:
        return get_to_root_in_dict(the_dict[root_key_list[0]], root_key_list[1:])
    else:
        return the_dict[root_key_list[0]]


def read_xml_as_dataframe(xml: str, root_key_list: Optional[List[str]] = None, transpose: bool = False) -> pd.DataFrame:
    if transpose:
        return pd.DataFrame(get_to_root_in_dict(xmltodict.parse(xml), root_key_list)).T
    return pd.DataFrame(get_to_root_in_dict(xmltodict.parse(xml), root_key_list))
```
I've also noticed that, XML being just a text file, the files sometimes come within zips or even double zips... So my makeshift solution (the pandas-read-xml package) makes the "wrapper" `read_xml()` a generic reader of an XML string, URL, zip, or double zip. I'm sure the pandas IO people are much better at such things and at designing a more consistent and considered API, so my project remains a temporary solution for me.
Another issue I've had with XML data is that, unlike with JSON data, pandas' `json_normalize` didn't always work well with the intended XML schema. In JSON, a single element can exist within a list, but in XML, if only one tag exists, the xml-to-dict process does not put that element into a list. So the resulting dataframe ends up with a single column holding some lists, some strings, and some NaNs. My package therefore had to include a `flatten()` method to make `json_normalize()` and `explode()` more suitable for XML-originating dataframes.
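The single-tag-versus-list mismatch described here can be sketched without any dependencies. The helper name `ensure_list` and the sample records below are hypothetical, not from the pandas-read-xml package:

```python
def ensure_list(value):
    """Wrap a scalar/dict in a list; leave lists alone; map None to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Two parsed XML records: a repeated tag parses to a list of dicts,
# but a single occurrence parses to a bare dict.
orders = [
    {"id": "1", "item": {"sku": "A"}},                 # one <item> -> dict
    {"id": "2", "item": [{"sku": "B"}, {"sku": "C"}]}, # two <item> -> list
]

# Normalizing both shapes before building rows avoids the mixed column
rows = [
    {"id": order["id"], "sku": item["sku"]}
    for order in orders
    for item in ensure_list(order["item"])
]
# rows == [{'id': '1', 'sku': 'A'}, {'id': '2', 'sku': 'B'}, {'id': '2', 'sku': 'C'}]
```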
Anyways, thanks again for picking this up!
@minchulkim87 - thanks! Your work is interesting and a convenience handler for those unfamiliar with the XML document model. I aim to focus on the XML standard with DOM support of elements, attributes, namespaces, even XPath and XSLT, using the XML handlers in Python's standard library and the few core external libraries that `pandas` supports, such as `numpy` and `lxml`. I believe `pandas` intends less reliance on third-party packages, to avoid dependencies that can break APIs. Please stand by for updates!
I'm sure this has been asked before but I can't find it. Why doesn't pandas have a method called read_xml?