Documentation Needed: Explore an xml file with xml2

CerebralMastication commented 4 years ago

I hardly ever use xml files. But I'm no stranger to R or nested hierarchies (like R's lists). Today I ended up with a big xml file in my lap that I wanted to explore. I thought, "oh, I think I should try xml2 for understanding this!"

Um... I struggled with this for a few hours and I found the package and the documentation completely impenetrable because I don't speak the xml jargon. I am really confident that if a user understands xml conceptually and has used other xml tools then {xml2} is super useful. That's not me.

So here's my proposal: A vignette on exploring a random xml file using {xml2} with examples of how to tell what's in the file, how to pull elements out, how to pull elements out and pop them in a data frame, etc. A brief introduction to xml extraction for folks who are used to dealing with data frames, if you will.
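As a rough illustration of what the opening of such a vignette might cover, here is a minimal sketch. The tiny `<artists>` document, its tag names, and the column names are made up for the example and are not taken from the Discogs file; only the {xml2} functions themselves are real.

```r
library(xml2)

# Hypothetical stand-in for a real file; read_xml() also accepts a path or URL.
doc <- read_xml(paste0(
  "<artists>",
  "<artist id='1'><name>Alpha</name><profile>First</profile></artist>",
  "<artist id='2'><name>Beta</name><profile>Second</profile></artist>",
  "</artists>"
))

# What is in the file?
root_name <- xml_name(doc)     # tag name of the root element
n_children <- xml_length(doc)  # number of child elements under the root
xml_structure(doc)             # prints the tag hierarchy to the console

# Pull out every <artist> element as a nodeset (a vector of nodes).
artists <- xml_find_all(doc, ".//artist")

# Build a data frame with one row per <artist>.
artist_df <- data.frame(
  id = xml_attr(artists, "id"),
  name = xml_text(xml_find_first(artists, "./name")),
  profile = xml_text(xml_find_first(artists, "./profile")),
  stringsAsFactors = FALSE
)
```

The idea is to treat the repeated element (`artist` here) as the "row" and each child or attribute as a "column", which maps the tree onto the data frame mental model.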

FWIW, the data I was wrestling with was this 1.5 GB XML file of music artists from discogs.com: https://discogs-data.s3-us-west-2.amazonaws.com/data/2019/discogs_20191201_artists.xml.gz

CerebralMastication commented 4 years ago

This is really close to what I’m thinking of (at least the first half):

https://lecy.github.io/Open-Data-for-Nonprofit-Research/Quick_Guide_to_XML_in_R.html

I see from this a bit of my "impedance mismatch" that's causing me mental angst when trying to understand {xml2}... and not surprisingly it has to do with vectors vs. atomic values. Most of the discussions around XML are geared towards yanking out one item (i.e. a specific node or xpath). I have never wanted to do this even once. I always want to extract huge swaths of the data to answer complex (or even simple) questions about whole sets of things. I realize, after hours of reading, that {xml2} has some great tools for helping extract whole vectors of things out of xml. I think an intro vignette could help the newcomer to {xml2} make sense of these concepts.
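To make the "whole vectors" point concrete, here is a small sketch, again on invented data (the `realname` tag is hypothetical). It contrasts a flat `xml_find_all()` over the whole document with a per-node `xml_find_first()`, which returns one result per input node and so stays aligned when some elements are missing.

```r
library(xml2)

# Invented example: the second artist has no <realname>.
doc <- read_xml(paste0(
  "<artists>",
  "<artist><name>Alpha</name><realname>Ann</realname></artist>",
  "<artist><name>Beta</name></artist>",
  "<artist><name>Gamma</name><realname>Gil</realname></artist>",
  "</artists>"
))

# One call returns a whole character vector, not a single value.
all_names <- xml_text(xml_find_all(doc, ".//name"))

# A flat search drops the gap, so this vector no longer lines up with all_names.
realnames_flat <- xml_text(xml_find_all(doc, ".//realname"))

# Searching within each <artist> keeps one entry per artist, NA where absent.
artists <- xml_find_all(doc, ".//artist")
realnames_aligned <- xml_text(xml_find_first(artists, "./realname"))
```

The aligned form is the one that can go straight into a data frame column next to `all_names`.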

btw, I'm catching ideas here so it will be very public, but I fully intend to write this vignette or at least write a good starting point...

atroiano commented 4 years ago

I agree. The timing is good for me too, because I am working on extracting huge amounts of data from credit reports stored in XML, roughly 1.2K columns, which I am parsing into a database. I ended up just using trial and error, but I'm happy to help contribute sections on the topic as well. This week I am getting into a more complicated schema for other credit-related data while trying to follow this post: https://github.com/jennybc/manipulate-xml-with-purrr-dplyr-tidyr/blob/master/README.R

atroiano commented 4 years ago

@CerebralMastication I have some code that borrows from https://github.com/dantonnoriega/xmltools/blob/master/R/xml_to_df.R and throws all the terminal nodes into a long tibble, saving the nodes it traverses to reach each terminal value into columns. Once that's created, you can write custom functions to splice the different nodes into a wide tibble. I am going to put it on GitHub after the holidays. I was having massive issues parsing XML files that have ~21 different sections with mixes of arrays and values.
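For readers following along, the general "long table of terminal nodes" idea can be sketched with {xml2} alone. This is not the code described above, nor the xmltools implementation; it is a simplified illustration on an invented `<report>` document, and it stores the traversed path as a single string column rather than one column per level.

```r
library(xml2)

# Invented example document; tag names are hypothetical.
doc <- read_xml(paste0(
  "<report>",
  "<account><id>A1</id><balance>100</balance></account>",
  "<account><id>B2</id><balance>250</balance></account>",
  "</report>"
))

# Terminal nodes are elements that have no child elements.
leaves <- xml_find_all(doc, "//*[not(*)]")

# One row per terminal node: where it sits in the tree, its tag, and its text.
long <- data.frame(
  path = xml_path(leaves),
  field = xml_name(leaves),
  value = xml_text(leaves),
  stringsAsFactors = FALSE
)
```

From a table like this, the `path` column can be split or grouped to reshape selected sections into a wide layout.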

r-lib / xml2

Documentation Needed: Explore an xml file with xml2 #282