omegahat / XML

The XML package for R

Add option to load/parse XML selectively [feature request] #3

Open mmeierer opened 9 years ago

mmeierer commented 9 years ago

Below is an example where I read in and parse an XML file. After storing the data in a data.table, I delete many columns.

It would be very helpful to have an option to parse XML files selectively: files that are too big for direct processing (because of memory restrictions) could then be loaded by reading in only the data that is needed for further analyses.

# Load packages
library(XML)
library(data.table)

# Set up toy example 
xmlText  <- "<posts>
  <row Id='1' PostTypeId='1' AcceptedAnswerId='8' CreationDate='2012-12-11T20:37:08.823' Score='42' ViewCount='5761' Body='&lt;p&gt;Assuming the world in the One Piece universe is round, then there is not really a beginning or and end of the grand line.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;The Straw Hats started out at the first half and now are sailing across the second half.&#xA;Wouldn.t it have been quicker to set sail in the opposite direction where they started?&lt;/p&gt;&#xA;' OwnerUserId='21' LastEditorUserId='88' LastEditDate='2013-06-20T03:28:03.750' LastActivityDate='2013-11-29T11:23:22.793' Title='The treasure in One Piece is at the end of the grand line. But isn.t that the same as the beginning?' Tags='&lt;one-piece&gt;' AnswerCount='4' CommentCount='0' />
  <row Id='2' PostTypeId='1' AcceptedAnswerId='33' CreationDate='2012-12-11T20:39:40.780' Score='10' ViewCount='161' Body='&lt;p&gt;In the middle of &lt;em&gt;The Dark Tournament&lt;/em&gt;, Yusuke Urameshi gets to fully inherit Genkai.s power of the &lt;em&gt;Spirit Wave&lt;/em&gt; by absorbing a ball of energy from her.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;My question is, why is it such a painful procedure to learn and absorb this power?&lt;/p&gt;&#xA;' OwnerUserId='26' LastEditorUserId='247' LastEditDate='2013-02-26T17:02:31.570' LastActivityDate='2013-06-20T03:31:39.187' Title='Why does absorbing the Spirit Wave from Genkai involve such a painful process?' Tags='&lt;yu-yu-hakusho&gt;' AnswerCount='1' CommentCount='0' />
  <row Id='3' PostTypeId='1' AcceptedAnswerId='148' CreationDate='2012-12-11T20:42:47.447' Score='6' ViewCount='1468' Body='&lt;p&gt;In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round.  At one point she even has a watermelon garden and attacks all the bugs that get near the melons.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;What.s the significance of the watermelon and why does she carry one around?&lt;/p&gt;&#xA;' OwnerUserId='29' LastActivityDate='2014-01-15T21:01:55.043' Title='What.s the significance of the watermelon in Sora no Otoshimono?' Tags='&lt;sora-no-otoshimono&gt;' AnswerCount='2' CommentCount='1' />
</posts>"

# Parse XML
doc <- xmlParse(xmlText, asText=TRUE)
r <- xmlRoot(doc)

# Convert XML to data table
# See: http://www.carlboettiger.info/2013/07/22/XML-parsing-strategies
d <- as.data.table(XML:::xmlAttrsToDataFrame(getNodeSet(r, path = "row")))

# Delete columns which are not used in further analyses
d[, c("Score", "ViewCount", "Body", "LastEditorUserId", "LastEditDate", 
      "LastActivityDate", "Title", "Tags", "AnswerCount", "CommentCount"):=NULL]

duncantl commented 9 years ago

See getNodeSet() and also xmlEventParse(). Using xmlAttrsToDataFrame(), you are doing more work than you need. Using getNodeSet() more effectively and efficiently, in terms of what results you extract, would be a lot better.
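
A minimal sketch of that idea, reusing the parsed doc from the toy example above (the particular attributes are just for illustration): pull only the columns you need with XPath, instead of converting every attribute.

# Extract only selected attributes directly via XPath
ids     <- xpathSApply(doc, "//row", xmlGetAttr, "Id")
created <- xpathSApply(doc, "//row", xmlGetAttr, "CreationDate")
owners  <- xpathSApply(doc, "//row", xmlGetAttr, "OwnerUserId")
d <- data.table(Id = ids, CreationDate = created, OwnerUserId = owners)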

mmeierer commented 9 years ago

Thanks for your feedback.

getNodeSet() is only applied after xmlParse(), right? But with a 40GB XML file, it would already crash during xmlParse().

Perhaps I misunderstand something. Do you by any chance have an example for such a case?

duncantl commented 9 years ago

40GB is large, but xmlParse() can surprise you and parse large files. Did it actually fail on that file?

xmlEventParse() is the thing to use. It is a lot more involved than xmlParse(), as you have to specify how to extract the information you want, and of course that is specific to your problem. You can also use a hybrid approach: xmlEventParse() with a sub-tree created at each of the nodes of interest.
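
A minimal, untested sketch of the event-driven approach, assuming the flat <row .../> structure from the toy example lives in a hypothetical file posts.xml, and keeping only a few attributes:

library(XML)

wanted <- c("Id", "CreationDate", "OwnerUserId")
rows <- list()

start <- function(name, attrs, ...) {
  # Called for every opening tag; keep only the wanted attributes of <row>
  if (name == "row")
    rows[[length(rows) + 1L]] <<- attrs[wanted]
}

xmlEventParse("posts.xml", handlers = list(startElement = start))
d <- do.call(rbind, rows)   # one row per <row>, columns as in 'wanted'

Only the selected attributes are ever kept in memory; the rest of the document streams through.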

There are examples in https://github.com/omegahat/XML/tree/master/myTests
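
And a rough sketch of the hybrid approach mentioned above (my own illustration, not one of the linked tests; it reuses wanted and rows from the previous sketch). The branches argument makes xmlEventParse() build a real sub-tree only at each <row>, so the usual node functions work on it:

rowBranch <- function(node) {
  # 'node' is a full node for one <row>; grab just the wanted attributes
  rows[[length(rows) + 1L]] <<- xmlAttrs(node)[wanted]
}
xmlEventParse("posts.xml", handlers = list(), branches = list(row = rowBranch))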

mmeierer commented 9 years ago

I just tried it again with a 30GB XML file. Again, xmlParse() fails on both my 8GB and my 16GB RAM machine. The error is always "Run out of memory".

The frustrating thing is that I need so little information from that XML. Hence my question about an option analogous to the fread(file, select = c("Column1", "Column2")) command in the data.table package. I thought that this might be a common operation these days.

duncantl commented 9 years ago

It is easy to extract specific columns from tabular data; you can do it with a shell command. XML is much richer than tabular data, so specifying which elements to extract by a simple name is ambiguous in general. Even if you could identify the elements of interest unambiguously by name, one has to express what to do with each sub-tree. If the elements are simple text, that is one thing. But what about the attributes? This is why xmlEventParse() exists: you can specify what you want to process and how.

mmeierer commented 9 years ago

Thanks for your comment. I tried different possibilities, including xmlEventParse() (which, at least from what I observed, is considerably slower than calling xmlParse()) and a version where I applied xmlParse() to every single row with data.table (after reading in the XML as a text file with fread()). However, both solutions only worked with small XML files (around 1-2 GB).

Thus, I decided to split the 40GB file up into chunks of 1 GB. Iterating over those files seems to be a good option, as the performance of xmlParse() is really great. However, with this approach I ran into the memory-leak problem reported on SO, e.g. http://stackoverflow.com/questions/23696391/memory-leak-when-using-package-xml-on-windows. With no combination of rm(), gc(), and free() was I able to clear the memory after each iteration. Do you have any hints on how to resolve that issue?
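
For reference, the iteration I am describing looks roughly like this (the chunks/ directory and file names are made up):

library(XML)
library(data.table)

files <- list.files("chunks", pattern = "\\.xml$", full.names = TRUE)
results <- vector("list", length(files))
for (i in seq_along(files)) {
  doc <- xmlParse(files[i])
  results[[i]] <- as.data.table(XML:::xmlAttrsToDataFrame(getNodeSet(doc, "//row")))
  free(doc)        # release the C-level libxml2 document
  rm(doc); gc()    # this cleanup does not reclaim the memory on Windows
}
d <- rbindlist(results, fill = TRUE)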

mmeierer commented 9 years ago

Here is an additional note to my previous comment: on a Mac, the memory issues are non-existent; there, the aforementioned combination of rm(), gc(), and free() seems to work.

duncantl commented 6 years ago

FYI, I wrote a tool, SAX2CSV, in C++ that can extract arbitrary attributes from "flat" XML files. It uses SAX (event-driven parsing), and since it is pure C++, it is much faster than SAX in R.

github.com:dsidavis/SAX2CSV