mmeierer opened this issue 9 years ago
See getNodeSet() and also xmlEventParse(). Using xmlAttrsToDataFrame() you are doing more work than you need. Using getNodeSet() more effectively and efficiently, in terms of which results you extract, would be a lot better.
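A minimal sketch of that suggestion, using a made-up inline document with <record> elements carrying id/value attributes (in practice this would be xmlParse() on a file):

```r
library(XML)

# Hypothetical input, for illustration only.
doc <- xmlParse('<records>
                   <record id="a" value="1"/>
                   <record id="b" value="2"/>
                 </records>', asText = TRUE)

# getNodeSet() evaluates an XPath expression, so only matching nodes are
# visited; xpathSApply() then pulls a single attribute from each match,
# which is cheaper than building a full attribute data frame.
ids    <- xpathSApply(doc, "//record", xmlGetAttr, "id")
values <- as.numeric(xpathSApply(doc, "//record", xmlGetAttr, "value"))

df <- data.frame(id = ids, value = values, stringsAsFactors = FALSE)
free(doc)  # release the C-level document
```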
Thanks for your feedback.
getNodeSet() is only applied after xmlParse(), right? But with a 40GB XML file, it would already crash during xmlParse().
Perhaps I misunderstand something. Do you by any chance have an example for such a case?
40GB is large, but xmlParse() can be surprisingly capable of handling large files. Did it actually fail on that file?
xmlEventParse() is the thing to use. It is a lot more involved than xmlParse() as you have to specify how to extract the information you want and of course that is specific to your problem. You can also use a hybrid approach of xmlEventParse() and creating a sub-tree at each of the nodes of interest.
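A sketch of that hybrid approach, again with a hypothetical <record> element: the `branches` argument of xmlEventParse() hands each matching element to a handler as a small DOM sub-tree while the rest of the file is streamed.

```r
library(XML)

rows <- new.env()
rows$data <- list()

# Each <record> is materialised as an in-memory sub-tree, so the usual
# DOM helpers (xmlAttrs, xmlValue, getNodeSet) work on it, while the
# surrounding document is never held in memory all at once.
branches <- list(
  record = function(node) {
    rows$data[[length(rows$data) + 1L]] <- xmlAttrs(node)
  }
)

xmlEventParse('<records><record id="a"/><record id="b"/></records>',
              handlers = list(), branches = branches, asText = TRUE)

result <- do.call(rbind, rows$data)
```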
There are examples in https://github.com/omegahat/XML/tree/master/myTests
I just tried it again with a 30GB XML file. Again, xmlParse fails on both my 8GB and my 16GB RAM machine. The error is always "run out of memory".
The frustrating thing is that I need only very little information from that XML. Hence my question about an option analogous to the fread(file, select = c("Column1", "Column2")) command in the data.table package. I thought this might be a common operation these days.
It is easy to extract specific columns from tabular data; you can do it with a shell command. XML is much richer than tabular data, so specifying which elements to extract by a simple name is ambiguous in general. Even if you could identify the elements of interest unambiguously by name, one still has to express what to do with each sub-tree. If the elements are simple text, that is one thing. But what about the attributes? This is why xmlEventParse() exists: you can specify what you want to process and how.
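For instance, a select-style extraction roughly analogous to fread(select = ...) can be expressed with a startElement handler; the element name <row> and the attribute names below are invented for illustration:

```r
library(XML)

wanted <- c("Column1", "Column2")
acc <- new.env()
acc$rows <- list()

# Only <row> start events are inspected; everything else streams past,
# so memory use stays flat regardless of file size.
startElement <- function(name, attrs, ...) {
  if (name == "row")
    acc$rows[[length(acc$rows) + 1L]] <- attrs[wanted]
}

xmlEventParse('<table><row Column1="1" Column2="x" Column3="junk"/></table>',
              handlers = list(startElement = startElement), asText = TRUE)
```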
Thanks for your comment. I tried different possibilities, including xmlEventParse (which, at least from what I observed, is considerably slower than calling xmlParse) and a version where I applied xmlParse to every single row with data.table (after reading in the XML as a text file with fread). However, both solutions only worked with small XML files (around 1-2 GB). Thus, I decided to split the 40GB file into chunks of 1 GB. Iterating over those files seems to be a good option, as the performance of xmlParse is really great. However, with this approach I ran into the memory leak problem reported on SO, e.g. http://stackoverflow.com/questions/23696391/memory-leak-when-using-package-xml-on-windows. With no combination of rm(), gc(), or free() was I able to clear the memory after each iteration. Do you have any hints on how to resolve that issue?
An additional note to my previous comment: on a Mac, the memory issues are non-existent. There, the aforementioned combination of rm(), gc(), and free() seems to work.
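For reference, a sketch of the per-chunk cleanup pattern being described (the chunk file names are hypothetical); whether free() actually returns the memory to the OS is exactly the platform difference observed here:

```r
library(XML)

chunk_files <- sprintf("chunk_%03d.xml", 1:40)  # hypothetical chunk names
results <- vector("list", length(chunk_files))

for (i in seq_along(chunk_files)) {
  doc <- xmlParse(chunk_files[i])
  results[[i]] <- xpathSApply(doc, "//record", xmlGetAttr, "id")
  free(doc)  # release the libxml2 tree behind the external pointer
  rm(doc)    # drop the R-side reference
  gc()       # prompt R to run finalizers and reclaim memory
}
```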
FYI, I wrote a tool, SAX2CSV, in C++ that can extract arbitrary attributes from "flat" XML files. It uses SAX (event-driven parsing) and, being pure C++, is much faster than SAX in R.
github.com:dsidavis/SAX2CSV
Below is an example where I read in and parse an XML file. After storing the data in a data.table, I delete many columns.
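Since the example itself did not survive in the thread, here is a hedged reconstruction of the workflow just described; the inline document and the column names are placeholders (the real input was a large file on disk):

```r
library(XML)
library(data.table)

# Placeholder input standing in for a large file.
doc  <- xmlParse('<records><record Column1="1" Column2="x" Column3="drop"/></records>',
                 asText = TRUE)
recs <- getNodeSet(doc, "//record")
dt   <- rbindlist(lapply(recs, function(n) as.list(xmlAttrs(n))), fill = TRUE)
free(doc)

# The unneeded columns are only removed *after* parsing, so their memory
# cost has already been paid -- which is what a select-style option would avoid.
drop_cols <- setdiff(names(dt), c("Column1", "Column2"))
dt[, (drop_cols) := NULL]
```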
It would be very helpful to have an option to parse XML files selectively, i.e. files that are too big for direct processing (because of memory restrictions) could then be loaded by reading in only the data that is needed for further analyses.