mgirlich opened this issue 1 year ago
I work with XML files constantly and ran into this exact issue earlier this year as well. xml2 takes roughly a minute to extract data from a ~350 KB to 1.5 MB XML file into a data frame. For comparison, in the same amount of time I can process 600 files by reading each file as a single-column table with `fread()`, reformatting each row with stringr, flattening the table into a JSON string, parsing that JSON back into a table, and then running a series of `unnest_wider()` and `unnest_longer()` operations to populate parent data down to the child nodes.
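For what it's worth, here is a minimal sketch of that kind of pipeline. It assumes a flat layout with one element per line; the file name (`items.xml`), element name (`item`), and attribute names (`id`, `value`) are purely illustrative and not taken from the files described above.

```r
library(data.table)
library(stringr)
library(jsonlite)
library(tidyr)
library(tibble)

# Read the whole file as a single character column (no separator, no quoting),
# similar to readLines() but via fread().
lines <- fread("items.xml", sep = "", header = FALSE,
               quote = "", col.names = "raw")$raw

# Keep the element lines and rewrite each one as a JSON object with a regex.
items <- str_subset(lines, "<item ")
json_rows <- str_replace(
  items,
  '.*<item id="([^"]*)" value="([^"]*)".*',
  '{"id": "\\1", "value": "\\2"}'
)

# Collapse everything into one JSON array, parse it, and rectangle it.
parsed <- fromJSON(
  str_c("[", str_c(json_rows, collapse = ","), "]"),
  simplifyVector = FALSE
)

# Nested child nodes would additionally need unnest_longer() here.
result <- tibble(record = parsed) |> unnest_wider(record)
```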
I use the paws package to work with S3, e.g. to list the objects in a bucket. As this took quite a lot of time I did some profiling and noticed that most of the time is spent parsing the XML response (it uses/used `as_list()`). I created a PR (paws-r/paws#621) that improves the performance quite a bit, but it is still really slow (roughly 90% of the time is spent in parsing). To improve performance further without trying to use/abuse XPath even more, it is probably easier to improve the performance of xml2 in general.
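To illustrate where the time goes, here is a small, self-contained comparison of `as_list()` against targeted XPath extraction on a synthetic document modelled loosely on an S3 ListObjects response. The element names and document size are assumptions made for the benchmark, not taken from paws.

```r
library(xml2)

# Build a synthetic response with 5000 <Contents> entries.
keys <- sprintf(
  "<Contents><Key>file-%d</Key><Size>%d</Size></Contents>",
  seq_len(5000), seq_len(5000)
)
doc <- read_xml(paste0(
  "<ListBucketResult>", paste(keys, collapse = ""), "</ListBucketResult>"
))

# Full conversion of the whole tree into nested lists.
system.time(as_list(doc))

# Pulling only the fields that are actually needed via XPath.
system.time(xml_text(xml_find_all(doc, ".//Key")))
```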