r-lib / xml2

Bindings to libxml2
https://xml2.r-lib.org/
Other
220 stars 82 forks source link

Improve performance #394

Open mgirlich opened 1 year ago

mgirlich commented 1 year ago

I use the paws package to work with S3, e.g. list objects in a bucket. As this took quite a lot of time I did some profiling and noticed most of the time is spend in parsing the XML response (it uses/used as_list()). I created a PR (paws-r/paws#621) that improves the performance quite a bit but is still really slow (like 90% of the time is spend in parsing). To further improve the performance without trying to use/abuse xpath further, it is probably easier to improve the performance of xml2 in general.

D3SL commented 1 year ago

I work with XML files constantly and ran into this exact issue earlier this year as well. XML2 takes roughly a minute to extract data from a ~350kb-1.5mb xml file into a dataframe. For comparison I can process 600 files in the same amount of time by reading the file as a single column table with fread(), reformatting each row with stringr, flattening the table to a JSON string, converting it to a json and then back to a table, and then going through a series of unnest_wider and unnest_longer operations to populate parent data to child nodes.