sul-dlss / common-accessioning

Suite of robots that handle the tasks of accessioning digital objects

reduce nokogiri memory consumption #1304

Closed peetucket closed 2 weeks ago

peetucket commented 2 weeks ago

Why was this change made? 🤔

Nokogiri's memory consumption is very large for big XML docs (a 240 MB doc becomes roughly 7 GB as an in-memory Nokogiri doc).

This change streams the XML doc to find the nodes and never loads it fully into memory, so it should scale to arbitrarily large files (larger files will just take longer, but should not blow up). It reduces memory usage to a level that isn't even noticeable and takes about the same amount of time (under 10 seconds on a laptop, about 30 seconds on the server).

There is a slight change to the structure of the XML produced, and I am trying to determine whether it matters (basically, some extra namespace URIs are added to the nodes within the page docs).

https://argo-qa.stanford.edu/view/druid:gq110sz4835

How was this change tested? 🤨

Tested on the server and on a laptop, plus integration tests on QA (working object: https://sul-purl-stage.stanford.edu/zd178rm8212)

peetucket commented 2 weeks ago

Unfortunately, after some more testing, it didn't make a big difference, so I'm working on this some more.

justinlittman commented 2 weeks ago

Do you need to use Nokogiri's SAX parser?

peetucket commented 2 weeks ago

The repeated namespace URI seems to be added by the Nokogiri reader when pulling nodes out via outer_html, and there wasn't an obvious way to remove it. It's definitely not needed, but it doesn't seem to do any harm based on my testing.