Closed by peetucket 2 weeks ago
Unfortunately, after some more testing, it didn't make a big difference, so I'm working on this some more.
Do you need to use Nokogiri's sax parser?
The repeated namespace URI seems to be added by the Nokogiri reader when pulling nodes out via outer_html,
and there wasn't an obvious way to remove it. It's definitely not needed, but based on my test it doesn't seem to do any harm.
Why was this change made? 🤔
Nokogiri memory consumption is very large for big XML docs (a 240 MB doc becomes a 7 GB in-memory Nokogiri doc).
This change streams the XML doc to find the nodes and never loads it fully into memory, so it should scale to arbitrarily large files (larger files will just take longer but should not blow up). Memory usage drops to a level that isn't even noticeable, and the run takes about the same amount of time (under 10 seconds on a laptop, about 30 seconds on a server).
There is a slight change to the structure of the XML produced, and I am trying to determine whether it matters (basically, some extra namespace URIs are added to the nodes within the page docs).
https://argo-qa.stanford.edu/view/druid:gq110sz4835
How was this change tested? 🤨
Tested on a server and a laptop, plus integration tests on QA (working object: https://sul-purl-stage.stanford.edu/zd178rm8212)