Open akimd opened 3 years ago
Hi @akimd, thanks for asking this question.
I want to spend a little bit of time understanding the memory performance of this example in JRuby -- based on your description, it sounds like perhaps there's a memory leak in the JRuby implementation that we might be able to fix.
The Reader class is based on libxml2's xmlreader module. Although libxml2 uses a SAX-ish parser at the heart of its implementation, the xmlreader API is specialized: it is optimized for a memory pattern in which only a "cursor" is exposed as the parser encounters each node.
The JRuby implementation does not have a low-level parser abstraction like libxml2's xmlreader, and so it's emulating that API ... I'm not very familiar with this particular corner of the JRuby implementation (it's had many hands in it over the years, none of them mine) but it looks like it's using the standard SAX parser provided by Xerces, plus some wrapper logic to present a Reader cursor.
I have some ideas on where the issue might be, and it's probably in the JRuby Reader wrapper. I will dig in and see if I can figure it out.
In the meantime, if you are willing to take on the additional complexity of writing SAX parser handlers, you should find the memory performance of the SAX parser acceptable.
Hi Mike,
Thanks a lot for the quick response. OK, so you confirm that, with respect to resource consumption, XML::Reader is definitely expected to behave more like a nice and comfy SAX reader than a DOM one. That's reassuring. So that probably means that using something like inner_xml is asking for trouble, since it triggers parsing of the whole remainder of the file (we don't do that in the real case; it's just something I encountered when toying with the artificial example above).
Other team members are currently trying to address this issue using other parsers, but that causes other problems. I have no idea what the final choice will be, but we will be watching for changes in Nokogiri in this regard.
Thanks again!
with respect to resource consumption, XML::Reader is definitely expected to behave more like a nice and comfy SAX reader than a DOM one
That's correct, to the best of my knowledge! If it's not doing that then we should fix it; or else I need to understand the low-level implementation of libxml2 better.
I would love some help with this from any of the folks who are familiar with the JRuby implementation.
Hi guys, FWIW, we have fully converted our tool to using the SAX parser only. Cheers!
For posterity: this isn't the first issue filed about the memory utilization of Reader in JRuby -- see also #1066.
See #831 for another instance when we did work to try to improve memory usage.
Hi,
In the context of a Rails application, I have to process huge XML documents that are "flat". I mean, they could just have been CSV documents instead of XML, but the source provides only XML.
While it appears to work well in MRI, with JRuby the memory consumption is very high, and at some point the process gets stuck (out of memory).
The following stupid script mimics the problem I face:
The documentation is somewhat ambiguous on how XML::Reader works. It is easy to understand "The Reader parser is good for when you need the speed of a SAX parser, but do not want to write a Document handler." as meaning "this is a SAX parser with a thin interface on top to make it easier than dealing with SAX yourself".
However, the first node returned by XML::Reader has the whole document as inner_xml, so I am wondering if XML::Reader is really SAX.
What we need in a document that looks like
is to iterate just on the entries. What is the recommendation in such a case?
Thanks a lot for Nokogiri