sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

JRuby XML::Reader memory performance is poor #2224

Open akimd opened 3 years ago

akimd commented 3 years ago

Hi,

In the context of a Rails application, I have to process huge XML documents that are "flat". I mean, they could just have been CSV documents instead of XML, but the source provides only XML.

While it appears to work well in MRI, under JRuby the memory consumption is very high, and at some point the process gets stuck (out of memory).

The following stupid script mimics the problem I face:

require 'nokogiri'
require 'pathname'

p = Pathname.new('big.xml')
n = 10_000_000
ping = -> (msg) { puts "#{Time.now}: #{msg}" }

# Generate a large, flat XML document.
p.open('w') { |f|
    f.puts "<foos>"
    n.times{ f.puts "  <foo>Hello World</foo>" }
    f.puts "</foos>"
}

# Walk every node with the Reader, logging progress every million nodes.
ping['before']
c = 0
Nokogiri::XML.Reader(p.open).each do |node|
    ping[c] if c % 1_000_000 == 0
    c += 1
end
ping['after']

The documentation is somewhat ambiguous on how XML::Reader works. It is easy to understand "The Reader parser is good for when you need the speed of a SAX parser, but do not want to write a Document handler." as meaning "this is a SAX parser with a thin interface on top to make it easier than dealing with SAX yourself".

However, the first node returned by XML::Reader has the whole document as inner_xml, so I am wondering whether XML::Reader is really SAX.

What we need in a document that looks like

<foos>
  <foo>...</foo>
  <foo>...</foo>
  <foo>...</foo>
  ...
  <foo>...</foo>
</foos>

is to iterate over just the entries. What is the recommendation in such a case?

Thanks a lot for Nokogiri

flavorjones commented 3 years ago

Hi @akimd, thanks for asking this question.

I want to spend a little bit of time understanding the memory performance of this example in JRuby -- based on your description, it sounds like perhaps there's a memory leak in the JRuby implementation that we might be able to fix.

The Reader class is based on libxml2's xmlreader module. Although libxml2 uses a SAX-ish parser at the heart of its implementation, the API is specialized, and it is optimized for the memory pattern of exposing only a "cursor" as it encounters each node.

The JRuby implementation does not have a low-level parser abstraction like libxml2's xmlreader, and so it's emulating that API ... I'm not very familiar with this particular corner of the JRuby implementation (it's had many hands in it over the years, none of them mine) but it looks like it's using the standard SAX parser provided by Xerces, plus some wrapper logic to present a Reader cursor.

I have some ideas on where the issue might be, and it's probably in the JRuby Reader wrapper. I will dig in and see if I can figure it out.

In the meantime, if you are willing to take on the additional complexity of writing SAX parser handlers, you should find the memory performance of the SAX parser acceptable.

akimd commented 3 years ago

Hi Mike, thanks a lot for the quick response. OK, so you do confirm that with respect to resource consumption, XML::Reader is definitely expected to behave more like a nice and comfy SAX reader than a DOM one. That's reassuring. So that probably means that using something like inner_xml is asking for trouble, since it triggers the parsing of the whole remainder of the file (we don't do that in the real case, it's just something I encountered when toying with the artificial example above). Other team members are currently trying to address this issue using other parsers, but that causes other problems. I have no idea what the final choice will be, but we will be watching for changes in Nokogiri in this regard.

Thanks again!

flavorjones commented 3 years ago

with respect to resource consumption, XML::Reader is definitely expected to behave more like a nice and comfy SAX reader than a DOM one

That's correct, to the best of my knowledge! If it's not doing that then we should fix it; or else I need to understand the low-level implementation of libxml2 better.

flavorjones commented 3 years ago

I would love some help with this from any of the folks who are familiar with the JRuby implementation.

akimd commented 3 years ago

Hi guys, FWIW, we have fully converted our tool to using the SAX parser only. Cheers!

flavorjones commented 3 years ago

For posterity: this isn't the first issue filed about the memory utilization of Reader in JRuby -- see also #1066.

flavorjones commented 3 years ago

See #831 for another instance when we did work to try to improve memory usage.