sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

allow XPath within XmlReader #256

Open flavorjones opened 14 years ago

flavorjones commented 14 years ago

From an email from Devlin Daley to nokogiri-talk:

If I had any C extension fu, I would add what I think is an awesome approach to Nokogiri for parsing large XML files. The bummer about going with SAX or a reader is that you lose XPath and CSS selectors. The compromise is to restrict yourself to forward-looking XPath expressions and register those XPaths. Combine that with a Reader or pull parser, and you just ask for the next matching element. This is explained by Dare Obasanjo: http://msdn.microsoft.com/en-us/library/ms950778.aspx

On a similar note, libxml2 has a method called expand(), explained at the bottom of the page under "Mixing the reader and tree or XPath operations": when you find the start element of the node you're looking for, you can call expand() to get the subtree as a DOM, and then run selectors on just the interesting subset of the document. http://xmlsoft.org/xmlreader.html

The other Ruby libxml library exposes this method in its Reader, but whenever I tried to use it I ran into memory problems and crashes. http://libxml.rubyforge.org/rdoc/classes/LibXML/XML/Reader.html#M000353

byrnejb commented 14 years ago

I am running into this very difficulty at the moment. It would be SOOOO much more convenient if the Reader object had an xpath method that essentially created a mini XML document out of the current node and allowed XPath and CSS selectors within that subtree.

jmbrink26 commented 14 years ago

I definitely vote in favor of this. This would make rendering large XML schemas much faster.

+1! :)

jbasdf commented 14 years ago

I agree.

+1

davidrichards commented 14 years ago

+1

bmidgley commented 14 years ago

+1

cayblood commented 14 years ago

+1

ewollesen commented 14 years ago

+1

flavorjones commented 14 years ago

OK. We hear you. Scheduling work on this for the 1.5 branch.

tenderlove commented 14 years ago

Just looking into this, and I have some thoughts:

This looks like an incredibly tricky feature to add, and here's why: from the xmlTextReaderExpand documentation:

Returns a node pointer valid until the next xmlTextReaderRead()

Imagine we had some ruby code like this:

nodes = []
reader = Nokogiri::XML.Reader(some_xml)
reader.each do |r|
  ...
  nodes << r.expand.xpath('.//whatever')
end

Any nodes exposed in the subtree returned from the expand call will be invalid pointers on the next iteration of the reader block. Thus our nodes list will contain a boatload of bad pointers.

If we're going to add this feature, we need to figure out a way to sandbox the entire subtree inside the iteration block. Otherwise, people are going to crash left and right.

tenderlove commented 14 years ago

I've pushed a branch with a commit that integrates the expand method. If you pull the branch and run the tests, you'll see it crash and burn:

http://github.com/tenderlove/nokogiri/tree/expand

byrnejb commented 14 years ago

Q. How, exactly, does what xpath returns differ from an XML document? Is there no way of wrapping a pair of pseudo-root tags around it and treating the result as an XML document?

flavorjones commented 14 years ago

Agree with @tenderlove. I've tried hacking his branch to:

  1. dup and root the subtree (with xmlDocCopyNode) to try to make it persistent
  2. create a new document, and copy the subtree to that new document

and both are crashing and burning.

To work around this, we'll need to spend some time understanding how memory interaction works between Reader, Document and Node within libxml; and even once we understand it, I'm not sure we'll be able to hack a workaround together inside Nokogiri.

It's late, and I'm tired. I'll look again with fresh eyes later.

kliuless commented 13 years ago

I think Reader#outer_xml is a workaround for expand(), but it's probably not as efficient to have to re-parse the string into a doc (after the reader already parsed it to provide the outer_xml).

The project at http://libxml.rubyforge.org/ seems to have found a fix. See closed issue 20117. However, there may be a memory leak (issue 26297).

flavorjones commented 13 years ago

Awesome. I don't know how you found it (serious googlechaeology?) but here are the deep links:

I'll take a look.

dsisnero commented 11 years ago

I think what libxml2 has for this is xmlPattern. From the Perl bindings:

use XML::LibXML;

my $pattern = XML::LibXML::Pattern->new('/x:html/x:body//x:div', { 'x' => 'http://www.w3.org/1999/xhtml' });

# test a match on an XML::LibXML::Node $node
if ($pattern->matchesNode($node)) { ... }

# or on an XML::LibXML::Reader
if ($reader->matchesPattern($pattern)) { ... }

# or skip reading all nodes that do not match
print $reader->nodePath while $reader->nextPatternMatch($pattern);

$pattern = XML::LibXML::Pattern->new( pattern, { prefix => namespace_URI, ... } );
$bool = $pattern->matchesNode($node);

So if we can get xmlPattern support, we can keep using the Reader to skip quickly to where we want via a subset of XPath, and then read from there.

felixbuenemann commented 6 years ago

FYI some performance numbers:

Parsing through a 4 GB XML file and expanding 40,000 nodes takes around 450 seconds and 280 MB of RAM using Nokogiri when creating a new doc from the outer XML, versus around 95 seconds and 205 MB of RAM using libxml-ruby with reader.expand.

So indeed xmlTextReaderExpand is much more efficient.

Maybe a way to discourage use of the expanded node outside the current iteration would be a block API:

Nokogiri::XML::Reader(file).each do |n|
  if n.depth == 2 && n.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT && n.name == 'Product'
    # doc = Nokogiri::XML(n.outer_xml)
    n.expand do |doc|
      # do something with doc
    end
  end
end

There's probably no efficient way to prevent people from using the document outside the current iteration. The only thing I can think of is to wrap each document, node, etc. that is accessed inside the block in something that raises an exception when accessed outside of the block.
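That exception-raising wrapper can be sketched in pure Ruby (all names here are invented). It can't make the underlying C pointers safe, but it would turn a would-be crash into a Ruby exception:

```ruby
# A proxy that is explicitly invalidated when its block returns, so any
# later access raises instead of touching a freed subtree.
class ExpiringRef
  def initialize(obj)
    @obj = obj
  end

  def method_missing(name, *args, &block)
    raise "expanded node used outside its block" if @obj.nil?
    @obj.public_send(name, *args, &block)
  end

  def respond_to_missing?(name, include_private = false)
    !@obj.nil? && @obj.respond_to?(name, include_private)
  end

  def expire!
    @obj = nil
  end
end

def with_expiring(obj)
  ref = ExpiringRef.new(obj)
  yield ref
ensure
  ref.expire! # invalidate even if the block raises
end

leaked = nil
inside = with_expiring("subtree") { |doc| leaked = doc; doc.upcase }
inside # => "SUBTREE" (fine inside the block)
begin
  leaked.upcase # the reference escaped the block
rescue RuntimeError => e
  e.message # => "expanded node used outside its block"
end
```

The proxy indirection adds per-call overhead, which is why the comment above calls it inefficient; it is a safety net, not a fix.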

I still think this feature would be worthwhile to have, since it's very useful for batch processing of large XML files where all the logic for extracting information can be handled inside a single read operation.

Another approach would be to call xmlTextReaderPreserve during expand and xmlTextReaderCurrentDoc before freeing the reader, but I'm not sure how well that would interact with garbage collection.

jrochkind commented 6 years ago

i'd still love this