I wrote this up to answer somebody's question in the past year, but I can't remember who or where. I think it's a good example of moving from a short-but-specific solution to a longer-but-general solution, and I hope it teaches folks about custom XPath handlers and XPath queries along the way.
#! /usr/bin/env ruby
#
# TODO: put this in the nokogiri.org tutorials
#
require "nokogiri"
html = <<~EOF
<html>
<body>
<h1>My Fakepedia Page</h1>
<div id="bodyContent">
<div id="mw-content-text">
<div class="mw-parser-output">
<div id="toc">...</div>
<h2>Background</h2>
<p>This is uninteresting content and you don't want to scrape it.</p>
<h2>Good Stuff</h2>
<p>This is the good stuff.</p>
<p>You really want to scrape just this section.</p>
<h2>Unrelated Stuff</h2>
<p>This is where the author has gone off on a tangent.</p>
<h2>References</h2>
<p>Snoozapalooza.</p>
</div>
</div>
</div>
</body>
</html>
EOF
doc = Nokogiri::HTML(html)
#
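# solution 1 - simple CSS query, process results in Ruby
#
# i think you will agree, this is ugly code and makes a lot of
# implicit assumptions about the structure of the document (for
# example, that the target h2 exists and that another h2 follows it;
# if either is missing, the loops run off the end of the node set).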
#
# don't do this. better alternatives are provided below.
#
node_set = doc.css("div.mw-parser-output").children
# look forward until we get to the h2 that we want
start_index = 0
until node_set[start_index].name == "h2" && node_set[start_index].content == "Good Stuff"
  start_index += 1
end
start_index += 1 # step past the header itself
# look forward until we get to the next h2
end_index = start_index
while node_set[end_index + 1].name != "h2"
  end_index += 1
end
# slice the node set
puts node_set[start_index..end_index]
puts "-----"
#
# solution 2 - using an XPath function to perform set intersection
#
# this is much cleaner code, but still makes an assumption about the
# structure of the document: you have to know the text of the header
# that follows the section you want ('Unrelated Stuff').
#
# a better alternative is provided below
#
class XPathIntersection
  def self.intersection(set1, set2)
    set1 & set2 # in ruby, return the intersection of the NodeSets
  end
end
xpath_query = <<~EOX
  intersection(//h2[text()='Good Stuff']/following-sibling::*,
               //h2[text()='Unrelated Stuff']/preceding-sibling::*)
EOX
puts doc.xpath(xpath_query, XPathIntersection)
puts "-----"
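#
# aside - the same intersection without a custom handler
#
# XPath 1.0 has no intersection operator, but the idiom often called
# the Kayessian method expresses one: A[count(. | B) = count(B)]
# keeps only the nodes of A that are also in B.
#
# this is a minimal sketch on a small made-up document (kayessian_doc,
# next_section and kayessian_query are illustration-only names). it
# makes the same assumption as solution 2: you know the next header.
#
require "nokogiri"
kayessian_doc = Nokogiri::HTML(<<~EOF)
  <div>
  <h2>Background</h2>
  <p>skip me</p>
  <h2>Good Stuff</h2>
  <p>keep me</p>
  <p>keep me too</p>
  <h2>Unrelated Stuff</h2>
  <p>skip me as well</p>
  </div>
EOF
# set B: everything before the next header
next_section = "//h2[text()='Unrelated Stuff']/preceding-sibling::*"
# set A, filtered down to the nodes that are also members of set B
kayessian_query = <<~EOX
  //h2[text()='Good Stuff']/following-sibling::*[
    count(. | #{next_section}) = count(#{next_section})
  ]
EOX
kayessian_nodes = kayessian_doc.xpath(kayessian_query)
puts kayessian_nodes
puts "-----"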
#
# solution 3 - write a method to introspect on the document and use
# more XPath queries to find the section boundary and return only the
# nodes within the section.
#
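# note that it works:
# - for any header level (h1, h2, h3, et al); the section ends at the
#   next sibling header of the same level
# - even if the header is the last one among its siblings (the section
#   then runs to the end of the parent element)
# - only requires knowing the text of the header you care about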
#
# it uses:
# - Node#path which returns an XPath query that points just to this node
# - Node#name which returns the tag of the node (e.g., "h2", "div")
#
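# a minimal sketch of those two methods on a small made-up document
# (path_demo and second_h2 are illustration-only names): the query
# that Node#path returns, fed back into #xpath, selects only that node.
#
require "nokogiri"
path_demo = Nokogiri::HTML("<body><h2>One</h2><p>text</p><h2>Two</h2></body>")
second_h2 = path_demo.xpath("//h2").last
puts second_h2.name # the tag name of the node
puts second_h2.path # an XPath query that points just to this node
round_trip = path_demo.xpath(second_h2.path)
puts round_trip.length # how many nodes that query selects
puts "-----"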
class XPathHeaderSection
  def self.header_section(node_set)
    document = node_set.document
    header = node_set.first
    # no matching header: return the (empty) node set instead of raising
    return node_set if header.nil?
    # grab siblings that follow the target header
    following_siblings_query = "#{header.path}/following-sibling::*"
    following_siblings = document.xpath(following_siblings_query)
    # check if there's a next header of the same type that's a sibling
    next_header_query = "#{header.path}/following-sibling::#{header.name}"
    next_header = document.at_xpath(next_header_query)
    if next_header
      preceding_siblings_query = "#{next_header.path}/preceding-sibling::*"
      preceding_siblings = document.xpath(preceding_siblings_query)
      following_siblings & preceding_siblings # NodeSet intersection, done in ruby
    else
      following_siblings
    end
  end
end
puts XPathHeaderSection.header_section(doc.xpath("//h2[text()='Good Stuff']"))
# note that you can also call this method as an XPath function
puts doc.xpath("header_section(//h2[text()='Good Stuff'])", XPathHeaderSection)