sparklemotion / nokogiri.org

Documentation site for Nokogiri (a ruby library)
https://nokogiri.org/
MIT License

add example to tutorials: how to grab an HTML section (between like headers) #28

Open flavorjones opened 4 years ago

flavorjones commented 4 years ago

I wrote this up to answer somebody's question sometime in the past year, but I can't remember who asked or where. I think it's a good example of moving from a short but specific solution to a longer but general one, and hopefully it teaches folks about custom XPath handlers and XPath queries along the way.
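
As a rough mental model of what the script below works toward, here is a plain-Ruby sketch with no Nokogiri involved. It treats the siblings as hypothetical `[tag, text]` pairs and uses array indices to stand in for the `following-sibling` and `preceding-sibling` axes. The names `SIBLINGS` and `section_after` are made up for illustration and are not part of any library.

```ruby
# Hypothetical flat list of sibling elements, in document order.
SIBLINGS = [
  ["h2", "Background"],
  ["p",  "Uninteresting content."],
  ["h2", "Good Stuff"],
  ["p",  "This is the good stuff."],
  ["p",  "You really want just this section."],
  ["h2", "Unrelated Stuff"],
  ["p",  "A tangent."],
].freeze

# Returns the siblings after the header with the given text, up to (but not
# including) the next header of the same level. Returns [] if no header matches.
def section_after(siblings, header_text)
  start = siblings.index { |tag, text| tag.match?(/\Ah[1-6]\z/) && text == header_text }
  return [] if start.nil?

  header_tag = siblings[start][0]

  # indices playing the role of following-sibling::* of the target header
  following = ((start + 1)...siblings.size).to_a

  # first later sibling with the same tag, i.e. the next header of that level
  next_header = following.find { |i| siblings[i][0] == header_tag }
  return siblings.values_at(*following) if next_header.nil?

  # indices playing the role of preceding-sibling::* of that next header
  preceding = (0...next_header).to_a

  # the set intersection keeps only what lies between the two headers
  siblings.values_at(*(following & preceding))
end

section_after(SIBLINGS, "Good Stuff").each do |tag, text|
  puts "<#{tag}>#{text}</#{tag}>"
end
```

Intersecting indices rather than the pairs themselves keeps two identical paragraphs from collapsing into one, since `Array#&` removes duplicates by value.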

#! /usr/bin/env ruby
#
#  TODO: put this in the nokogiri.org tutorials
#

require "nokogiri"

html = <<~EOF
  <html>
  <body>
    <h1>My Fakepedia Page</h1>
    <div id="bodyContent">
      <div id="mw-content-text">
        <div class="mw-parser-output">
          <div id="toc">...</div>

          <h2>Background</h2>
          <p>This is uninteresting content and you don't want to scrape it.</p>

          <h2>Good Stuff</h2>
          <p>This is the good stuff.</p>
          <p>You really want to scrape just this section.</p>

          <h2>Unrelated Stuff</h2>
          <p>This is where the author has gone off on a tangent.</p>

          <h2>References</h2>
          <p>Snoozapalooza.</p>
        </div>
      </div>
    </div>
  </body>
  </html>
EOF

doc = Nokogiri::HTML(html)

#
#  solution 1 - simple CSS query, process results in Ruby
#
#  i think you will agree, this is ugly code and makes a lot of
#  implicit assumptions about the structure of the document.
#
#  don't do this. better alternatives are provided below.
#
node_set = doc.css("div.mw-parser-output").children

# look forward until we get to the h2 that we want
start_index = 0
while !(node_set[start_index].name == "h2" && node_set[start_index].content == "Good Stuff")
  start_index += 1
end
start_index += 1

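# look forward until we get to the next h2
# (raises NoMethodError if "Good Stuff" is the last section, because
# node_set[end_index + 1] is then nil)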
end_index = start_index
while node_set[end_index + 1].name != "h2"
  end_index += 1
end

# slice the node set
puts node_set[start_index..end_index]
puts "-----"

#
#  solution 2 - using a custom XPath function to perform set intersection
#
#  this is much cleaner code, but still makes an assumption about the
#  structure of the document: you have to know the text of the next
#  header ("Unrelated Stuff") to build the query.
#
#  a better alternative is provided below
#
class XPathIntersection
  def self.intersection(set1, set2)
    set1 & set2 # in ruby, return the intersection of the NodeSets
  end
end

xpath_query = <<~EOX
  intersection(//h2[text()='Good Stuff']/following-sibling::*,
               //h2[text()='Unrelated Stuff']/preceding-sibling::*)
EOX

puts doc.xpath(xpath_query, XPathIntersection)
puts "-----"

#
#  solution 3 - write a method to introspect on the document and use
#  more XPath queries to find the section boundary and return only the
#  nodes within the section.
#
#  note that it works:
#  - for any header level (h1, h2, h3, et al)
#  - even if the header is the last one of its level among its siblings
#  - knowing only the text of the header you care about
#
#  it uses:
#  - Node#path which returns an XPath query that points just to this node
#  - Node#name which returns the tag of the node (e.g., "h2", "div")
#
class XPathHeaderSection
  def self.header_section(node_set)
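    # nothing matched the header query, so there is no section to return
    return node_set if node_set.empty?

    document = node_set.document
    header = node_set.first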

    # grab siblings that follow the target header
    following_siblings_query = "#{header.path}/following-sibling::*"
    following_siblings = document.xpath(following_siblings_query)

    # check if there's a next header of the same type that's a sibling
    next_header_query = "#{header.path}/following-sibling::#{header.name}"
    next_header = document.at_xpath(next_header_query)

    if next_header
      preceding_siblings_query = "#{next_header.path}/preceding-sibling::*"
      preceding_siblings = document.xpath(preceding_siblings_query)

      following_siblings & preceding_siblings # NodeSet intersection, done in ruby
    else
      following_siblings
    end
  end
end

puts XPathHeaderSection.header_section(doc.xpath("//h2[text()='Good Stuff']"))

# note that you can also call this method as an XPath function
puts doc.xpath("header_section(//h2[text()='Good Stuff'])", XPathHeaderSection)
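
One caveat with queries like `//h2[text()='Good Stuff']`: XPath 1.0 string literals have no escape character, so a header containing an apostrophe breaks a single-quoted literal. The sketch below shows one common workaround, which is to switch quote styles and fall back to `concat()` when the text contains both. `xpath_literal` and `header_query` are hypothetical helper names, not Nokogiri API, and the `tag` argument is assumed to be a trusted element name.

```ruby
# Build an XPath 1.0 string literal for arbitrary text.
def xpath_literal(text)
  # no apostrophe: a single-quoted literal is safe
  return "'#{text}'" unless text.include?("'")
  # apostrophes but no double quotes: a double-quoted literal is safe
  return %("#{text}") unless text.include?('"')

  # both quote styles: split on apostrophes (keeping empty pieces), wrap each
  # piece in single quotes, and join the pieces with a double-quoted apostrophe
  pieces = text.split("'", -1).map { |piece| "'#{piece}'" }
  "concat(" + pieces.join(%q(, "'", )) + ")"
end

# Hypothetical helper that builds the header query used in the script above.
def header_query(text, tag: "h2")
  "//#{tag}[text()=#{xpath_literal(text)}]"
end

puts header_query("Good Stuff")
puts header_query("Author's Notes")
puts header_query(%(The "Good" Stuff's End))
```

The resulting string could then be passed to `doc.xpath` in place of the hard-coded query.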