propublica / upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
MIT License

The example in README.md does not work #29

Closed paos closed 10 years ago

paos commented 10 years ago

The example given in the README.md on the front page does not work as described: it returns the HTML in its entirety.

require "upton"

# First argument: the index page URL. Second: a CSS selector matching links to article pages.
scraper = Upton::Scraper.new("http://www.propublica.org", "section#river section h1 a")
scraper.scrape do |article_string|
  puts "here is the full text of the ProPublica article: \n #{article_string}"
  # or, do other stuff here.
end

jeremybmerrill commented 10 years ago

Hi Pål,

I think my explanation of the example is the problem. The expected output is the HTML of the article page (not the article content itself). The point is just to demonstrate how easily pages linked to by the index page ("instance pages") are scraped, leaving you to do whatever you want with them in the block. I will clarify; thanks for pointing this out!!
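To make the index page / instance page flow concrete, here is a minimal sketch using only the Ruby standard library. It is not Upton's implementation: `TinyScraper`, the `PAGES` hash, and the `example.test` URLs are hypothetical, pages are read from an in-memory Hash instead of being fetched over HTTP, and a regular expression stands in for the CSS selector.

```ruby
# Hypothetical stand-in for the index-page / instance-page flow.
# NOT Upton's code: no HTTP requests, and a regex replaces the CSS selector.
class TinyScraper
  # pages:        Hash mapping URL => HTML string (a fake "web")
  # index_url:    URL of the page that links to the instance pages
  # link_pattern: Regexp whose first capture group is an instance page URL
  def initialize(pages, index_url, link_pattern)
    @pages = pages
    @index_url = index_url
    @link_pattern = link_pattern
  end

  # Collects the instance page URLs found on the index page.
  def instance_urls
    @pages.fetch(@index_url).scan(@link_pattern).flatten
  end

  # Yields the full HTML of each instance page to the block
  # and returns an Array of the block's results.
  def scrape
    instance_urls.map { |url| yield @pages.fetch(url) }
  end
end

# Made-up pages used only for this illustration.
PAGES = {
  "http://example.test/" =>
    '<h1><a href="http://example.test/a">A</a></h1>' \
    '<h1><a href="http://example.test/b">B</a></h1>',
  "http://example.test/a" => "<html><body><p>Story A</p></body></html>",
  "http://example.test/b" => "<html><body><p>Story B</p></body></html>"
}.freeze

scraper = TinyScraper.new(PAGES, "http://example.test/", /<h1><a href="([^"]+)"/)

# The block receives each linked page as a whole HTML document.
results = scraper.scrape { |html| html }
```

In this sketch the block is handed each linked page in full, which mirrors why the README example prints entire HTML documents rather than article text.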

If you wanted to get just the article content for the site in the example, you could do something like this:

require "upton"
require "nokogiri"

scraper = Upton::Scraper.new("http://www.propublica.org", "section#river section h1 a")
scraper.scrape do |article_string|
  # Parse the instance page and select the paragraphs inside the article container.
  paragraphs = Nokogiri::HTML(article_string).search("div.article p")
  # Join the text of each paragraph, one per line.
  article_text = paragraphs.map(&:text).join("\n")
  puts "here is the full text of the ProPublica article: \n #{article_text}"
  # or, do other stuff here.
end

jeremybmerrill commented 10 years ago

I should note, Pål, that if you're looking for a tool to automatically extract article content from an HTML page without per-source configuration (which may be what you expected Upton to do based on my poorly worded README), I've had success with Boilerpipe.