propublica / upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
MIT License

find by xpath #18

Closed abacha closed 11 years ago

abacha commented 11 years ago

Is it possible to do something like this?

page = Upton::Scraper.new(url)
page.find_by_xpath("//body/div/a").value
jeremybmerrill commented 11 years ago

Hi @abacha,

Yes, Upton supports searching by XPath.

If you have an index page (a page with links to the pages you want to scrape), you could do something like this:

scraper = Upton::Scraper.new(url, "//body/div/a")
scraper.scrape do |instance_html, instance_url, instance_index|
  puts "The title of the page at #{instance_url} is #{Nokogiri::HTML(instance_html).title}"
end

Thanks to #11, you can use XPath or CSS selectors interchangeably.

abacha commented 11 years ago

I wish I could do it as simply as in my example. I need to run lots of searches with different XPath expressions against the same URL.

jeremybmerrill commented 11 years ago

Is the value of the content specified by the XPath expression another link to be scraped? Or just data you want to access?

And do you have lots of pages, or just one page to be scraped?

jeremybmerrill commented 11 years ago

If you just want to scrape lots of data from one page, just use Nokogiri. (Upton uses Nokogiri for HTML parsing.)

Nokogiri::HTML(Net::HTTP.get(URI(url))).xpath("//body/div/a").text
jeremybmerrill commented 11 years ago

Were you able to find a solution, @abacha?