propublica / upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
MIT License

Nokogiri::CSS::SyntaxError: unexpected '$' after '' #30

Closed irosenb closed 10 years ago

irosenb commented 10 years ago

I've been trying to get my link working as the index_url for a while, and it hasn't been working.

s = Upton::Scraper.new("http://shops.oscommerce.com/directory?country=US&page=1")

s.scrape { |html| puts html } 

Then I get this error:


Nokogiri::CSS::SyntaxError: unexpected '$' after ''
from /Users/user/.rvm/gems/ruby-2.0.0-p247/gems/nokogiri-1.6.1/lib/nokogiri/css/parser_extras.rb:87:in `on_error'

I'm having difficulty debugging this. If I put the URL into an array, Upton::Scraper.new(["http://shops.oscommerce.com/directory?country=US&page=1"]), then it works fine. But I would rather have this gem handle the pagination for me.

Can anyone give me some direction as to why this is happening? I know this is an edge case, but I can't find what's causing it.

jeremybmerrill commented 10 years ago

Hi Isaac,

I need to write some better documentation and/or change the API to better account for this use case -- it's not an edge case, but a relatively central one, and it's one that keeps tripping people up. So please accept my apologies for this not being clearer.

But first, I'm not sure which of two central cases you're going for here: Are you trying to scrape the data off the pages linked from that directory, or the data on the directory page itself?

If you're trying to scrape data off the pages linked from the directory, you need to give a CSS or XPath selector as the second argument to Scraper.new, i.e. Scraper.new("http://whatever", "a.relevant-link"). You can paginate automatically by setting pagination to true, pagination_param to "page" and pagination_max to whatever the max is.
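To make the pagination idea concrete, here is a minimal sketch of what stepping through a "page" query parameter amounts to. It uses only Ruby's standard library; the helper name paginated_urls and its keyword arguments are hypothetical and are not part of Upton, whose own implementation may differ.

```ruby
require "uri"

# Hypothetical helper (not part of Upton): builds one index URL per page by
# replacing the pagination parameter in the query string.
def paginated_urls(index_url, param: "page", max_pages: 3)
  uri = URI.parse(index_url)
  # Keep every query parameter except the pagination one.
  base_query = URI.decode_www_form(uri.query.to_s).reject { |key, _| key == param }

  (1..max_pages).map do |page_number|
    page_uri = uri.dup
    page_uri.query = URI.encode_www_form(base_query + [[param, page_number.to_s]])
    page_uri.to_s
  end
end

urls = paginated_urls("http://shops.oscommerce.com/directory?country=US&page=1", max_pages: 3)
urls.each { |url| puts url }
```

Each generated URL would then be treated as an index page from which the selector picks out the links to follow.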

irosenb commented 10 years ago

I'm trying to do both. This helped me solve the problem of scraping data off the pages linked from the directory. I wasn't aware that the Scraper object needed to target a link.

Now how do I then go about scraping the data on the directory page itself? I'm trying to get the uls on the page by using table + table ul li. Scraper seems to recognize that there are 12 instances but doesn't give me any more info.


s = Upton::Scraper.new("http://shops.oscommerce.com/directory?country=US&page=1", "table + table ul li")

s.sleep_time_between_requests = 1
=> 1

s.verbose = true
=> true

s.scrape { |html| puts html }

-------

Stashing disabled. Will download from the internet.
Downloading from http://shops.oscommerce.com/directory?country=US&page=1
Downloaded http://shops.oscommerce.com/directory?country=US&page=1
sleeping 1 secs
Scraping 12 instances

=> [nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil]

jeremybmerrill commented 10 years ago

Hey Isaac,

To scrape just a single page, put the URL in an array. (I should really write some helpers for this so it's clearer, and use them in the README (#TODO).) For example:

require 'upton'
require 'nokogiri'

s = Upton::Scraper.new(["http://website.com/whatever"])
# This block executes only once, with the HTML of the page you gave to Scraper.new.
s.scrape do |instance_html|
  page = Nokogiri::HTML(instance_html)
  page.search("table + table ul li").each { |li| puts li.text }
end

You can also use a helper like this:

s = Upton::Scraper.new(["http://website.com/whatever"])
my_list = s.scrape(&Upton::Utils.list("table + table ul li"))

or, even better:

my_list = Upton::Scraper.new(["http://website.com/whatever"]).scrape(&Upton::Utils.list("table + table ul li"))

which'll just return an array of the contents of the li elements.
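As a side note on the [nil, nil, ...] result earlier in the thread: judging from that output, scrape collects whatever the block returns, and puts always returns nil. The sketch below reproduces the effect with a stand-in method. collect_block_results is hypothetical and only mimics the observed behaviour; it is not Upton's code.

```ruby
# Stand-in for a scraper: yields each page's HTML to the block and collects
# whatever the block returns. Hypothetical; it only mimics the behaviour
# observed in the thread.
def collect_block_results(pages)
  pages.map { |html| yield html }
end

pages = ["<li>Shop A</li>", "<li>Shop B</li>"]

# `puts` prints its argument but returns nil, so every collected entry is nil.
printed = collect_block_results(pages) { |html| puts html }

# Returning a value from the block keeps the data instead.
kept = collect_block_results(pages) { |html| html[/<li>(.*?)<\/li>/, 1] }

p printed
p kept
```

So to get data back, have the block return the extracted values rather than print them.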

There's no built-in way to scrape data from both the index page AND the instance page with one pass. You're not the first one to ask for it though. I think what I might do for the next release of Upton (#TODO) is have scrape yield an instance of InstancePage (which sounds awkward...) -- which would include the Nokogiri'd HTML, the plain HTML, the URL, and a reference to the Nokogiri'd index page, etc.
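A rough sketch of what such an object might look like, purely as an illustration of the idea above. The Struct fields, the scrape_with_context method and the shop-a URL are all hypothetical, and the parsed (Nokogiri) documents are left out so the example needs no gems.

```ruby
# Hypothetical container for one instance page plus a reference back to its
# index page. Field names are illustrative only.
InstancePage = Struct.new(:url, :html, :index_url, :index_html, keyword_init: true)

# Stand-in for a scraper that yields one InstancePage per instance and
# collects the block's return values.
def scrape_with_context(index_url, index_html, instances)
  instances.map do |url, html|
    yield InstancePage.new(url: url, html: html, index_url: index_url, index_html: index_html)
  end
end

rows = scrape_with_context(
  "http://website.com/whatever",
  "<ul><li>Shop A</li></ul>",
  { "http://website.com/shop-a" => "<h1>Shop A</h1>" }
) { |page| [page.index_url, page.url, page.html.length] }

p rows
```

With something along these lines, a single block could read data from both the index page and the instance page in one pass.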