Use content-type to skip non-HTML instance pages

swapab commented 10 years ago

I'm trying to scrape all the links on a site. For example, I tried:

u = Upton::Scraper.new("http://getbootstrap.com/2.3.2/", "a", :css)
u.verbose = true
u.sleep_time_between_request = 0

It then raises an encoding error when it reaches this URL:

Cache of http://getbootstrap.com/2.3.2/assets/bootstrap.zip unavailable. Will download from the internet
Downloading from http://getbootstrap.com/2.3.2/assets/bootstrap.zip
Downloaded http://getbootstrap.com/2.3.2/assets/bootstrap.zip
Writing http://getbootstrap.com/2.3.2/assets/bootstrap.zip data to the cache

Stack Trace

Encoding::UndefinedConversionError: "\xBE" from ASCII-8BIT to UTF-8
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton/downloader.rb:86:in `write'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton/downloader.rb:86:in `download_from_cache!'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton/downloader.rb:33:in `get'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton.rb:221:in `get_page'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton.rb:315:in `get_instance'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton.rb:332:in `block in scrape_from_list'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton.rb:331:in `each'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton.rb:331:in `each_with_index'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton.rb:331:in `each'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton.rb:331:in `map'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton.rb:331:in `scrape_from_list'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton.rb:177:in `block in scrape_to_csv'
    from /home/ubuntu-12-10/.rvm/rubies/ruby-2.0.0-p195/lib/ruby/2.0.0/csv.rb:1266:in `open'
    from /home/ubuntu-12-10/.rvm/gems/ruby-2.0.0-p195@scraper/bundler/gems/upton-011ff8ceef17/lib/upton.rb:175:in `scrape_to_csv'

jeremybmerrill commented 10 years ago

As you're probably aware, it looks like the link it's failing on is a zip archive, not an HTML page.
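For what it's worth, here is a minimal sketch of how this kind of error can arise. It assumes the cache file is opened as UTF-8 text, which is a guess on my part since the trace only shows that the failure happens inside `write`. `write_as_utf8_text` and `write_as_binary` are hypothetical helpers for illustration, not Upton methods.

```ruby
require "tmpdir"

# Hypothetical helper: writes data through a UTF-8 text-mode handle.
# Ruby tries to transcode the string to UTF-8, and a binary (ASCII-8BIT)
# string containing non-ASCII bytes cannot be transcoded.
def write_as_utf8_text(path, data)
  File.open(path, "w:UTF-8") { |f| f.write(data) }
end

# Hypothetical helper: writes the bytes untouched, with no transcoding.
def write_as_binary(path, data)
  File.binwrite(path, data)
end

# A few bytes standing in for a downloaded zip archive (assumed sample data).
zip_bytes = "PK\x03\x04\xBE".b

Dir.mktmpdir do |dir|
  path = File.join(dir, "bootstrap.zip")
  begin
    write_as_utf8_text(path, zip_bytes)
  rescue Encoding::UndefinedConversionError => e
    puts "text-mode write failed: #{e.message}"
  end
  write_as_binary(path, zip_bytes)
  puts "binary write kept all #{File.binread(path).bytesize} bytes"
end
```

If that assumption holds, writing cached responses in binary mode would avoid the exception, though non-HTML pages would still be of no use to a CSS or XPath selector.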

I think Upton should probably skip non-HTML pages by default. We could skip any response whose "Content-Type" header is "application/zip" or another non-HTML type. Does that make sense as a way to deal with this, @swapnilabnave?
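Roughly, the check I have in mind could look like the sketch below. This is only an illustration using Ruby's standard `net/http`; `html_content_type?`, `html_page?`, and `HTML_MEDIA_TYPES` are hypothetical names, not Upton's actual API, and the list of accepted types is an assumption.

```ruby
require "net/http"
require "uri"

# Media types treated as HTML (an assumed list; adjust as needed).
HTML_MEDIA_TYPES = ["text/html", "application/xhtml+xml"].freeze

# Hypothetical helper: true when a Content-Type header value names an
# HTML media type. Parameters such as "; charset=utf-8" are ignored.
def html_content_type?(header_value)
  return false if header_value.nil?
  media_type = header_value.split(";", 2).first.to_s.strip.downcase
  HTML_MEDIA_TYPES.include?(media_type)
end

# Hypothetical helper: sends a HEAD request so the body (for example a
# large zip archive) is never downloaded, then checks the header.
def html_page?(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  html_content_type?(response["Content-Type"])
end

puts html_content_type?("text/html; charset=utf-8")
puts html_content_type?("application/zip")
```

Matching against a short list of HTML types, rather than listing every non-HTML type to reject, means unknown binary formats are skipped as well. A missing header is treated as non-HTML here, which may be too strict for some servers.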

Renaming the issue.

jeremybmerrill commented 10 years ago

Though to be honest, I can't reproduce the issue. It works fine for me with a stock copy of Upton.

You can try out the content-types branch and see if that works for you. That'll skip the .zip page on the basis of its content type.

jeremybmerrill commented 10 years ago

Hey @swapnilabnave, have you had a chance to try this out again? I'd love to push a fix, if necessary.

jeremybmerrill commented 10 years ago

Happy to reopen this, @swapnilabnave, if you get a chance to test out my fix. But since I can't replicate the error, I'm closing this for now.

propublica / upton

Use content-type to skip non-HTML instance pages #22