vifreefly / kimuraframework

Kimurai is a modern web scraping framework written in Ruby that works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and lets you scrape and interact with JavaScript-rendered websites
MIT License
1.01k stars 155 forks

How to handle Net::HTTPNotFound error? #3

Closed caecity43 closed 6 years ago

caecity43 commented 6 years ago
bundle exec kimurai console xxx_spider --url https://www.xxx.com/100

RuntimeError: Received the following error for a GET request to https://www.xxx.com/100: '404 => Net::HTTPNotFound for https://www.xxx.com/100 -- unhandled response'

I get the error shown in the log above. How can I handle it?

vifreefly commented 6 years ago

@caecity43 You can automatically retry failed requests (failures can happen while visiting a url) by setting the retry_request_errors: option to the list of errors you want to retry:

require 'net/http'
require 'kimurai'

class ExampleSpider < Kimurai::Base
  @config = {
    browser: {
      retry_request_errors: [Net::HTTPNotFound]
    }
  }
end

Keep in mind that some engines (for example, mechanize) can raise StandardError. In that case, add it to the list: retry_request_errors: [StandardError]. With that setting, any error raised while requesting a page (request_to) is retried automatically (3 retries by default). If the request still fails after 3 retries, the exception is raised. To handle it, use Ruby's standard begin/rescue block:

begin
  request_to :parse_product, url: some_url
rescue => e
  logger.error "Request failed (#{e.inspect}), skipping it..."
end
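
For illustration, the retry-then-rescue flow described above can be sketched in plain Ruby. This is only a minimal sketch, not Kimurai's actual implementation: with_retries and its errors: / max_retries: parameters are hypothetical names made up for this example, and the raised message just imitates the error from the report.

```ruby
# Hypothetical helper, NOT part of Kimurai: re-runs the block when it raises
# one of the listed error classes, and re-raises once the retries are used up.
def with_retries(errors:, max_retries: 3)
  attempts = 0
  begin
    yield
  rescue *errors
    attempts += 1
    raise if attempts > max_retries # retries exhausted: let the caller handle it
    retry                           # run the begin block again
  end
end

calls = 0
begin
  with_retries(errors: [RuntimeError]) do
    calls += 1
    # Stand-in for a request that always fails (raise with a String => RuntimeError).
    raise "404 => Net::HTTPNotFound for https://www.xxx.com/100"
  end
rescue => e
  # One initial attempt plus three retries have been made at this point.
  puts "Request failed after #{calls} attempts (#{e.inspect}), skipping it..."
end
```

Errors that are not in the list pass straight through without being retried, which is why the outer begin/rescue is still needed to skip a url that keeps failing.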

caecity43 commented 6 years ago

@vifreefly Thanks, it works.