taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

SocketError could mean the domain is gone or there is no internet connection #50

Open tmaier opened 10 years ago

tmaier commented 10 years ago

When you try to resolve a domain which does not exist, polipus creates an error page with SocketError.

In that case the page no longer exists, so it is effectively a 404 error, just at the DNS level.

However, SocketError is also raised when the internet connection is lost for any reason.

So, to be sure the site is gone, we would need a method like this:

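    require 'excon'

    # Returns true if a well-known host answers, i.e. the network is up.
    # The caller can then treat a SocketError for the crawled page as "domain gone".
    def internet_connection_available?
      Excon.head('http://www.google.com')
      logger.debug { 'Webpage not available anymore' }
      true
    rescue Excon::Errors::SocketError
      # The probe itself failed, so the problem is on our side.
      logger.error { 'Internet connection lost' }
      false
    end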

Or maybe even better, something like this: http://stackoverflow.com/questions/2385186/check-if-internet-connection-exists-with-ruby/22837368#22837368

I use it like this:

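    crawler.on_page_error do |page|
      # Never persist a page that failed to load.
      page.storable = false
      # A SocketError while the network is up is treated as a vanished domain.
      webpage_gone = page.error.is_a?(SocketError) && internet_connection_available?
      # Retry everything except 404s and vanished domains.
      crawler.add_to_queue(page) unless page.not_found? || webpage_gone
    end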
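
The handler above relies on a connectivity probe. As a minimal sketch, assuming that opening a TCP connection to a public DNS server is an acceptable proxy for "the internet is up", the same kind of check can be written with only Ruby's standard library. The method name `tcp_reachable?`, the default host `8.8.8.8`, port 53, and the two-second timeout are illustrative assumptions, not part of polipus or Excon.

```ruby
require 'socket'

# Hypothetical helper: true if a TCP connection to host:port can be opened
# within `timeout` seconds, false on any resolution or connection failure.
# The default target is an assumption; pick one that suits your network.
def tcp_reachable?(host: '8.8.8.8', port: 53, timeout: 2)
  # Socket.tcp returns the block's value and closes the socket afterwards.
  Socket.tcp(host, port, connect_timeout: timeout) { |_socket| true }
rescue SocketError, SystemCallError, IOError
  false
end
```

Using an IP address as the default target keeps the probe independent of DNS, so a broken resolver is not confused with a dead domain. Networks that block outbound port 53 over TCP would need a different target.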

Shall we add something for this case directly to polipus?
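
One shape such built-in support could take is a small classifier that separates the outcomes discussed here. This is only a sketch: `classify_page_error`, the `FakePage` struct, and the `online` callable are hypothetical names used for illustration, not existing polipus API. Only `page.error` and `page.not_found?` are taken from the handler above.

```ruby
require 'socket' # defines SocketError

# Stand-in for a crawled page, exposing only what the handler above uses.
FakePage = Struct.new(:error, :code) do
  def not_found?
    code == 404
  end
end

# Returns :gone when the page should be dropped and :retry when it should be
# re-queued. `online` is any callable reporting whether the network is up.
def classify_page_error(page, online:)
  return :gone if page.not_found?
  # A SocketError only means "domain gone" if we are actually online.
  return :gone if page.error.is_a?(SocketError) && online.call
  :retry
end

page = FakePage.new(SocketError.new('getaddrinfo failed'), nil)
classify_page_error(page, online: -> { true })   # => :gone
classify_page_error(page, online: -> { false })  # => :retry
```

Passing the connectivity check in as a callable keeps the decision logic free of network access, so it can be unit-tested without a live connection.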