taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License
92 stars 32 forks source link

Internet connection lost; Page still stored and processed #15

Closed tmaier closed 10 years ago

tmaier commented 10 years ago

Today, I lost the internet connection. I got the following error message in my logs:

W, [2014-05-18T18:44:44.684461 #13198]  WARN -- Polipus: Page http://www.example.com/foobar has error: No route to host - connect(2) for "www.example.com" port 80

Polipus should not proceed processing a page in this case. It should requeue the page and proceed with the next page.

taganaka commented 10 years ago

I'm more prone to let the client to handle page download failures rather than apply a re-enqueue strategy blindly A built in naive failure retry strategy for temporary network failures is already in place (https://github.com/taganaka/polipus/blob/master/lib/polipus/http.rb#L170-L179)

A client can easily implement its own error handling workflow for more fine grain control:

polipus.on_page_downloaded do |page|
  add_to_queue page if page.error
end

I can add a new block handler like on_page_error {|page|...} so that the error handling logic can be encapsulated there

on_page_downloaded blocks is always called, no matter what. Introducing on_page_error handler should then skip on_page_downloaded call since the page has not been downloaded

tmaier commented 10 years ago

I think Page#storable? should return false if there was an error. Drawback is that this would hinder us to investigate an error Advantage would be that if we queue the url again, it would skip the url later

I like on_page_error as this would encapsulate the error handling on one place.

And one of the examples should show

polipus.on_page_error do |page|
  polipus.add_to_queue(page) if page.error
end
tmaier commented 10 years ago

I was wrong. The drawback mentioned does not exist, as we do not store the error message in the database.

taganaka commented 10 years ago

storing the page with the error associated into the DB could be useful for better offline debug. Having on_page_error blocks executed before[1] page.storable? is evaluated, users could easily modify the flow:

polipus.on_page_error do |page|
  page.storable = false
  polipus.add_to_queue(page)
end

However to store the error message into the DB, error attribute needs to be serialized and exposed into page.to_hash method

[1]https://github.com/taganaka/polipus/blob/on_page_error/lib/polipus.rb#L207

taganaka commented 10 years ago

implemented here: https://github.com/taganaka/polipus/pull/29

taganaka commented 10 years ago

Addressed into v 0.3.0

https://github.com/taganaka/polipus/releases/tag/0.3.0