taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

On page error #29

Closed taganaka closed 10 years ago

coveralls commented 10 years ago

Coverage increased (+0.01%) when pulling 1cf986354255e8428890c8df97e4e8d66b20f989 on on_page_error into 637c70c3ccc80e73282a59ca702067d29d18234e on master.

coveralls commented 10 years ago

Coverage increased (+0.01%) when pulling 5344b070f871fb9993f6a273ce306827d7b66675 on on_page_error into 637c70c3ccc80e73282a59ca702067d29d18234e on master.

coveralls commented 10 years ago

Coverage increased (+0.18%) when pulling ba5876e156569ea199ad1964e326b16c9472b069 on on_page_error into 637c70c3ccc80e73282a59ca702067d29d18234e on master.

tmaier commented 10 years ago

:+1:

taganaka commented 10 years ago

Not sure if at this point we should skip the call to on_page_downloaded when a page has an error. Kind of a breaking change, but more consistent.

:question:

tmaier commented 10 years ago

Well... There is no header and no body, which will most likely lead to errors when processing the page with XPath or by other means. Second, the method is called on_page_downloaded, so one would expect a downloaded page.

But on the other hand, you might have code you always want to run. And the method is not called on_page_success or on_page_successfully_downloaded...

So I think you are right. We should run on_page_downloaded regardless of the error.

I would just like to see an example that shows the error-handling possibilities in polipus.

As a side note: I just thought about @on_before_save.each {|e| e.call(page)}. Should it run right before we store the page, and thus be skipped when storable? returns false? My conclusion was the same. The on_before_save block is the only place where one could manipulate storable? if necessary.

taganaka commented 10 years ago

I agree we should not change the current workflow. on_page_error is a semantically better way to encapsulate network errors. Better documentation plus more comprehensive examples should help users understand how to properly use the exposed DSL.

Re on_before_save: agreed, run it just before the page is stored and after on_page_error.
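To make the workflow discussed above concrete, here is a minimal, self-contained sketch of the callback order. It does not use Polipus itself: Page and MiniCrawler are hypothetical stand-ins, and the exact order (error handlers first, then on_page_downloaded unconditionally, then on_before_save, then the storable? check) is an assumption based on this thread, not a copy of the library's code.

```ruby
# Hypothetical stand-in for a crawled page; NOT Polipus::Page.
Page = Struct.new(:url, :body, :error, :storable) do
  def storable?
    storable
  end
end

# Hypothetical stand-in for the crawler DSL; NOT the Polipus crawler.
class MiniCrawler
  attr_reader :stored

  def initialize
    @on_page_error = []
    @on_page_downloaded = []
    @on_before_save = []
    @stored = []
  end

  def on_page_error(&block)
    @on_page_error << block
    self
  end

  def on_page_downloaded(&block)
    @on_page_downloaded << block
    self
  end

  def on_before_save(&block)
    @on_before_save << block
    self
  end

  # Assumed order, following the discussion in this thread.
  def process(page)
    # Error handlers run only when the fetch failed.
    @on_page_error.each { |e| e.call(page) } if page.error
    # Runs for every page, failed or not (current workflow is kept).
    @on_page_downloaded.each { |e| e.call(page) }
    # Runs right before the store step so it can still change storable?.
    @on_before_save.each { |e| e.call(page) }
    @stored << page.url if page.storable?
    page
  end
end

events = []
crawler = MiniCrawler.new

crawler.on_page_error { |page| events << [:error, page.url] }

crawler.on_page_downloaded do |page|
  # A failed page has no header and no body, so skip parsing.
  next if page.error
  events << [:parsed, page.url]
end

crawler.on_before_save do |page|
  events << [:before_save, page.url]
  # Keep failed pages out of storage.
  page.storable = false if page.error
end

crawler.process(Page.new('http://example.com/ok', '<html></html>', nil, true))
crawler.process(Page.new('http://example.com/down', nil, 'connection refused', true))
```

With this order, a handler registered via on_page_downloaded has to guard against page.error on its own, while on_before_save remains the place to decide whether a failed page gets stored. In the sketch only the first URL ends up in crawler.stored.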

coveralls commented 10 years ago

Coverage increased (+0.18%) when pulling ad44653fedbff91a80cff7c73054f409a8e84e27 on on_page_error into 637c70c3ccc80e73282a59ca702067d29d18234e on master.

coveralls commented 10 years ago

Coverage increased (+0.21%) when pulling 68a2e4315cfa3d3545b1a9fbc79a59dd291c10ed on on_page_error into 637c70c3ccc80e73282a59ca702067d29d18234e on master.
