scrapy-plugins / scrapy-zyte-smartproxy

Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy
BSD 3-Clause "New" or "Revised" License

Add exponential backoff when No Available Proxies #70

Closed hcoura closed 5 years ago

hcoura commented 5 years ago

Recent internal problems, where a provisioner restart caused spiders to stop working because the responses were being treated as bans, were fixed in https://github.com/scrapy-plugins/scrapy-crawlera/pull/64. Another issue remains: while no proxies are available, Scrapy's behavior is to retry with no delay until proxies come back.

This PR is supposed to help with that by introducing exponential backoff whenever the middleware encounters "No Available Proxies". Now that I am writing this, I am not sure it makes sense, or whether we would rather just use a biggish fixed delay when that happens. Opinions, please.

Also, if we go with this approach, I would like opinions on what the default values should be and whether some of them should be exposed as settings.
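The idea above can be sketched as a capped exponential backoff: double the delay on each consecutive "No Available Proxies" response, up to a ceiling. This is only an illustrative sketch; the function name and the default values (`step=15`, `max_delay=180`) are hypothetical, not the middleware's final API.

```python
def backoff_delay(attempt, step=15, max_delay=180):
    """Return the delay in seconds before the next retry.

    Hypothetical sketch: the delay doubles with each consecutive
    failed attempt (step, 2*step, 4*step, ...) and is capped at
    max_delay so it never grows unbounded.
    """
    return min(max_delay, step * 2 ** attempt)
```

A caller would pass the number of consecutive failures seen so far and reset that counter as soon as a request succeeds.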

codecov[bot] commented 5 years ago

Codecov Report

Merging #70 into master will increase coverage by 0.59%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #70      +/-   ##
==========================================
+ Coverage   94.66%   95.26%   +0.59%     
==========================================
  Files           2        3       +1     
  Lines         150      169      +19     
==========================================
+ Hits          142      161      +19     
  Misses          8        8
Impacted Files Coverage Δ
scrapy_crawlera/middleware.py 94.93% <100%> (+0.34%) ↑
scrapy_crawlera/utils.py 100% <100%> (ø)


kmike commented 5 years ago

We have exponential backoff implemented for proxy checks in https://github.com/TeamHG-Memex/scrapy-rotating-proxies. It uses jitter, following analysis in https://aws.amazon.com/ru/blogs/architecture/exponential-backoff-and-jitter/.

No strong opinion from me on a large fixed delay vs. exponential backoff though :) Backoff looks fine: with exponential backoff + jitter you should recover from "no proxy" errors faster once the error is resolved on the Crawlera side.
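The "full jitter" variant analyzed in the AWS post linked above picks a uniform random delay between zero and the capped exponential value, so many clients backing off at once do not all retry at the same moment. A minimal sketch, with illustrative (not final) parameter names and defaults:

```python
import random


def full_jitter_delay(attempt, step=15, max_delay=180):
    """Exponential backoff with 'full jitter' (per the AWS analysis):
    draw a uniform random delay in [0, cap], where cap grows
    exponentially with the attempt number and is limited by max_delay.
    """
    cap = min(max_delay, step * 2 ** attempt)
    return random.uniform(0, cap)
```

Because the delay is randomized, recovery after the error clears is quick on average: some clients will draw a small delay and retry almost immediately.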

hcoura commented 5 years ago

I think this is the deal: "with exponential backoff + jitter you should be recovering from 'no proxy' errors faster after an error is resolved on Crawlera side". If no one has strong opinions about it, I will do a final implementation next week.

jesuslosada commented 5 years ago

Is this error different from noslaves? If so, shouldn't we include noslaves too?

hcoura commented 5 years ago

Oh, thanks @jesuslosada! The doc at https://doc.scrapinghub.com/crawlera.html is actually outdated; the source shows the error is noslaves. I will update the PR and source accordingly.
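Detecting the noslaves condition could then look something like the sketch below. It assumes Crawlera signals the error through the `X-Crawlera-Error` response header; the helper name and the duck-typed `response` object are illustrative, not the middleware's actual code.

```python
def is_noslaves(response):
    """Hypothetical check: return True if the proxy reported that no
    proxy slots are available, assuming the error name is carried in
    the X-Crawlera-Error response header as the value b"noslaves".
    """
    return response.headers.get("X-Crawlera-Error", b"") == b"noslaves"
```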

baitxo commented 5 years ago

I like it. Just two things:

  1. Make all parameters configurable (i.e. CRAWLERA_BACKOFF_STEP and CRAWLERA_BACKOFF_MAX)
  2. Implement the delay as an iterator so you don't need to keep track of the current attempt on the client side (this is very optional)
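Point 2 above could be sketched as a generator: the caller just pulls the next delay on each consecutive failure and creates a fresh iterator once a request succeeds, so no attempt counter leaks into the client code. Names and defaults here are illustrative only.

```python
import random


def exp_backoff(step=15, max_delay=180):
    """Yield an endless sequence of jittered, exponentially growing
    delays. Each value is drawn uniformly from [0, cap], where cap
    doubles per attempt and is limited by max_delay. Illustrative
    sketch, not the middleware's actual implementation.
    """
    attempt = 0
    while True:
        cap = min(max_delay, step * 2 ** attempt)
        yield random.uniform(0, cap)
        attempt += 1
```

Usage would be `delays = exp_backoff()` and then `next(delays)` on every "no available proxies" response, discarding the iterator after a successful request.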