scrapy-plugins / scrapy-zyte-smartproxy

Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy
BSD 3-Clause "New" or "Revised" License
357 stars 88 forks source link

Check Crawlera Error Header when checking bans #64

Closed VMRuiz closed 5 years ago

VMRuiz commented 5 years ago

Checking only the status code is not enough to know if the response coming from crawlera is a ban or not. I have added extra checks to prevent false positives.

hcoura commented 5 years ago

I like this, to be honest I thought that this was already in place. Since there are quite a lot of 503's that are not bans (https://doc.scrapinghub.com/crawlera.html#errors)

Thanks, could you please fix the broken tests so I can merge it and schedule a release.

pablohoffman commented 5 years ago

This is great for the Crawlera middleware, but I wonder how important is this semantic and whether most (non-Scrapy) Crawlera customers are aware of it?. I don't see any mention on the Best Practices, should there be?

I was under the impression that 503's were still the main way for Crawlera to signal domains banned / no proxies.

hcoura commented 5 years ago

@pablohoffman scrapy-crawlera doesn't do retries. It checks for bans for stats and for killing the spider should too many bans occur.

We have these 503's

client_conn_closed | 503 | Connection closed by client
no_proxies | 503 | No available proxies
banned | 503 | Proxy has been banned
serverbusy | 503 | Server busy: too many outstanding requests

Only the banned one is actually a ban. So I completely agree with changes proposed by @VMRuiz. Regardless of anything all 503's should be retried like it says on the Best Practices.

eLRuLL commented 5 years ago

This is a fix for #39

codecov[bot] commented 5 years ago

Codecov Report

Merging #64 into master will increase coverage by 0.07%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #64      +/-   ##
==========================================
+ Coverage   94.59%   94.66%   +0.07%     
==========================================
  Files           2        2              
  Lines         148      150       +2     
==========================================
+ Hits          140      142       +2     
  Misses          8        8
Impacted Files Coverage Δ
scrapy_crawlera/middleware.py 94.59% <100%> (+0.07%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 28b38e4...e8b2ee9. Read the comment docs.

codecov[bot] commented 5 years ago

Codecov Report

Merging #64 into master will increase coverage by 0.07%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #64      +/-   ##
==========================================
+ Coverage   94.59%   94.66%   +0.07%     
==========================================
  Files           2        2              
  Lines         148      150       +2     
==========================================
+ Hits          140      142       +2     
  Misses          8        8
Impacted Files Coverage Δ
scrapy_crawlera/middleware.py 94.59% <100%> (+0.07%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 28b38e4...e8b2ee9. Read the comment docs.