scrapy-plugins / scrapy-zyte-smartproxy

Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy
BSD 3-Clause "New" or "Revised" License
357 stars 88 forks source link

[MRG + 1] Add DEFAULT_CRAWLERA_HEADERS settings #58

Closed hcoura closed 6 years ago

hcoura commented 6 years ago

After discussion on https://github.com/scrapy-plugins/scrapy-crawlera/pull/54 it was clear that DEFAULT_CRAWLERA_HEADERS was the preferable way to handle crawlera specific headers.

After much thinking on https://github.com/scrapy-plugins/scrapy-crawlera/pull/52 I believe, for now, we shouldn't change default headers as it will not be backwards compatible.

This PR:

If you are ok with this, let me know and I will add documentation accordingly

hcoura commented 6 years ago

Scrapy 1.0 doesn't have without_none_values which is making Travis fail. This method was introduced on 1.1 and used on DefaultHeadersMiddleware. So I assume for a important reason.

I wonder if we should actually support Scrapy 1.0, it's just a None check anyway, but I don't see a reason for supporting a version that old.

codecov[bot] commented 6 years ago

Codecov Report

Merging #58 into master will increase coverage by 0.39%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   94.16%   94.55%   +0.39%     
==========================================
  Files           2        2              
  Lines         137      147      +10     
==========================================
+ Hits          129      139      +10     
  Misses          8        8
Impacted Files Coverage Δ
scrapy_crawlera/middleware.py 94.48% <100%> (+0.4%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 63650a6...692b165. Read the comment docs.

stummjr commented 6 years ago

IIUC, the headers defined in DEFAULT_REQUEST_HEADERS will take precedence over the ones defined in CRAWLERA_DEFAULT_HEADERS. But this is only true if Crawlera middleware is placed after Scrapy's DefaultHeadersMiddleware, otherwise we'll have the opposite behavior.

As for the docs, I think we should mention that precedence will change depending on where you place your middleware. The docs already suggest adding it with the 610 priority, which is fine, but maybe better to be explicit regarding this when describing the headers behavior.

stummjr commented 6 years ago

I like the idea in general and the code looks good. 👍

eLRuLL commented 5 years ago

@stummjr @hcoura when did we decide to go from DEFAULT_CRAWLERA_HEADERS to CRAWLERA_DEFAULT_HEADERS, it looks inconsistent to me. What do you guys think?

hcoura commented 5 years ago

It does look inconsistent, likely a oversight of mine back in September.

What would you rather do?

eLRuLL commented 5 years ago

I just wanted confirmation that this isn't the way we should go, I thought it was discussed and agreed that should be the name of the settings. I'll add the necessary changes.

hcoura commented 5 years ago

Looking at this again, I don't think it is inconsistent, every setting on this repo has been CRAWLERA_...