scrapy-plugins / scrapy-zyte-smartproxy

Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy
BSD 3-Clause "New" or "Revised" License
356 stars 88 forks source link

enable spider attributes for the most common headers #54

Closed stummjr closed 6 years ago

stummjr commented 6 years ago

This PR fixes #53 by introducind support for Crawlera settings via spider attributes for some common headers, such as:

class FooSpider(scrapy.Spider):
    name = 'foo'
    crawlera_profile = 'mobile'  # sets `X-Crawlera-Profile: mobile` header in all requests
    crawlera_debug = 'ua'  # sets `X-Crawlera-Debug: ua` header in all requests
    ...

I didn't add support to the full list of headers, as we are just starting the discussion here.

codecov[bot] commented 6 years ago

Codecov Report

Merging #54 into master will increase coverage by 0.24%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
+ Coverage   93.65%   93.89%   +0.24%     
==========================================
  Files           2        2              
  Lines         126      131       +5     
==========================================
+ Hits          118      123       +5     
  Misses          8        8
Impacted Files Coverage Δ
scrapy_crawlera/middleware.py 93.79% <100%> (+0.25%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 4c0418c...1ef7ac5. Read the comment docs.

eLRuLL commented 6 years ago

@stummjr I actually like the idea. But few things to discuss:

baitxo commented 6 years ago

@stummjr I also like the idea!

Regarding @eLRuLL questions:

eLRuLL commented 6 years ago

after checking #52 I think it would be more interesting to have something like DEFAULT_CRAWLERA_HEADERS, which could actually help only define headers that are part of crawlera.

This doesn't conflict with DEFAULT_REQUEST_HEADERS as that will only be used if the CrawleraMiddleware is enabled.

We can also enable that on spider level with the argument of default_crawlera_headers.

So, sorry but I think the current proposal is not really that necessary. We even had this discussion a while ago and decided to only stay with DEFAULT_REQUEST_HEADERS, that's why we give emphasis on that on the documentation.

stummjr commented 6 years ago

@eLRuLL @baitxo: I like the idea of having DEFAULT_CRAWLERA_HEADERS separated from DEFAULT_REQUEST_HEADERS. It is a very good replacement for the spider attributes I proposed here.

Some pros:

tl;dr: +1 to implement DEFAULT_CRAWLERA_HEADERS.