scrapy-plugins / scrapy-zyte-smartproxy

Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy

BSD 3-Clause "New" or "Revised" License

356 stars 88 forks source link

enable spider attributes for the most common headers #54

Closed stummjr closed 6 years ago

stummjr commented 6 years ago

This PR fixes #53 by introducind support for Crawlera settings via spider attributes for some common headers, such as:

class FooSpider(scrapy.Spider):
    name = 'foo'
    crawlera_profile = 'mobile'  # sets `X-Crawlera-Profile: mobile` header in all requests
    crawlera_debug = 'ua'  # sets `X-Crawlera-Debug: ua` header in all requests
    ...

I didn't add support to the full list of headers, as we are just starting the discussion here.

codecov[bot] commented 6 years ago

Codecov Report

Merging #54 into master will increase coverage by 0.24%. The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
+ Coverage   93.65%   93.89%   +0.24%     
==========================================
  Files           2        2              
  Lines         126      131       +5     
==========================================
+ Hits          118      123       +5     
  Misses          8        8

Impacted Files	Coverage Δ
scrapy_crawlera/middleware.py	`93.79% <100%> (+0.25%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 4c0418c...1ef7ac5. Read the comment docs.

eLRuLL commented 6 years ago

@stummjr I actually like the idea. But few things to discuss:

We'll have to give support for all crawlera available headers.
What should have precedence here, DEFAULT_REQUEST_HEADERS or these ones?
Should we also give settings-level support?

baitxo commented 6 years ago

@stummjr I also like the idea!

Regarding @eLRuLL questions:

We'll have to give support for all crawlera available headers. Sure!
What should have precedence here, DEFAULT_REQUEST_HEADERS or these ones? I believe DEFAULT_REQUEST_HEADERS should have precedence. But we can raise a warning on spider_opened when we detect a collision.
Should we also give settings-level support? I believe so! That way we could provide better defaults like it was proposed in #52

eLRuLL commented 6 years ago

after checking #52 I think it would be more interesting to have something like DEFAULT_CRAWLERA_HEADERS, which could actually help only define headers that are part of crawlera.

This doesn't conflict with DEFAULT_REQUEST_HEADERS as that will only be used if the CrawleraMiddleware is enabled.

We can also enable that on spider level with the argument of default_crawlera_headers.

So, sorry but I think the current proposal is not really that necessary. We even had this discussion a while ago and decided to only stay with DEFAULT_REQUEST_HEADERS, that's why we give emphasis on that on the documentation.

stummjr commented 6 years ago

@eLRuLL @baitxo: I like the idea of having DEFAULT_CRAWLERA_HEADERS separated from DEFAULT_REQUEST_HEADERS. It is a very good replacement for the spider attributes I proposed here.

Some pros:

no maintenance burden when new headers are added or old ones get deprecated;
you have the same convenience at the spider level as with attributes, via custom_settings;
no need to translate header names into proper attribute names and keep a mental map of that;
avoid mistakenly sending X-Crawlera-* headers when the plugin isn't enabled (comparing with DEFAULT_REQUEST_HEADERS);

tl;dr: +1 to implement DEFAULT_CRAWLERA_HEADERS.