scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Cookiejars exposed #6218

Open GeorgeA92 opened 3 months ago

GeorgeA92 commented 3 months ago

Aimed to fix https://github.com/scrapy/scrapy/issues/1878, based on the suggestion from https://github.com/scrapy/scrapy/issues/1878#issuecomment-200907033
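In short, the PR makes the cookiejars kept by the cookies middleware reachable from the spider. A rough sketch of the intended usage, assuming the `cookie_jars` attribute name used in the script further down this thread (the default jar is stored under the `None` key when no `"cookiejar"` request meta is used):

```python
# Sketch only: assumes the spider-side `cookie_jars` attribute from this PR.
def parse(self, response):
    jar = self.cookie_jars[None]  # default jar; mapping is keyed by the "cookiejar" request meta value
    for cookie in jar.jar:        # .jar is the underlying http.cookiejar.CookieJar (iterable)
        self.logger.info("%s=%s for %s", cookie.name, cookie.value, cookie.domain)
```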

codecov[bot] commented 3 months ago

Codecov Report

Merging #6218 (b8f8960) into master (6f73dc0) will increase coverage by 0.18%. Report is 88 commits behind head on master. The diff coverage is 100.00%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #6218      +/-   ##
==========================================
+ Coverage   88.48%   88.67%   +0.18%
==========================================
  Files         160      161       +1
  Lines       11607    11792     +185
  Branches     1883     1912      +29
==========================================
+ Hits        10271    10457     +186
+ Misses       1009     1007       -2
- Partials      327      328       +1
```

| [Files](https://app.codecov.io/gh/scrapy/scrapy/pull/6218?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapy) | Coverage Δ | |
|---|---|---|
| [scrapy/downloadermiddlewares/cookies.py](https://app.codecov.io/gh/scrapy/scrapy/pull/6218?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapy#diff-c2NyYXB5L2Rvd25sb2FkZXJtaWRkbGV3YXJlcy9jb29raWVzLnB5) | `96.33% <100.00%> (+0.13%)` | :arrow_up: |

... and [14 files with indirect coverage changes](https://app.codecov.io/gh/scrapy/scrapy/pull/6218/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapy)
wRAR commented 3 months ago

I haven't read the original ticket recently, but why is this feature optional?

GeorgeA92 commented 2 months ago

This is how it works in the current version of the PR:

script_sample.py

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class Quotes(scrapy.Spider):
    name = "quotes"
    custom_settings = {"DOWNLOAD_DELAY": 1}

    def start_requests(self):
        yield scrapy.Request(url='https://quotes.toscrape.com/login', callback=self.login)

    def login(self, response):
        self.logger.info(self.cookie_jars[None])      # scrapy.http.cookies.CookieJar object
        self.logger.info(self.cookie_jars[None].jar)  # http.cookiejar.CookieJar object
        locale_cookie = self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")
        locale_cookie.value = locale_cookie.value.upper()
        self.logger.info(self.cookie_jars[None].jar)


if __name__ == "__main__":
    p = CrawlerProcess()
    p.crawl(Quotes)
    p.start()
```
log_output (fragment)

```
2024-02-23 10:51:27 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2024-02-23 10:51:27 [quotes] INFO:
2024-02-23 10:51:27 [quotes] INFO: ]>
2024-02-23 10:51:27 [quotes] INFO: ]>
2024-02-23 10:51:27 [scrapy.core.engine] INFO: Closing spider (finished)
```
Gallaecio commented 2 months ago

I'm slightly hesitant about setting a spider attribute from a middleware, and I wonder if it should instead be set from a different place, or on a different object (e.g. the crawler), but in general I'm fine with the approach.

@kmike Any thoughts on the general approach? Should @GeorgeA92 go on with tests and docs?
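For reference, a rough sketch of that alternative: keep the per-key jar mapping the existing CookiesMiddleware already has, but attach it to the crawler rather than the spider. The `cookie_jars` attribute name is just the one used in this thread, not an existing Crawler API:

```python
from collections import defaultdict

from scrapy.exceptions import NotConfigured
from scrapy.http.cookies import CookieJar


class CrawlerExposingCookiesMiddleware:
    """Sketch only: expose the jar mapping on the crawler instead of the spider."""

    def __init__(self, crawler):
        self.jars = defaultdict(CookieJar)
        # Hypothetical attribute; not part of the current Crawler interface.
        crawler.cookie_jars = self.jars

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool("COOKIES_ENABLED"):
            raise NotConfigured
        return cls(crawler)
```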

kmike commented 2 months ago

Hey! My main worry is the obscure API, which we'd need to document & support in the future. It'd require good documentation to explain a line like

`self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")`

It also needs access to a private attribute (`._cookies`).

GeorgeA92 commented 1 month ago

> Hey! My main worry is the obscure API, which we'd need to document & support in the future. It'd require good documentation to explain a line like
>
> `self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")`
>
> It also needs access to a private attribute (`._cookies`).

Another option is to update the scrapy.http.cookies.CookieJar class to add a more convenient way to interact with the cookiejar:

https://github.com/scrapy/scrapy/blob/1d11ea3a54607b436f9a88f07911902a4882f0e8/scrapy/http/cookies.py#L18-L30
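For illustration only, a hypothetical `get_cookie` convenience method on that class could rely on the public iteration API of the wrapped `http.cookiejar.CookieJar` (stdlib jars are iterable, and their cookies expose `.name`, `.value`, `.domain` and `.path`), avoiding the `._cookies` access entirely. This is a sketch of the idea, not code from this PR:

```python
# Hypothetical convenience method for scrapy.http.cookies.CookieJar; sketch only.
def get_cookie(self, name, domain=None, path=None):
    """Return the first matching http.cookiejar.Cookie, or None."""
    for cookie in self.jar:  # self.jar is the wrapped stdlib http.cookiejar.CookieJar
        if cookie.name != name:
            continue
        if domain is not None and cookie.domain != domain:
            continue
        if path is not None and cookie.path != path:
            continue
        return cookie
    return None
```

With something like that, the line quoted above could become `self.cookie_jars[None].get_cookie("session", domain="quotes.toscrape.com", path="/")`.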