scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Change extensions/spiders/settings initialisation order, v2 #6038

Closed wRAR closed 8 months ago

wRAR commented 8 months ago

Continuation of #1580, without code changes for now.

Several tests fail because you can no longer set TWISTED_REACTOR and LOG_FILE/LOG_LEVEL in custom_settings, which is a problem. We can try moving the reactor installation to crawl(), but the logging problem was discussed in the old PR and doesn't have a good solution.
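
For illustration, a minimal sketch (a hypothetical spider, not one from the test suite) of the pattern that breaks: per-spider settings that only work if they are applied before the reactor is installed and logging is configured.

```python
import scrapy


class DebugSpider(scrapy.Spider):
    name = "debug_spider"

    # With the new initialisation order these values would be applied too
    # late to have any effect: the reactor is already installed and logging
    # is already configured by then.
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "LOG_FILE": "debug_spider.log",
        "LOG_LEVEL": "DEBUG",
    }

    def parse(self, response):
        pass
```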

Also, I think we should deprecate running Crawler.crawl() multiple times as a separate uncontroversial PR, as proposed in the old one. #6040 done

Closes #1580, fixes #1305, fixes #2392, fixes #3663.

codecov[bot] commented 8 months ago

Codecov Report

Merging #6038 (7da3964) into master (da39fbd) will increase coverage by 0.02%. The diff coverage is 100.00%.

:exclamation: Current head 7da3964 differs from pull request most recent head 6428356. Consider uploading reports for the commit 6428356 to get more accurate results

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #6038      +/-   ##
==========================================
+ Coverage   88.92%   88.95%   +0.02%     
==========================================
  Files         163      163              
  Lines       11538    11567      +29     
  Branches     1877     1877              
==========================================
+ Hits        10260    10289      +29     
  Misses        969      969              
  Partials      309      309              
```

| Files Changed | Coverage Δ |
|---|---|
| scrapy/addons.py | `83.33% <ø> (ø)` |
| scrapy/settings/\_\_init\_\_.py | `95.45% <ø> (ø)` |
| scrapy/commands/shell.py | `93.61% <100.00%> (+0.13%)` :arrow_up: |
| scrapy/core/engine.py | `85.66% <100.00%> (+0.27%)` :arrow_up: |
| scrapy/core/scraper.py | `85.18% <100.00%> (+0.15%)` :arrow_up: |
| scrapy/crawler.py | `88.20% <100.00%> (+0.70%)` :arrow_up: |
| scrapy/downloadermiddlewares/httpcache.py | `94.04% <100.00%> (+0.07%)` :arrow_up: |
| scrapy/downloadermiddlewares/retry.py | `100.00% <100.00%> (ø)` |
| scrapy/dupefilters.py | `83.13% <100.00%> (+0.41%)` :arrow_up: |
| scrapy/extensions/httpcache.py | `95.47% <100.00%> (+0.01%)` :arrow_up: |
| ... and 2 more | |

Full report: https://app.codecov.io/gh/scrapy/scrapy/pull/6038
wRAR commented 8 months ago

> We can try moving the reactor installation to crawl()

This seems to work, at least when nothing imports twisted.internet.reactor between creating a Crawler instance (or a CrawlerProcess instance) and calling crawl() on it.
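
To make the timing concrete, a small sketch (reusing the hypothetical DebugSpider from above) of where an early reactor import would get in the way:

```python
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
})

# If anything does `from twisted.internet import reactor` at this point, the
# default reactor gets installed, and installing the asyncio reactor later,
# inside crawl(), fails with a reactor mismatch error.

process.crawl(DebugSpider)
process.start()
```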

wRAR commented 8 months ago

Direct link to the logging situation: https://github.com/scrapy/scrapy/pull/1580#issuecomment-153689154

In my experience, changing logging per spider is often done, e.g. if you have a large project where only some spiders need additional debugging. It can also be done via -s, but doing it via custom_settings should still be supported here.

wRAR commented 8 months ago

Maybe we just need both: a classmethod that is called early, for backward compatibility and for things that need to be set early, and a new instance method that is called late.

In any case, it's not clear when the addons should modify the settings.
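
A rough sketch of what that split could look like (update_settings() is the existing classmethod; the late instance hook and its name are purely hypothetical):

```python
import scrapy


class TunedSpider(scrapy.Spider):
    name = "tuned"

    @classmethod
    def update_settings(cls, settings):
        # Existing hook, called early: reactor and logging settings still
        # take effect when set here.
        super().update_settings(settings)
        settings.set("LOG_LEVEL", "INFO", priority="spider")

    def adjust_settings(self, settings):  # hypothetical late hook
        # Called on the instance, so spider arguments are available, but too
        # late for reactor or logging settings.
        if getattr(self, "gentle", None):
            settings.set("CONCURRENT_REQUESTS", 1, priority="spider")
```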

wRAR commented 8 months ago

How about this:

Not sure where the addon processing should happen here; probably after the spider __init__(), in which case all code paths are able to modify ADDONS. Not sure how bad it is, though, that settings modified by the addons themselves wouldn't be visible in all code paths.
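
For context, a minimal sketch of the kind of settings modification in question, assuming the standard add-on interface enabled through the ADDONS setting (the addon class, its project path and the setting it touches are made up):

```python
class ThrottleAddon:
    # Enabled via something like ADDONS = {"myproject.ThrottleAddon": 500};
    # update_settings() runs during crawler initialisation, and the open
    # question above is where that call sits relative to the spider hooks.
    def update_settings(self, settings):
        settings.set("CONCURRENT_REQUESTS_PER_DOMAIN", 2, priority="addon")
```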

Gallaecio commented 8 months ago

It sounds good to me in general, although I am not 100% sure.

I imagine some existing __init__ code might expect some components to exist beforehand, to access them from the crawler, but I am not sure how common that is. And since signals can be used to get another spider method called later on, maybe it is worth the backward-incompatible change?

Also, changing arguments based on final settings would not be possible before __init__ with this approach, since __init__ would be the one taking care of changing settings based on arguments. If we consider argument processing out of scope, then this is a non-issue. But what we choose here could limit what we can do when implementing argument processing support later on.

wRAR commented 8 months ago

> I imagine some existing __init__ code might expect some components to exist beforehand

Note that this doesn't work in the current PR code either, as ExtensionManager is initialized later.

> changing arguments based on final settings would not be possible before __init__ with this approach, since __init__ would be the one taking care of changing settings based on arguments

Not sure what this use case is?

Gallaecio commented 8 months ago

> > changing arguments based on final settings would not be possible before __init__ with this approach, since __init__ would be the one taking care of changing settings based on arguments
>
> Not sure what this use case is?

I am thinking of a future where we implement argument validation and conversion as an opt-in feature by exposing a setting for a new type of Scrapy component, in line with the fingerprinter setting, e.g. ARG_PARSER. The default parser could keep the current behavior, i.e. leave the arguments as they are, while a new parser could implement something like pydantic to error out in case of bad input, or convert strings to dict or int.

My thinking is that this setting-defined component, which could have settings of its own to control its behavior (e.g. a setting to disable exiting on error and instead only log a warning), should be executed before __init__, so that __init__ would get those arguments already processed, or not be called at all if argument processing exits early.
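
Purely to illustrate the idea (nothing here exists in Scrapy today: the ARG_PARSER setting, the interface and the class are all hypothetical):

```python
class IntArgParser:
    """Hypothetical ARG_PARSER component that converts argument strings to int."""

    def __init__(self, settings):
        # The component could have settings of its own, e.g. whether to fail
        # or only warn on bad input.
        self.strict = settings.getbool("ARG_PARSER_STRICT", True)

    def parse_args(self, spidercls, args: dict) -> dict:
        parsed = {}
        for name, value in args.items():
            try:
                parsed[name] = int(value)
            except (TypeError, ValueError):
                if self.strict:
                    raise ValueError(f"Bad value for argument {name!r}: {value!r}")
                parsed[name] = value  # default-like behaviour: leave as-is
        return parsed
```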

To enable such a future scenario, I am thinking we should have the following order:

  1. existing settings editing
  2. (future) argument parsing based on settings
  3. setting editing based on (parsed) arguments
  4. __init__ getting final settings and arguments.
wRAR commented 8 months ago

Another thing I've just found: Spider.__init__() does not have access to the crawler, as spider.crawler and spider.settings are set in from_crawler() after the instance is created. So the place to access them is an overridden from_crawler(), not an overridden __init__() (or some new method we could call on the instance at the end of the default from_crawler()).
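
A quick sketch (hypothetical spider) of the difference in practice:

```python
import scrapy


class SettingsAwareSpider(scrapy.Spider):
    name = "settings_aware"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # self.crawler and self.settings do not exist yet at this point.

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # The default from_crawler() has already set spider.crawler and
        # spider.settings, so this is the earliest place to use them.
        spider.logger.debug("LOG_LEVEL is %s", spider.settings.get("LOG_LEVEL"))
        return spider
```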

wRAR commented 8 months ago

There is a problem with moving the fingerprinter initialization inside crawl(): too many tests assume that crawler.request_fingerprinter exists just after creating the crawler.

I wonder if we want to split out a "prepare" method of Crawler that creates the spider instance etc. but doesn't start the engine, though I don't know if it would be useful outside tests.
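
For instance, roughly the assumption many existing tests make (using scrapy.utils.test.get_crawler):

```python
from scrapy import Request
from scrapy.utils.test import get_crawler

crawler = get_crawler()
# Tests expect the fingerprinter right after construction, without ever
# calling crawler.crawl(); this breaks if it is only created inside crawl().
fingerprint = crawler.request_fingerprinter.fingerprint(Request("https://example.com"))
```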

wRAR commented 8 months ago

I've implemented the change that keeps update_settings() as is and added a test that modifies settings in from_crawler(). As from_crawler() receives argument values it should cover some of the use cases we want.
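
Roughly the kind of pattern that test covers (hypothetical spider; the point is that from_crawler() sees the argument values and, with this PR, can still change settings):

```python
import scrapy


class GentleSpider(scrapy.Spider):
    name = "gentle"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Spider arguments are attributes by now, so settings can depend on them.
        if getattr(spider, "gentle", None):
            crawler.settings.set("DOWNLOAD_DELAY", 5.0, priority="spider")
        return spider
```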

wRAR commented 8 months ago

We discussed this with @kmike and decided that we want to move more things into crawl() and fix tests to call it there. Also that we don't want to change arguments based on settings.

wRAR commented 8 months ago

I've moved most things into crawl() and updated tests to call it where needed; the only failing tests are the ones for scrapy shell. It's problematic because:

https://github.com/scrapy/scrapy/blob/721df895f9ea9d8073c13fbd2f75a6fbdc75ffc7/scrapy/commands/shell.py#L77-L82

Not sure yet what to do with this, as the engine is right to assume that the crawler is initialized properly and we moved most of that initialization to crawl().

wRAR commented 8 months ago

It may be possible to extract some test changes and merge them separately if we decide to keep this approach with crawler.crawl() in tests, to simplify reviewing.

Gallaecio commented 8 months ago

> Not sure yet what to do with this, as the engine is right to assume that the crawler is initialized properly and we moved most of that initialization to crawl().

Would it make sense to add a load_settings parameter to Crawler, False by default, and set it to True only for that scenario, causing the setting loading code to run in __init__ instead of in crawl?

scrapy shell does not seem to allow spider arguments, so we don’t really need to support delayed setting loading for it.

Gallaecio commented 8 months ago

Or, to keep things simple, we keep the code that you moved in a private method of Crawler, and call it manually in the shell code:

    crawler = self.crawler_process._create_crawler(spidercls)
    crawler._load_settings()

wRAR commented 8 months ago

Yeah, and we could call that method in get_crawler(), like I proposed initially :)

wRAR commented 8 months ago

Tests and docs added, please let me know if more are needed.

wRAR commented 8 months ago

(PyPy test failures related to the reactor need to be fixed before merging)

wRAR commented 8 months ago

PyPy tests fail because on PyPy exceptions raised in Crawler.crawl() are silently discarded or handled somewhere while on CPython they are unhandled and show in the log. This happens on master as well.

wRAR commented 8 months ago

> PyPy tests fail because on PyPy exceptions raised in Crawler.crawl() are silently discarded or handled somewhere while on CPython they are unhandled and show in the log. This happens on master as well.

We rely on the "Unhandled error in Deferred" messages that are printed when such Deferreds are garbage collected (we call CrawlerProcess.crawl() without storing the Deferred that it returns), and on PyPy, where garbage collection works differently, they are not printed. So in tests where we expect an unhandled error we should instead add an explicit errback.
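
In test terms, the change looks roughly like this (crawler_process and BrokenSpider stand in for whatever the affected test uses):

```python
# Keep the Deferred and attach an explicit errback instead of relying on the
# "Unhandled error in Deferred" message emitted at garbage-collection time.
failures = []

d = crawler_process.crawl(BrokenSpider)
d.addErrback(failures.append)  # the Failure is captured even on PyPy
crawler_process.start()

assert failures  # the error raised inside Crawler.crawl() was observed
```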

kmike commented 8 months ago

The PR looks good to me, but we should make the tests pass.

wRAR commented 8 months ago

failure.printTraceback() doesn't work on Twisted 18.9.0 on Python 3.8+. This is fixed in Twisted 19.7.0, and the first Twisted version that officially supports Python 3.8 is even higher, 21.2.0. We could replace the logging code in these tests, or we could bump the minimum Twisted version, even though spiders work with 18.9.0 too.

wRAR commented 8 months ago

Though it looks like using twisted.python.log.err() still prints the exception we need, so we can just switch to it.
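
In the affected tests that would look something like this (the errback shape is assumed):

```python
from twisted.python import log


def print_error(failure):
    # Instead of failure.printTraceback(), which misbehaves on Twisted 18.9.0
    # with Python 3.8+, log the failure; this still prints the exception the
    # test looks for.
    log.err(failure)
```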

wRAR commented 8 months ago

The tests now pass.