Merging #6038 (7da3964) into master (da39fbd) will increase coverage by 0.02%. The diff coverage is 100.00%.

:exclamation: Current head 7da3964 differs from pull request most recent head 6428356. Consider uploading reports for the commit 6428356 to get more accurate results.
We can try moving the reactor installation to `crawl()`. This seems to work, at least when nothing imports `twisted.internet.reactor` between creating a `Crawler` instance (or a `CrawlerProcess` instance) and calling `crawl()` on it.
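A rough sketch of that placement, reusing `scrapy.utils.reactor.install_reactor()`; the exact spot inside `crawl()` is an assumption, not what this PR does:

```python
from scrapy.utils.reactor import install_reactor

class Crawler:
    ...

    def crawl(self, *args, **kwargs):
        # Hypothetical placement: installing the reactor here instead of in
        # __init__() gives custom_settings a chance to set TWISTED_REACTOR,
        # as long as nothing imported twisted.internet.reactor earlier.
        reactor_path = self.settings.get("TWISTED_REACTOR")
        if reactor_path:
            install_reactor(reactor_path, self.settings.get("ASYNCIO_EVENT_LOOP"))
        ...
```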
Direct link to the logging situation: https://github.com/scrapy/scrapy/pull/1580#issuecomment-153689154
In my experience, changing logging per spider is common, e.g. in a large project where only some spiders need additional debugging. It can also be done via `-s`, but that should still be supported here.
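For example, both patterns mentioned here are standard Scrapy usage today:

```python
import scrapy

class NoisySpider(scrapy.Spider):
    name = "noisy"
    # Per-spider logging override, set in the spider class itself:
    custom_settings = {"LOG_LEVEL": "DEBUG"}

# The same override from the command line, without touching the code:
#   scrapy crawl noisy -s LOG_LEVEL=DEBUG
```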
Maybe we just need to have both a classmethod that is called early, for backward compatibility and for modifying things that need to be set early, and a new instance method that is called late. In any case, it's not clear when the addons should modify the settings.
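A rough sketch of that split; the late hook's name and the place it would be called from are purely hypothetical:

```python
import scrapy

class MySpider(scrapy.Spider):
    name = "example"

    @classmethod
    def update_settings(cls, settings):
        # Existing early hook: runs before the Crawler uses the settings,
        # so it can still change early-bound things like the reactor.
        super().update_settings(settings)
        settings.set("TWISTED_REACTOR", None, priority="spider")

    def update_settings_late(self, settings):
        # Hypothetical late hook: called on the instance (e.g. from
        # crawl()), so it can take spider arguments into account.
        if getattr(self, "debug_args", None):
            settings.set("LOG_LEVEL", "DEBUG", priority="spider")
```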
How about this:

- `update_settings()`: it's still a classmethod and it's still called early, and it can still change any settings, including `ADDONS` and `LOG_LEVEL`;
- `Crawler.__init__()`, unless we also want to affect it in spider `__init__()`;
- `Crawler.crawl()`;
- `__init__()` can now change the settings based on arguments, instance-specific logic etc., and it happens before component initialization, so most settings can be changed without problems.

Not sure where the addon processing should happen here, probably after the spider `__init__()`, in which case it allows all code paths to modify `ADDONS`. Not sure how bad it is that not all code paths would see settings modified by the addons themselves, though.
It sounds good to me in general, although I am not 100% sure.
I imagine some existing `__init__` code might expect some components to exist beforehand, to access them from the crawler, but I am not sure how common that is. And since signals can be used to get another spider method called later on, maybe it is worth the backward-incompatible change?

Also, changing arguments based on final settings would not be possible before `__init__` with this approach, since `__init__` would be the one taking care of changing settings based on arguments. If we consider argument processing out of scope, then this is a non-issue. But what we choose here could limit what we can do when implementing argument processing support later on.
> I imagine some existing `__init__` code might expect some components to exist beforehand

Note that this doesn't work in the current PR code either, as `ExtensionManager` is initialized later.
> changing arguments based on final settings would not be possible before `__init__` with this approach, since `__init__` would be the one taking care of changing settings based on arguments

Not sure what this use case is?
> > changing arguments based on final settings would not be possible before `__init__` with this approach, since `__init__` would be the one taking care of changing settings based on arguments
>
> Not sure what this use case is?
I am thinking of a future where we implement argument validation and conversion as an opt-in feature by exposing a setting for a new type of Scrapy component, in line with the fingerprinter setting, e.g. `ARG_PARSER`. The default parser could keep the current behavior, i.e. leave the arguments as they are, while a new parser could implement something like pydantic to error out in case of bad input, or convert strings to dict or int.

My thinking is that this setting-defined component, which could have settings of its own to control its behavior (e.g. a setting to disable exiting on error and instead only log a warning), should be executed before `__init__`, so that `__init__` would get those arguments already processed, or not be called at all if argument processing exits earlier.
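A rough sketch of such a component; the `ARG_PARSER` setting, the class and its `process_args()` interface are all hypothetical:

```python
class IntConvertingArgParser:
    """Hypothetical ARG_PARSER component that converts numeric string
    arguments to int and errors out on anything it cannot convert."""

    @classmethod
    def from_crawler(cls, crawler):
        # Like other setting-defined components, it could read settings of
        # its own, e.g. a switch to pass bad values through instead of
        # erroring out.
        return cls(strict=crawler.settings.getbool("ARG_PARSER_STRICT", True))

    def __init__(self, strict=True):
        self.strict = strict

    def process_args(self, spidercls, args):
        processed = {}
        for name, value in args.items():
            try:
                processed[name] = int(value)
            except (TypeError, ValueError):
                if self.strict:
                    raise ValueError(
                        f"bad value for spider argument {name!r}: {value!r}"
                    )
                processed[name] = value  # lenient mode: pass through
        return processed
```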
To enable such a future scenario, I am thinking we should have the following order:
- `__init__` getting final settings and arguments.

Another thing I've just found: `Spider.__init__()` does not have access to the crawler: `spider.crawler` and `spider.settings` are set in `from_crawler()` after the instance is created. So the place to access them is an overridden `from_crawler()`, not an overridden `__init__()` (or some new method we can call on the instance at the end of the default `from_crawler()`).
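A minimal example of that pattern (the `MY_DEBUG` setting is made up):

```python
import scrapy

class MySpider(scrapy.Spider):
    name = "example"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider.crawler and spider.settings are available here,
        # unlike inside __init__():
        spider.debug = spider.settings.getbool("MY_DEBUG")
        return spider
```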
There is a problem with moving the fingerprinter initialization inside `crawl()`: too many tests assume that `crawler.request_fingerprinter` exists right after creating the crawler.
I wonder if we want to split out a "prepare" method of `Crawler` that creates the spider instance etc. but doesn't start the engine, though I don't know if it would be useful outside tests.
I've implemented the change that keeps `update_settings()` as is and added a test that modifies settings in `from_crawler()`. As `from_crawler()` receives argument values, it should cover some of the use cases we want.
We discussed this with @kmike and decided that we want to move more things into `crawl()` and fix tests to call it there. Also that we don't want to change arguments based on settings.
I've moved most things into `crawl()` and updated tests to call it where needed; the only failing tests are the ones for `scrapy shell`. It's problematic because:
Not sure yet what to do with this, as the engine is right to assume that the crawler is initialized properly, and we moved most of that initialization to `crawl()`.
It may be possible to extract some test changes and merge them separately if we decide to keep this approach with `crawler.crawl()` in tests, to simplify reviewing.
> Not sure yet what to do with this, as the engine is right to assume that the crawler is initialized properly, and we moved most of that initialization to `crawl()`.
Would it make sense to add a `load_settings` parameter to `Crawler`, `False` by default, and set it to `True` only for that scenario, causing the setting-loading code to run in `__init__` instead of in `crawl()`?
`scrapy shell` does not seem to allow spider arguments, so we don't really need to support delayed setting loading for it.
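A rough sketch of that opt-in flag; `_load_settings()` stands for a hypothetical private method holding the initialization code this PR moved out of `__init__()`:

```python
class Crawler:
    def __init__(self, spidercls, settings=None, *, load_settings=False):
        self.spidercls = spidercls
        self.settings = settings
        self._settings_loaded = False
        if load_settings:
            # Opt-in early loading for callers like scrapy shell that need
            # a fully configured crawler before crawl() is ever called.
            self._load_settings()

    def _load_settings(self):
        # Hypothetical: reactor installation, logging setup, component
        # creation, etc. would live here.
        self._settings_loaded = True

    def crawl(self, *args, **kwargs):
        if not self._settings_loaded:
            self._load_settings()
```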
Or, to keep things simple, we keep the code that you moved in a private method of `Crawler`, and in the shell code we call it manually:

```python
crawler = self.crawler_process._create_crawler(spidercls)
crawler._load_settings()
```
Yeah, and we could call that method in `get_crawler()`, like I proposed initially :)
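That is, the test helper could make the call itself, something like this (a sketch; `_load_settings()` is still the hypothetical private method from above):

```python
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
from scrapy.spiders import Spider

def get_crawler(spidercls=None, settings_dict=None):
    # Sketch of scrapy.utils.test.get_crawler() doing the early
    # initialization itself, so existing tests keep getting a crawler
    # with all components already created.
    runner = CrawlerRunner(Settings(settings_dict or {}))
    crawler = runner.create_crawler(spidercls or Spider)
    crawler._load_settings()
    return crawler
```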
Tests and docs added, please let me know if more are needed.
(PyPy test failures related to the reactor need to be fixed before merging)
PyPy tests fail because on PyPy exceptions raised in `Crawler.crawl()` are silently discarded or handled somewhere, while on CPython they are unhandled and show up in the log. This happens on master as well.
> PyPy tests fail because on PyPy exceptions raised in `Crawler.crawl()` are silently discarded or handled somewhere, while on CPython they are unhandled and show up in the log. This happens on master as well.
We rely on the "Unhandled error in Deferred" messages that are printed when such Deferreds are garbage-collected (we call `CrawlerProcess.crawl()` without storing the Deferred it returns), but on PyPy something works differently and they are not printed. So in tests where we expect an unhandled error, we should instead add an explicit errback.
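A self-contained sketch of that idea, using a spider that deliberately fails on creation (the failing spider and the `failures` list are illustrative, not from the PR):

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class FailingSpider(scrapy.Spider):
    name = "failing"

    def __init__(self, *args, **kwargs):
        raise ValueError("boom")  # makes the crawl() Deferred errback

failures = []
process = CrawlerProcess()
# Keep the Deferred and record the failure explicitly instead of relying
# on the GC-time "Unhandled error in Deferred" message, which PyPy
# doesn't print here.
d = process.crawl(FailingSpider)
d.addErrback(failures.append)
process.start()
assert failures
```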
The PR looks good to me, but we should make the tests pass.
`failure.printTraceback()` doesn't work on Twisted 18.9.0 on Python 3.8+. This is fixed in Twisted 19.7.0, and the first Twisted version that officially supports Python 3.8 is even higher, 21.2.0. We could replace the logging code in these tests, or we could bump the minimal Twisted version, even though spiders work with 18.9.0 too.
Though it looks like using `twisted.python.log.err()` still prints the exception we need, so we can just switch to it.
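The switch is then just a change in the errback, e.g.:

```python
from twisted.python import log

def errback(failure):
    # failure.printTraceback() breaks on Twisted 18.9.0 with Python 3.8+,
    # but log.err() still prints the exception the tests look for.
    log.err(failure)
```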
The tests now pass.
Continuation of #1580, without code changes for now.
Several tests fail, because you can no longer set `TWISTED_REACTOR` and `LOG_FILE`/`LOG_LEVEL` in `custom_settings`. Which is a problem. We can try moving the reactor installation to `crawl()`, but the logging problem was discussed in the old PR and doesn't have a good solution.

Also, I think we should deprecate running `Crawler.crawl()` multiple times as a separate uncontroversial PR, as proposed in the old one (done: #6040).

Closes #1580, fixes #1305, fixes #2392, fixes #3663.