jdemaeyer closed this 9 months ago.
I think one of the major open questions is where the add-on manager should live. Here's an excerpt from my proposal on that (I have opted for the inside-of-crawler option for the SEP so far):
The question where the add-on manager should be instantiated and when the add-on callbacks should be called is not as obvious as it may seem at first glance. Outlined below are three options and their rationale:
Besides the different priority and level of direct user involvement, settings made by the add-on callbacks are not different from settings made in `settings.py`. The callbacks should therefore be called around the same time that the settings module is loaded. This puts the instantiation of an `AddonManager` outside of the `Crawler` instance.
Given that the project settings are read in the `get_project_settings()` function in `scrapy.utils.project`, this seems a reasonable place to call `AddonManager.update_settings()`. However, we cannot instantiate the add-on manager within this function, as the function is left (and the manager would therefore be lost) long before the crawler becomes ready (when we wish to call the second round of add-on callbacks).

There are two possible approaches to instantiating the add-on manager outside of `get_project_settings()`:
1. `get_project_settings()` is embedded into the `execute()` function in `scrapy.cmdline`. In summary, the changes to this function would be (with analogous changes in `scrapy/conf.py`, where necessary, for backwards-compatibility):
   - instantiate the add-on manager before calling `get_project_settings()`,
   - hand it to `get_project_settings()` when calling it (the function then calls `update_settings()`),
   - connect its `check_configuration` method to the `engine_started` signal (this could also be done in the add-on manager's `init()` method).
2. The add-on manager becomes a new singleton (e.g. in `scrapy.addonmanager.addonmanager`). This would allow moving the above code related to add-on management into the more appropriate `get_project_settings()` function.

Integrating add-on management outside of the crawler ensures that settings management, except for spider-specific settings, remains within a rather small part of the start-up process (before instantiating a `Crawler`): the helper function `get_project_settings()` indeed keeps returning the full set of project (i.e. non-spider) settings. The downside is that it either introduces a new singleton or clutters up a function (`execute()`) that should be more about parsing command line options and starting a crawler process than about loading add-ons and settings.
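To make the trade-off concrete, here is a minimal sketch of the outside-of-crawler wiring. All classes and names below (`AddonManager`, `HttpCacheAddon`, the simplified `get_project_settings()`) are illustrative stand-ins with simplified callback signatures, not Scrapy's actual implementations:

```python
# Sketch: execute() instantiates the manager and hands it to
# get_project_settings(), which runs the first round of callbacks.

class AddonManager:
    def __init__(self, addons=None):
        self.addons = list(addons or [])

    def update_settings(self, settings):
        # first round of callbacks: every add-on may modify the settings
        for addon in self.addons:
            addon.update_settings(settings)

    def check_configuration(self, crawler):
        # second round, to be connected to the engine_started signal
        for addon in self.addons:
            addon.check_configuration(crawler)


def get_project_settings(addonmgr=None):
    settings = {'BOT_NAME': 'mybot'}  # stand-in for reading settings.py
    if addonmgr is not None:
        addonmgr.update_settings(settings)
    return settings


class HttpCacheAddon:
    def update_settings(self, settings):
        settings['HTTPCACHE_ENABLED'] = True

    def check_configuration(self, crawler):
        pass


# roughly what execute() in scrapy.cmdline would do:
addonmgr = AddonManager([HttpCacheAddon()])
settings = get_project_settings(addonmgr)
# later: the engine_started signal would fire check_configuration()
```

The settings object returned to the rest of the start-up code already carries the add-on changes, which is the main selling point of this option.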
The settings used to crawl are not complete until the spider-specific settings have been loaded in `Crawler.__init__()`. Add-on management could follow this approach and only start loading add-ons when the crawler is initialised. Instantiation and the call to `AddonManager.update_settings()` would happen in `Crawler.__init__()`. The final checks (i.e. the callback to `AddonManager.check_configuration()`) could then again either be tied to the `engine_started` signal, or coded into the `Crawler.crawl()` method after creating the engine.

Integrating add-on management inside of the crawler avoids introducing a new singleton or cluttering up the `execute()` function, but rips apart the compilation of the complete configuration settings. This is especially critical since many of the settings previously made in `settings.py` will move to `scrapy.cfg` with the implementation of this proposal, which may prove backwards-incompatible since the `get_project_settings()` helper function would no longer work as expected (and as its name suggests).
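For comparison, here is a minimal sketch of the inside-of-crawler option. The `Crawler`, `AddonManager` and `StatsAddon` classes are simplified stand-ins with simplified signatures, for illustration only:

```python
# Sketch: the manager is created in Crawler.__init__() and the final
# checks run in crawl() after the engine exists.

class AddonManager:
    def __init__(self, addons):
        self.addons = addons

    def update_settings(self, settings):
        for addon in self.addons:
            addon.update_settings(settings)

    def check_configuration(self, crawler):
        for addon in self.addons:
            addon.check_configuration(crawler)


class Crawler:
    def __init__(self, settings, addons):
        self.settings = dict(settings)
        self.addons = AddonManager(addons)
        # spider-specific settings would also be merged around here
        self.addons.update_settings(self.settings)
        self.engine = None

    def crawl(self):
        self.engine = object()  # stand-in for self._create_engine()
        # final callbacks, now that the real component state exists
        self.addons.check_configuration(self)


class StatsAddon:
    def update_settings(self, settings):
        settings['STATS_CLASS'] = 'myproject.stats.CustomStats'

    def check_configuration(self, crawler):
        assert crawler.engine is not None  # engine exists by now


crawler = Crawler({'BOT_NAME': 'mybot'}, [StatsAddon()])
crawler.crawl()
```

Note how nothing outside the crawler ever sees the add-on-modified settings, which is exactly the `get_project_settings()` drawback described above.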
Hey @jdemaeyer,
I've read your proposal and the SEP for the first time, so please excuse me if I'm missing some context :)
1. Is it a coincidence that scrapy.Spider implements an update_settings method which addons will implement?
2. Addons are utilities to auto-update and check Settings objects? Why do they need Crawler?
3. How can a user activate an addon when crawling is started by CrawlerRunner or CrawlerProcess?
4. How can a user enable an addon if there is no Scrapy project?
5. How to enable an addon if `scrapy runspider` is used to start a spider?
6. Why is scrapy.cfg needed? It is confusing to have two config files, and it is unnecessary boilerplate if you don't have a Scrapy project.
7. In the SEP it is said that MiddlewareManager.from_settings() will be changed to allow python objects instead of class paths. Why should it be specific to MiddlewareManager? (see discussion at https://github.com/scrapy/scrapy/pull/1215)
8. Do you think it is possible to make the Scrapy addons feature less framework-ish? In the current proposal Scrapy calls various AddonManager methods during the standard `scrapy crawl` init sequence, and AddonManager calls various addon methods itself. Can we make it so that the user can also use the AddonManager (or an individual addon) to update or check settings, without invoking all the `scrapy crawl` machinery?
9. It looks like the proposed AddonManager does two unrelated things:
   - it loads addon classes from scrapy.cfg;
   - it uses loaded addons to update settings and check the crawler.

   IMHO it is better to separate those. Scrapy users who don't use the `scrapy crawl` command may prefer to import addon classes themselves, or these classes can be in the same module as the CrawlerProcess call and a spider, all in the same self-contained Python script.
// a rant: I think it is usually a bad idea to name something ...Manager :) "Manager" tells us that a class "manages" something. That is quite broad, so 1) often such classes end up doing several unrelated things, and 2) it is not clear what the class is doing just from its name - it can do anything. In Scrapy we renamed SpiderManager to SpiderLoader because of that. It turns out this is not only my opinion.
Hey @kmike, thanks for the feedback!
1. Is it a coincidence that scrapy.Spider implements update_settings method which addons will implement?
That is on purpose. `Spider.update_settings()` and the add-on `update_settings()` both do roughly the same, i.e. update settings outside of `settings.py`, albeit with a different level of "generality".
2. Addons are utilities to auto-update and check Settings objects? Why do they need Crawler?
Add-ons should create a plug-and-play experience for the user. For checking that everything "Works As Advertised™", and providing help when it doesn't, they need to know the actual state of the crawler, and not just what its state is supposed to be according to the configuration. E.g. an extension listed in `settings['EXTENSIONS']` could have thrown `NotConfigured`. Add-ons can only detect this if they have access to the crawler.
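To illustrate why the settings alone are not enough, here is a small self-contained sketch (all class names are made up) where an extension is listed in the settings yet silently opts out via `NotConfigured`, which only an inspection of the crawler's actual extension list can reveal:

```python
# Sketch: a check_configuration() callback that inspects the crawler
# rather than trusting the settings.

class NotConfigured(Exception):
    pass

class OffsiteStatsExtension:
    def __init__(self, settings):
        if not settings.get('STATS_ENABLED'):
            raise NotConfigured  # extension silently opts out

class StubCrawler:
    """Stand-in for a real crawler: builds only the extensions that
    did not raise NotConfigured."""
    def __init__(self, settings):
        self.settings = settings
        self.extensions = []
        for cls in settings.get('EXTENSIONS', []):
            try:
                self.extensions.append(cls(settings))
            except NotConfigured:
                pass

def check_configuration(crawler):
    # add-on callback: verify the extension actually got enabled
    return any(isinstance(ext, OffsiteStatsExtension)
               for ext in crawler.extensions)

crawler = StubCrawler({'EXTENSIONS': [OffsiteStatsExtension],
                       'STATS_ENABLED': False})
# the settings list the extension, but the crawler shows it is inactive
assert check_configuration(crawler) is False
```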
3. How can a user activate an addon when crawling is started by CrawlerRunner or CrawlerProcess?
Hm, don't both of these instantiate `Crawler` objects, where the add-on manager would then be instantiated and load add-ons? Then again, I guess it comes down to whether a `scrapy.cfg` exists in these use cases (also see the answer to 6).
4. How can a user enable an addon if there is no Scrapy project? 5. How to enable an addon if scrapy runspider is used to start a spider?
That would not be possible with the current suggested implementation, just like you cannot have a `settings.py` file or use `scrapyd-deploy` in these cases. It could be made possible if add-on configuration were to live in `settings` using per-spider settings. I'll put the (current) rationale for `scrapy.cfg` below.
6. Why is scrapy.cfg needed? It is confusing to have two config files, and it is unnecessary boilerplate if you don't have a Scrapy project.
I opted for `scrapy.cfg` for two reasons:
It makes it easy to keep settings local through the ini sections. E.g. something like this:
```ini
[mysql-pipe]
server = some.server
port = 1234
user = a
password = b

[mongodb-pipe]
server = some.other.server
user = c
password = d
```
is much easier in `scrapy.cfg` than in `settings.py`, and would not have to expose anything into the `settings` object, avoiding possible name clashes.
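As a sketch of how such ini sections could stay local: Python's `configparser` can turn each section into a private per-add-on config dict without creating any global settings. The section names reuse the example above; the loader itself is hypothetical, not part of the proposal's actual code:

```python
# Sketch: each scrapy.cfg section becomes that add-on's own config dict,
# so nothing leaks into the global settings namespace.
import configparser

CFG = """
[mysql-pipe]
server = some.server
port = 1234
user = a
password = b

[mongodb-pipe]
server = some.other.server
user = c
password = d
"""

parser = configparser.ConfigParser()
parser.read_string(CFG)

# {'addon name': <addon config>} -- no global MYSQL_* settings needed
addon_configs = {section: dict(parser.items(section))
                 for section in parser.sections()}
```

Note that ini values arrive as strings (`'1234'`, not `1234`), so an add-on would still need to coerce types itself.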
I think if done properly, users won't have to touch `settings.py` for any of the shipped add-ons at all (instead they could set, say, `HTTPCACHE_DIR` in the `scrapy.cfg`; we could even tidy up the settings namespace this way). Those users that then need to touch `settings.py` will most probably know what they're doing anyway.
7. In the SEP it is said that MiddlewareManager.from_settings() will be changed to allow python objects instead of class paths. Why should it be specific to MiddlewareManager? (see discussion at #1215)
Uh, I didn't see #1215 before, that looks good. The SEP goes a little further than the PR though: it suggests allowing not only classes, but instances. This will allow add-ons to instantiate components and keep settings as local as possible. Objects retrieved from `X = load_object()` are (so far) always assumed to be classes and are then instantiated with `y = X()`. That should probably stay that way, and it is a good question where the "allow instances" code could live so that it is as generic as possible.
8. Do you think it is possible to make the Scrapy addons feature less framework-ish? In the current proposal Scrapy calls various AddonManager methods during the standard scrapy crawl init sequence, and AddonManager calls various addon methods itself. Can we make it so that the user can also use the AddonManager (or an individual addon) to update or check settings, without invoking all the scrapy crawl machinery?
Hm, I'd say it is already quite modular. The user could in principle instantiate their own `AddonManager` and hand it the `.cfg` file. Or they could load an add-on object and call its `update_settings()` method directly.
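A sketch of that "no machinery" usage, assuming an `update_settings(config, settings)` callback signature; `MysqlPipelineAddon`, its config keys, and the pipeline path are invented for illustration:

```python
# Sketch: an add-on is just an object with callbacks, so a stand-alone
# script can drive it directly, without scrapy crawl or an AddonManager.

class MysqlPipelineAddon:
    def update_settings(self, config, settings):
        # enable the pipeline and copy its connection config over
        pipelines = settings.setdefault('ITEM_PIPELINES', {})
        pipelines['myproject.pipelines.MysqlPipeline'] = 300
        settings['MYSQL_SERVER'] = config['server']

settings = {}
addon = MysqlPipelineAddon()
addon.update_settings({'server': 'some.server'}, settings)
# settings could now be passed to CrawlerProcess in the same script
```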
9. It looks like the proposed AddonManager does two unrelated things:
- it loads addon classes from scrapy.cfg;
- it uses loaded addons to update settings and check the crawler.

IMHO it is better to separate those. Scrapy users who don't use the scrapy crawl command may prefer to import addon classes themselves, or these classes can be in the same module as the CrawlerProcess call and a spider, all in the same self-contained Python script.
That is a good point. The Manager idea arose from the #591 discussion, but now that I come to think of it, splitting it up seems to make more sense. Also, it would avoid having a "...Manager"-named class :). The loaded add-ons need to live somewhere between loading <-> updating <-> checking, though. With the current SEP, that could be in an attribute of `Crawler` (much like extensions).
Maybe like this?
- An `AddonLoader` class, or even only a function, is handed the `scrapy.cfg` file and provides/returns an `{ 'addon name': (<addon object>, <addon config>) }` dict. The `check_dependency_clashes()` functionality also lives within this class/function.
- The `addons` dict is given to the `Crawler` object upon instantiation, much like `settings`.
- `Crawler` calls the add-ons' `update_settings()` and `check_configuration()` functions at appropriate times.
- Should the user not want `Crawler` to call the callbacks (e.g. because she does it somewhere else), she hands over an empty addons dict on instantiation. In this case, it might be tricky that the `check_configuration()` callback requires the crawler object.

I'm still quite unhappy about the `scrapy.cfg`/`settings.py` issue. It is indeed annoying overhead to have two configuration files. But then, I think it makes sense that add-ons and their configuration are not present in any `Settings` instance, and the `scrapy.cfg` syntax is very much suited for user-friendliness and local settings.
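A rough sketch of what such a loader function might look like. The `load_addons` helper and its `registry` argument (standing in for resolving dotted paths with `load_object`) are hypothetical, not actual proposal code:

```python
# Sketch: a loader is handed scrapy.cfg and returns the
# {'addon name': (<addon object>, <addon config>)} dict that a Crawler
# could then receive on instantiation.
import configparser

def load_addons(cfg_text, registry):
    """registry maps section names to addon objects; in the real thing
    this would be load_object() on a dotted path."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    addons = {}
    for section in parser.sections():
        addon = registry[section]            # would be load_object(section)
        config = dict(parser.items(section))
        addons[section] = (addon, config)
    # check_dependency_clashes(addons) would also run here
    return addons

class DummyAddon:
    pass

addons = load_addons("[my-addon]\nuser = a\n", {'my-addon': DummyAddon()})
```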
Hi @jdemaeyer,
That is on purpose. Spider.update_settings() and the add-on update_settings() both do roughly the same, i.e. update settings outside of settings.py, albeit with a different level of "generality"
Hm, do you think we can use exactly the same code path for Spider then? I.e. maybe a spider can implement the addon interface itself? I'm not sure when it can be useful, but why not :) This way a spider will be able not only to update settings, but also to check them.
It makes it easy to keep settings local through the ini sections. E.g. something like this: (... snip ...) is much easier in scrapy.cfg than in settings.py, and would not have to expose anything into the settings object, avoiding possible name clashes.
I don't think it is a problem to separate sections in settings.py. Django-like approach works:
```python
MYSQL_PIPE = dict(
    server='some-server',
    port=1234,
    user='foo',
    password='bar',
)
```
Maybe like this? ...
Yeah, your approach looks good to me.
Some questions:
Why are `{ 'addon name': (<addon object>, <addon config>) }` dicts needed, what are the keys ('addon names') for? Can it be a list of `(<addon class>, <addon config>)` pairs, or just a list of instantiated addon objects?
I'm still quite unhappy about the scrapy.cfg/settings.py issue. It is indeed annoying overhead to have two configuration files. But then, I think it makes sense that add-ons and their configuration are not present in any Settings instance, and the scrapy.cfg syntax is very much suited for user-friendliness and local settings.
What's the problem with having an ADDONS list in settings.py? It means one less config file and more flexibility - addons will be able to disable or enable or check other addons. This way you also avoid changing the Crawler API and other related APIs - no need to add an 'addons' argument. I think that two config files are less user-friendly than one config file. By default settings.py may contain only this ADDONS option.
I've written the add-on interface spec and two add-on loaders, one loading from `scrapy.cfg` and one loading from `settings.py`.

Is there a way I can retrieve the project module name? I could use the first part of `os.environ['SCRAPY_SETTINGS_MODULE']`, but it seems to me that there should be a cleaner way.
Again, thanks for the feedback @kmike!
Hm, do you think we can use exactly the same code path for Spider then?
I think that should be possible, but I'm not sure how to enforce that the `update_settings()` method is not called twice.
I don't think it is a problem to separate sections in settings.py. Django-like approach works:
One problem with the current SEP is that the add-on name (= variable name) is allowed to be a python path to the add-on, but python does not allow dots in variable names. I have therefore allowed a `_name` variable in the `settings.py` dicts for now.
Why are `{ 'addon name': (<addon object>, <addon config>) }` dicts needed, what are keys ('addon names') for? Can it be a list of `(<addon class>, <addon config>)` pairs, or just a list of instantiated addon objects?
The keys are simply whatever the add-on was called in `scrapy.cfg`/`settings.py`. I'm not yet fully sure that we can drop them this early, so I'd prefer to leave them in. It cannot be a list of instantiated add-on objects with the current SEP, because we also allow add-ons to be modules right now (i.e. they cannot be instantiated).

I don't really like the idea of exposing add-ons in the global settings, because they are essentially "one level above" it: add-ons fiddle with settings and extensions, but never the other way around. That's why I've designed the settings module add-on loader to only load variables starting with `addon_` (in lower case, so they're not picked up by `Settings.setmodule()`) so far. It would be cool to allow add-ons to load/configure other add-ons, though. In that case an add-on manager could become useful again...
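A sketch of the `addon_` loader behaviour described above, with a fake settings module built via `types.ModuleType`; the `load_addon_configs` helper is hypothetical, not the actual loader:

```python
# Sketch: only lower-case variables starting with addon_ are treated as
# add-on configs, and the _name key may hold a dotted path (since dots
# are not legal in variable names).
import types

def load_addon_configs(module):
    configs = {}
    for var, value in vars(module).items():
        if var.startswith('addon_') and isinstance(value, dict):
            # fall back to the variable name if no _name is given
            name = value.get('_name', var[len('addon_'):])
            configs[name] = {k: v for k, v in value.items() if k != '_name'}
    return configs

mod = types.ModuleType('settings')
mod.addon_MYPIPELINE = dict(_name='path.to.mypipeline', user='blah')
mod.HTTPCACHE_ENABLED = True  # ordinary setting, ignored by the loader

configs = load_addon_configs(mod)
```

Because the prefix is lower-case, `Settings.setmodule()` (which only picks up upper-case names) would skip these variables, as the text describes.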
After some more coding, I've decided to merge the `AddonLoader` and what I called the `AddonHolder` back into an `AddonManager`. There are just too many cross-dependencies: I wanted the holder to still be able to load add-ons last minute (allowing add-ons to load other add-ons, though I'm not exactly sure whether we want to allow this) and check dependency issues, and at the same time get rid of the confusing `{ used_key: (addon, configuration) }` structure of the loader.

Now the `AddonManager` provides two class methods that the user may use when she doesn't want to use the 'holding' part, and provides convenience methods to add add-ons in case she doesn't want to use the 'loading' part.

I have integrated the add-on system into Scrapy's start-up process (it currently reads from both `scrapy.cfg` and the settings module), but have done no testing of a "fully integrated" system so far.
One issue currently in my head is: do we allow add-ons to load/configure other add-ons? I have not fully thought through the implications of this. The main problem I see is that, at the moment the first callback of an add-on is called (`update_settings()`), other add-ons might already have been called. I could see this work if we require `update_settings()` to not do any settings introspection/dependency checking/interaction with other add-ons whatsoever. Being able to load/configure an add-on at this point could be useful for "umbrella add-ons" that load other add-ons depending on their configuration.
I've put some thought into the "add-ons loading/configuring other add-ons" issue. The big problem with the current implementation is that the first time code of an add-on is executed is during the first round of callbacks to `update_settings()`. Should an add-on load or reconfigure another add-on here, other add-ons might already have been called. While it is possible to ensure that the `update_settings()` method of the newly added add-on is called, there is no guarantee (and in fact, it is quite unlikely) that all add-ons see the same add-on configuration in their `update_settings()`.
I see three possible approaches to this:

1. Disallow add-ons from loading or configuring other add-ons altogether.
2. Disallow add-ons from doing settings introspection in `update_settings()`, instead only allow them to do changes to the `settings` object or load other add-ons. In this case, configuring already enabled add-ons should be avoided, as there is no guarantee that their `update_settings()` method has not already been called.
3. Introduce an additional callback, `update_addons(config, addonmgr)`, to the add-on interface. Only loading and configuring other add-ons should be done in this method. While it may be allowed, developers should be aware that depending on `config` (of their own add-on, i.e. the one whose `update_addons()` is currently called) is fragile, as once again, there is no guarantee in which order add-ons will be called back and whether some other add-on will want to touch `config`.

Currently I have implemented option 2. I have not quite made up my mind about it just yet, but I think option 3 is my favourite for now.
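A sketch of how option 3's separate callback might work for an "umbrella add-on". The `AddonManager` stub, the `update_addons()` semantics, and the add-on classes are all assumptions for illustration:

```python
# Sketch: update_addons(config, addonmgr) runs as its own round, before
# any update_settings() callbacks, purely to enable further add-ons.

class AddonManager:
    def __init__(self):
        self.addons = {}

    def add(self, name, addon, config=None):
        self.addons[name] = (addon, config or {})

    def update_addons(self):
        # iterate over a snapshot; this round may still grow self.addons
        for name, (addon, config) in list(self.addons.items()):
            if hasattr(addon, 'update_addons'):
                addon.update_addons(config, self)


class MysqlExportAddon:
    pass

class DatabaseUmbrellaAddon:
    """Umbrella add-on: enables a backend add-on based on its config."""
    def update_addons(self, config, addonmgr):
        if config.get('backend') == 'mysql':
            addonmgr.add('mysql-export', MysqlExportAddon())


mgr = AddonManager()
mgr.add('db-umbrella', DatabaseUmbrellaAddon(), {'backend': 'mysql'})
mgr.update_addons()
# afterwards the update_settings() round would run over all add-ons
```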
I'm trying to write a test that ensures the crawler breaks down when an add-on's `check_configuration()` callback raises an exception, but I don't know enough about Twisted to understand why it doesn't work :/

Here's the test, put into the `CrawlTestCase` in `tests/test_crawl.py`:
```python
@defer.inlineCallbacks
def test_abort_on_addon_failed_check(self):
    class FailedCheckAddon(Addon):
        NAME = 'FailedCheckAddon'
        VERSION = '1.0'

        def check_configuration(self, crawler):
            raise ValueError

    addonmgr = AddonManager()
    addonmgr.add(FailedCheckAddon())
    crawler = get_crawler(SimpleSpider, addons=addonmgr)
    yield self.assertFailure(crawler.crawl(), ValueError)
```
And this is the output I get from tox:
```
====================================================== FAILURES ======================================================
___________________________________ CrawlTestCase.test_abort_on_addon_failed_check ___________________________________
NOTE: Incompatible Exception Representation, displaying natively:
DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x7ff8a481a950 [59.7946248055s] called=0 cancelled=0 LoopingCall<60>(Downloader._slot_gc, *(), **{})()>
```
I've also tried `assertRaises`, which tells me:

```
FailTest: ValueError not raised (<Deferred at 0x7fea3d2b3e60 current result: <twisted.python.failure.Failure <type 'exceptions.ValueError'>>> returned)
```
The twisted error when testing `check_configuration()` is a little tricky, and it's actually something that wasn't handled before.
```
====================================================== FAILURES ======================================================
___________________________________ CrawlTestCase.test_abort_on_addon_failed_check ___________________________________
NOTE: Incompatible Exception Representation, displaying natively:
DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x7ff8a481a950 [59.7946248055s] called=0 cancelled=0 LoopingCall<60>(Downloader._slot_gc, *(), **{})()>
```
This output means there's a pending call that wasn't stopped properly. In this case it is a twisted LoopingCall started by the Downloader. The downloader is stored in the engine, so that call reference can be accessed in a crawler as `crawler.engine.downloader._slot_gc_loop`.

The engine is instantiated in `Crawler.crawl()`. The problem is that this call is already running when `check_configuration()` is called and fails in FailedCheckAddon, but there is no place that addresses that and closes the downloader (which stops the looping call) properly. Not sure why that wasn't handled for errors happening after creating the engine; I guess it's not noticeable unless running tests that tell you about it.

So, there are two ways of fixing this. The first one is to move the line calling `addons.check_configuration()` above the instantiation of the engine. This would fix the test but doesn't help solve the underlying problem. The second approach would be to close the downloader if any problem arises from the calls after the engine creation. Something like:
```python
try:
    self.spider = self._create_spider(*args, **kwargs)
    self.engine = self._create_engine()
    self.addons.check_configuration(self)
    ...
except Exception:
    self.crawling = False
    if self.engine is not None:
        self.engine.downloader.close()
    raise
```
should be enough. If you want to, you can implement this fix in this PR (or in another one), or we can implement it at some other time and follow the first approach for now.
Last thing about the test: I'd refrain from using `self.assertFailure()` if possible; it's something present in `trial` and not in the builtin `unittest`, and it can be omitted. This should work too:
```python
@defer.inlineCallbacks
def test_abort_on_addon_failed_check(self):
    class FailedCheckAddon(Addon):
        NAME = 'FailedCheckAddon'
        VERSION = '1.0'

        def check_configuration(self, crawler):
            raise ValueError

    crawler = get_crawler(SimpleSpider)
    crawler.addons.add(FailedCheckAddon())
    with self.assertRaises(ValueError):
        yield crawler.crawl()
```
I think there are some things to discuss about the design; I took notes on some questions and suggestions I had so we can talk them through.
There are some concerns about the usage of the `scrapy.cfg` file to set the addons configuration. I agree here with @kmike that we need to support using the current settings, for a couple of reasons:

- Settings of different components interact with each other (for example, the `retry` settings with the `redirect` middlewares).
- `scrapy.cfg` can't be the only way of configuring addons, because having a project is not required for running scrapy; setting core components is something that should be possible with single spiders.
- Addons should be configurable per spider, e.g. via `custom_settings`.

Both options, using the current settings and the `scrapy.cfg` files, could coexist, but if there is any complication keeping both we should favor the first one.
On a related note, I'm not sure it's a good idea to discard the `_ENABLED` settings (read about discarding them in the SEP), especially if we're keeping the possibility of dynamic configuration. Users may want to disable a component temporarily without deleting or commenting out the whole `.cfg` section.
Some miscellaneous comments on the addons interface:

- Related with what I mentioned before, if we lose settings locality, is it required to instantiate the addons so early in the loading process? Could `update_settings()` and `check_configuration()` be class methods?
- Is it required to have a `check_configuration()` method in the addons? I understand `update_settings()` being required (addons' only connection with the crawling engine is by updating the settings, so there is no point in not having this method), but enforcing `check_configuration()` seems a little restrictive. We could check in `AddonManager.check_configuration()` if the method exists or not and call it in the first case.

Pheew okay, improving the error handling in `Crawler.crawl()`
took me an awful lot of time and I learned quite a bit about Deferreds and Scrapy's architectural details. I'm still really unhappy, so here's what I have, up for discussion. ;) (Sorry for the wall of text, I'm doing this also to sort out my thoughts)
So I put in the

```python
if self.engine is not None:
    self.engine.downloader.close()
```

in the `Crawler.crawl()` `except` clause as you suggested, and the test I had worked fine. But of course I wanted to introduce that code The Right Way™, so I wrote another test that checks whether the crawler shuts down gracefully when there's any error during the crawl startup process.
The test is in `tests.test_crawl.CrawlTestCase.test_graceful_crawl_error_handling`. I mocked `ExecutionEngine.start` to throw an exception and asserted that `Crawler.crawling == False` and that the downloader's `close` method has been called (although I couldn't figure out how to properly wrap the mock around the original method):
```python
@defer.inlineCallbacks
def test_graceful_crawl_error_handling(self):
    crawler = get_crawler(SimpleSpider)

    class TestError(Exception):
        pass

    from scrapy.core.downloader import Downloader
    Downloader._orig_close = Downloader.close

    def close_downloader(*args, **kwargs):
        # Ugly work-around since I couldn't get supplying
        # wraps=Downloader.close to the mock to work...
        Downloader._orig_close(crawler.engine.downloader)

    with mock.patch('scrapy.crawler.ExecutionEngine.start') as mock_es, \
            mock.patch.object(Downloader, 'close') as mock_dc:
        mock_es.side_effect = TestError
        mock_dc.side_effect = close_downloader
        with self.assertRaises(TestError):
            yield crawler.crawl()
    self.assertFalse(crawler.crawling)
    self.assertTrue(mock_dc.called)
```
Now I got a different delayed call remaining after the test: `<DelayedCall 0x7f069902cab8 [59.7916400433s] called=0 cancelled=0 LoopingCall<60.0>(LogStats.log, *(<SimpleSpider 'simple' at 0x7f06992bbd90>,), **{})()`. I figured (after a while) that I needed to close the spider through the engine. However, as `ExecutionEngine.close_spider()` also closes the downloader, I need a case separation for the graceful shutdown. So here's what it looks like now:
```python
try:
    self.spider = self._create_spider(*args, **kwargs)
    self.engine = self._create_engine()
    self.addons.check_configuration(self)
    start_requests = iter(self.spider.start_requests())
    yield self.engine.open_spider(self.spider, start_requests)
    yield defer.maybeDeferred(self.engine.start)
except Exception:
    self.crawling = False
    if self.engine is not None:
        if self.engine.spider is not None:
            self.engine.close_spider(self.spider)
        else:
            self.engine.downloader.close()
    raise
```
Both tests pass with this, but I think there are some issues. First, the `except` clause shouldn't have to figure out in which line the error occurred. This is probably better (exception handling) style but quite ugly:
```python
try:
    self.spider = self._create_spider(*args, **kwargs)
    self.engine = self._create_engine()
    try:
        self.addons.check_configuration(self)
        start_requests = iter(self.spider.start_requests())
        yield self.engine.open_spider(self.spider, start_requests)
    except Exception:
        self.engine.downloader.close()
        raise
    try:
        yield defer.maybeDeferred(self.engine.start)
    except Exception:
        self.engine.close_spider(self.spider)
        raise
except Exception:
    self.crawling = False
    raise
```
Maybe `ExecutionEngine` could get a new `close()` method that does the logic and decides whether it needs to stop the engine, close the spider, or close the downloader (or the `stop()` method could be updated)?
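A sketch of such an `ExecutionEngine.close()` idea, using stand-in stub classes rather than the real engine, just to show the teardown decision living in one place instead of in `Crawler.crawl()`'s `except` clause:

```python
# Sketch: close() tears down whatever actually got started, so callers
# never need to know how far start-up progressed before the failure.

class StubDownloader:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

class StubEngine:
    def __init__(self):
        self.downloader = StubDownloader()
        self.spider = None
        self.running = False

    def close_spider(self, spider):
        self.downloader.close()  # closing the spider also closes it
        self.spider = None

    def stop(self):
        self.running = False

    def close(self):
        """Decide what needs tearing down, in one place."""
        if self.running:
            self.stop()
        if self.spider is not None:
            self.close_spider(self.spider)
        else:
            self.downloader.close()

engine = StubEngine()  # failure happened before open_spider()
engine.close()         # still leaves no dangling downloader loop
```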
Then, there's the test itself. It should test behaviour and not implementation: I don't really care whether `Downloader.close()` was called, I just want to know whether the reactor is clean. But the downloader doesn't have a `status` attribute or anything similar. I could check whether the `Downloader._slot_gc_loop` was stopped, but again that would check implementation, not behaviour.
Maybe it's best to remove the `Downloader.close()` mock and rely on `twisted.trial` to check whether the reactor is clean after the error shutdown?
Last thing about the test: I'd refrain from using self.assertFailure() if possible, it's something present in trial and not in the builtin unittest that can be omitted. This should work too:
Hm, why do we subclass trial's `TestCase` but then refrain from using its methods? I have (temporarily?) switched back to `assertFailure` because the twisted version in `precise` does not support using `assertRaises` as a context manager, and replacing trial's `TestCase` with the built-in one lets the tests pass but warns me that `LogCapture` instances are not uninstalled by shutdown (and I wasn't sure how bad that is ;))
Hey everyone, thanks again for the feedback (especially for the full review @curita :)), I really appreciate it. Here's my two cents on some of the design issues (once again, caution, wall of text):
After I had to discuss the add-on management in a blog post a couple of days ago, I have backpedaled a little on my preference. `scrapy.cfg` is a place for those settings that are not necessarily bound to any particular project (if I have multiple projects I probably still want to upload them to the same `scrapyd` server). Add-on configuration is (most commonly) bound to a project, much like other crawling settings, and I agree that it should therefore live inside the project. If we don't want to introduce a new settings file (which I think would be a bad idea), that leaves us with the project settings module.
I see two possible syntaxes for this:
1. Each add-on has its own variable, like it is implemented right now, and is not exposed into the global settings:

```python
addon_MYPIPELINE = dict(
    _name = 'path.to.mypipeline',
    user = 'blah',
    ...)

addon_MYEXT = dict(...)
```
2. There is an `ADDONS` dictionary setting that lives among all other global settings:

```python
ADDONS = {
    'path.to.mypipeline': {
        'user': 'blah',
        ...
    },
    'path.to.myext': {...},
}

# or

ADDONS = {
    'path.to.mypipeline': dict(
        user = 'blah',
        ...
    ),
    'path.to.myext': dict(...),
}
```
The first approach does not mix settings and add-ons, following the 'add-ons are one level above settings' philosophy I had in mind earlier. However, the syntax is not very user-friendly, particularly having to use the `_name` key. The second approach puts add-ons onto the same level as the settings they are intended to fiddle with. I don't see an immediate drawback of this, but it strikes me as odd design. I also find the nested dictionaries, like those used for `logging.config.dictConfig()`, a little tedious. The user experience would be much more cumbersome than what we had initially hoped for. The big, big plus is that it would integrate really well with custom Spider settings.
Now the user might want to enable an add-on system-wide, and I think we could keep the support for `scrapy.cfg` add-on configuration for this use case only. However, the docs should emphasise that this configuration does not belong to a project and is therefore not deployed.
I agree that there should be a way to enable and configure add-ons for stand-alone spiders. That would be no problem at all when using syntax 2 from above, so I guess that is what it will boil down to. We could keep syntax 1 if we introduced some new mechanism to enable/configure add-ons from within a spider, but that seems unnecessarily confusing. And frankly, I think syntax 1 is just as ugly as syntax 2, if not a little uglier. ;D
First I want to point out that the (local) add-on configuration is not inaccessible to other add-ons or components. All add-on configurations are available to all components via the `crawler.addons.configs` dictionary. They just no longer clog up the global settings namespace. That being said, for backwards compatibility we should definitely keep the global settings for all builtin components. The idea I have for the builtin add-ons right now is that they simply take their configuration and expose it into the global settings namespace, much like in `scrapy.addons.Addon.update_settings()`. The components themselves won't be touched at all. This would allow users to configure built-in add-ons/components through both the new (add-on config) and the old (settings module) way, and keep overwriting component configuration ad hoc via Spider settings or the command line possible.
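A sketch of that "gateway" behaviour. The `settings_prefix` mechanism and the `update_settings(config, settings)` signature are assumptions based on the description above, not the actual `scrapy.addons.Addon` code:

```python
# Sketch: a built-in add-on takes its local config and exposes it into
# the global settings under the component's traditional setting names,
# so the component itself needs no changes at all.

class HttpCacheAddon:
    settings_prefix = 'HTTPCACHE'

    def update_settings(self, config, settings):
        # enabling the add-on implies enabling the component...
        settings['%s_ENABLED' % self.settings_prefix] = True
        # ...and every config key maps onto a traditional setting name
        for key, value in config.items():
            settings['%s_%s' % (self.settings_prefix, key.upper())] = value


settings = {}
HttpCacheAddon().update_settings(
    {'expiration_secs': 60, 'ignore_http_codes': [500, 503]}, settings)
```

The untouched `HttpCacheMiddleware` would then read `HTTPCACHE_*` exactly as it always has, which is what keeps the old configuration path working.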
Related with what I mentioned before, if we lose settings locality, is it required to instantiate the addons so early in the loading process? Could update_settings() and check_configuration() be class methods?
I'm not quite sure what you mean; aren't the callbacks already class methods in the common use case (i.e. when the add-on interface is provided by a class instance, such as in the `scrapy.addons.Addon` class)? I have used the call to `Spider.update_settings()` as orientation for where the add-on callbacks should be placed. An add-on might want to replace the stats class, so I guess one line below where it is right now is where we should call them at the latest. Then again, I just realised there are already some settings that cannot be changed in add-ons right now, like the spider loader class. I'm curious what you had in mind?
+1 for dropping the requirement of `check_configuration()`, it is indeed very conceivable that many add-ons won't want to use this callback.
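A sketch of how the manager could treat `check_configuration()` as optional, as suggested above (stand-in classes, not the real `AddonManager`):

```python
# Sketch: only call check_configuration() on add-ons that define it,
# so the callback becomes optional in the add-on interface.

class AddonManager:
    def __init__(self, addons):
        self.addons = addons

    def check_configuration(self, crawler):
        for addon in self.addons:
            callback = getattr(addon, 'check_configuration', None)
            if callback is not None:
                callback(crawler)


class CheckingAddon:
    def __init__(self):
        self.checked = False

    def check_configuration(self, crawler):
        self.checked = True


class SilentAddon:
    # defines no check_configuration at all, and that's fine
    pass


checker = CheckingAddon()
AddonManager([checker, SilentAddon()]).check_configuration(crawler=None)
```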
I like the additional attributes for further describing the add-on, and reusing setuptools/distutils/packaging etc., in particular for replacing the clumsy version-check function I coded (I wasn't finished, I swear ;), but of course I could've figured that there's got to be a much better implementation ready out there).
As to the additional field for load order, I'm not really sure it helps us. As opposed to the current implementation of component orders, the user would have no (easy) influence on the add-on load order. That means there would need to be some kind of communication between devs, or a guide with recommended load-order numbers for different use cases. While that might be doable, it seems to me that this will become hard to maintain if full-blown dependency management is introduced at some point.
I'm just keeping this in here as a reminder to myself to think about it. As I outlined earlier, I think as soon as the first update_settings()
callback is touched, depending on whether we want to introduce the load order attribute, add-on configuration should be fixed. This only leaves the option of introducing a third (optional) callback, configure_addons()
, that is called before update_settings()
.
7eb837d is a rough version of my idea for the builtin add-ons, and it'd be cool to know if this is in line with the maintainers' ideas before I start writing tests and documentation.
In principle, the add-ons don't do much. All they do is act as a gateway to the 'traditional' settings. Neither the components nor Scrapy's default settings have been touched, so configuring components the old way still works as expected. Additionally, it allows changing the settings via the add-on mechanism. That is, if I want to enable and configure the HTTP cache middleware, instead of adding this to my settings.py
:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 60
HTTPCACHE_IGNORE_HTTP_CODES = [500, 503]
I can now alternatively use this in my scrapy.cfg
(or the equivalent in the ADDONS
setting as outlined in the comments above):
[addon:httpcache]
expiration_secs = 60
ignore_http_codes = 500,503
; enabled is already implicitly set through the existence of this section
I think this is close to the idea of the original SEP-021, with the addition that each add-on has its own section.
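A per-add-on section like the one above can be read with the standard library alone. This sketch is purely illustrative (the function name and section naming convention are assumptions, and configparser values stay strings, so real code would still need type conversion):

```python
import configparser


def read_addon_configs(cfg_text):
    """Hypothetical reader: every "[addon:<name>]" section of scrapy.cfg
    enables <name>, and its options become that add-on's local config."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    configs = {}
    for section in parser.sections():
        if section.startswith('addon:'):
            # 'addon:httpcache' -> key 'httpcache'
            configs[section.split(':', 1)[1]] = dict(parser[section])
    return configs


configs = read_addon_configs("""
[addon:httpcache]
expiration_secs = 60
ignore_http_codes = 500,503
""")
```

The mere existence of a section enables the add-on, matching the comment in the example above.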
If there ever comes a new milestone release where backwards-incompatible cleanup is encouraged, the add-on configuration could be removed from the global settings and live within the add-ons/components instead, making them fully independent of any Scrapy/settings ecosystem. Components with interdependencies could still expose some settings into the global settings namespace, or access other add-ons settings through the add-on manager. Or, as they are interdependent anyways, they could be configured through the same add-on.
Sorry I took so long to reply. I'm still unsure about the alternatives for the add-ons configuration in settings.py, but I'll comment some thoughts I have on the matter. Most of them are things you already said, but I wanted to write them down to get my head around this.
Add-on configuration entry point
I agree that it seems an odd design decision to have the add-ons config in the settings file, but it definitely simplifies things. Of the two presented options, the ADDONS variable seems the more flexible one. I'm not crazy about it, but I don't mind configuring add-ons this way in settings.py; django uses this approach with multiple nested dicts a lot.
What I'm unsure of is that I think we're overcomplicating how current components are configured. We would be changing these:
scrapy crawl myspider -s HTTPCACHE_EXPIRATION_SECS=60
class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'HTTPCACHE_EXPIRATION_SECS': 60,
    }
to these:
scrapy crawl myspider -s "ADDONS={'httpcache': {'expiration_secs': 60}}"
class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'ADDONS': {
            'httpcache': {
                'expiration_secs': 60,
            }
        }
    }
and I think that's sort of a usability step back.
I wouldn't mind doing this for external components that I probably won't be configuring too often, but for builtin components I'm not so sure; this could become too tedious too soon. I know we're still supporting the current global settings, but if we're going to favor a single config entry point, that one should be as carefully considered as possible.
For this reason I started to prefer your first design option, since it loses the outermost enclosing dictionary, but a) the add-on configurations should be loaded with the other settings so they can be configured using the command line and custom_settings, though I think adding them as ADDONS_* isn't much of a problem besides cluttering the settings a little, and b) the ADDONS_ prefix isn't really much different from the other approach; it's a workaround for defining a namespace.
So I somewhat dislike having to explicitly state that you're configuring an add-on if that can be deduced some other way. A third option could be to copy some other functionality from django here (I wonder how many times that has been said in scrapy's history :smile:), I'm thinking something like:
INSTALLED_ADDONS = (
    'scrapy.addons.builtins.depth',  # or 'depth' maybe
    'scrapy.addons.builtins.httperror',
    ...  # all builtins, this could be added to the settings.py template
    'path.to.mypipeline',
)

MYPIPELINE = {  # uppercased add-on name
    'user': 'blah',
    ...
}
Not sure if that's possible though, but I kind of like copying django's INSTALLED_APPS setting, and the separated add-on config looks a lot like the current settings:
HTTPCACHE = {
    'enabled': True,
    'expiration_secs': 60,
    'ignore_http_codes': [500, 503],
}
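The Django-style discovery being proposed can be sketched as follows. Everything here is illustrative: the function name, the "last dotted component, uppercased" convention, and the plain-dict settings are assumptions drawn from the examples above.

```python
def collect_addon_configs(installed_addons, settings):
    """Sketch: the last dotted component of each INSTALLED_ADDONS entry
    names the add-on, and its config lives in an uppercased setting of
    the same name (empty dict if the user configured nothing)."""
    configs = {}
    for path in installed_addons:
        name = path.rsplit('.', 1)[-1]
        configs[name] = settings.get(name.upper(), {})
    return configs


settings = {
    'HTTPCACHE': {'enabled': True, 'expiration_secs': 60},
    'MYPIPELINE': {'user': 'blah'},
}
configs = collect_addon_configs(
    ['scrapy.addons.builtins.httpcache', 'path.to.mypipeline'], settings)
```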
Add-on callback placement
Don't mind the comment I made before; I forgot how zope.interface.Interface worked and thought it was limiting update_settings and check_configuration to be instance methods only. Then I thought about having the possibility to call those methods without instantiating the class (if those were classmethods), but I don't see any advantage over instantiating it and calling them later. I like the current callback placement.
Builtin add-ons
I really like the overall implementation minus one detail: I pictured adding components to their related component-dict settings slightly differently. I was thinking add-ons managing a single component could have type and order fields in the interface (maybe order could be a config too) that could be used for automatically inserting themselves (or their path) into the component-dict setting defined in type.
I realize these builtin add-ons sort of encapsulate the actual components; they have to add the components' paths to the settings and not themselves, but maybe the inserted path/obj could be parametrized for this particular case.
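The "type and order fields" idea could look roughly like this. It is only a sketch: the attribute names, the export_component() method, and the plain-dict settings are assumptions, not Scrapy's API (the middleware path is Scrapy's real one, used here just as an example value).

```python
class HTTPCacheProxy:
    """Sketch of a single-component add-on: 'component' is the path that
    gets inserted (not the add-on itself), 'component_type' names the
    target component-dict setting, and 'order' gives its position."""
    component = 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware'
    component_type = 'DOWNLOADER_MIDDLEWARES'
    order = 900

    def export_component(self, settings):
        # Insert the component's path into the component-dict setting
        settings.setdefault(self.component_type, {})[self.component] = self.order


settings = {}
HTTPCacheProxy().export_component(settings)
```

This keeps the encapsulation mentioned above: the add-on inserts the component's path, while the add-on object itself stays out of the component dicts.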
Forgot to comment the error handling in Crawler.crawl()
:sweat:
I don't have a strong preference here, I don't think your last implementation dealing with failures in every deferred looks that ugly. Maybe all those try/excepts could be flattened by handling callbacks manually (i.e. without inlineCallbacks
), adding the corresponding errbacks for exception handling to the deferreds, but I think your version is more readable. I like either updating the engine.stop()
method or adding a new engine.close()
too.
For monitoring pending calls, I like relying on trial to check whether the reactor is clean; it seems simpler, and I think I read this was good practice in the twisted testing guidelines. trial probably calls reactor.getDelayedCalls() for this; maybe we could call it explicitly, but I think leaving this to the testing suite is probably the best option.
About assertFailure, we tried to get rid of trial some time in the past because it wasn't fully python3 compatible. Removing the trial.TestCase inheritance wasn't possible though, because the pytest plugin we're using to mimic trial's behavior doesn't work on unittest.TestCase classes. But seeing how we're already using assertFailure in other tests, and how we probably won't be changing that inheritance now, I agree there isn't any problem using it.
A few small updates/thoughts:
I opted for a new ExecutionEngine.close() method instead of updating the stop() method. The reason is that I didn't want to remove the assert statement made in ExecutionEngine.stop() (after all, the whole idea was to clean up even if the engine wasn't running yet).
Code: close()
method and usage in Crawler.crawl()
I've also cleaned up the graceful shutdown test.
I've enhanced the base Addon class: it is now much easier to configure single components through the component, component_type, and component_key attributes and the export_component method. Thank you for the idea @curita, does that look somewhat like what you had in mind? I've also updated the built-in add-ons to make use of the new feature, and they look much tidier now.
@curita makes a very valid point about how the ADDONS dictionary is a usability step back. I like the Django-like INSTALLED_ADDONS approach and have added a corresponding method and test alongside the old methods. A few thoughts:
- A stumbling block with this approach is that it's not possible to append, not overwrite, to the INSTALLED_ADDONS setting from the command line; not quite sure how to overcome that... (users wanting extra add-ons could append in their Spider's update_settings)
- We could keep both ADDONS and INSTALLED_ADDONS, however that would result in three entry points to the add-on framework. Not very streamlined :/

@kmike mentioned in one of his first posts whether it would be possible for Spiders to also implement the add-on interface and have their methods called through the same code path. I very much like that idea but so far couldn't wrap my head around a couple of questions: Should we support config for spiders, and what would it be? Should the Spider be added to the add-on manager? Or should the add-on manager's update_settings() etc. methods have an optional keyword argument where the Spider is passed in to call its callbacks?
Now that I've coded around the add-on manager quite a bit, I think @kmike was indeed right that it's not very useful to save the used_keys; I think I'll drop them soon.
I have refactored the check_dependency_clashes()
method so that it now uses pkg_resources
. But I wonder if we can make even more use of that package?
Last, I realise that this PR is getting very large (+2,000 lines, phew) and therefore hard to review. However I don't see how it could be split up into smaller chunks. Any ideas how I could make it more clearly laid out?
I opted for a new ExecutionEngine.close() method instead of updating the stop() method. [...]
Love how clean that solution resulted, if you want you can open a separate pull request with it and its test and we can merge it right away.
I've enhanced base Addon class: It is now much easier to configure single components through the component, component_type, and component_key attributes and the export_component method. Thank you for the idea @curita, does that look somewhat like what you had in mind?
That was pretty much what I had in mind :) Not sure I'd have added the abbreviated versions of the component_types though; they seem rather cryptic for the sake of saving keystrokes, and I'd prefer users to stick with the full names. Maybe I'd have added the component order as a class attr to match the component and component_key placements (in addition to being a config option), but this isn't necessary, and it could be taken the wrong way by thinking the order can't be changed.
A stumbling block with this approach [INSTALLED_ADDONS] is that it's not possible to append, not overwrite, to the INSTALLED_ADDONS setting from the command line; not quite sure how to overcome that...
That's a prevailing problem with any list-valued setting :( I think as long as we can enable or disable add-ons using their configs (not sure if this is true for all add-ons though; maybe we could support a general enabled config in AddonManager?) it isn't a big drawback to have a fixed INSTALLED_ADDONS in settings.py listing all add-ons used across the spiders in a project. This is far from ideal (dynamically adding/removing add-ons to the list will be hard) but I think it's a reasonable compromise.
@kmike mentioned in one of his first posts if it would be possible that Spiders also implement the add-on interface and have their methods called through the same code path.
I don't recall right now any constraint on spiders being add-ons (hm, wait, calling the spider update_settings() at the same time as the other add-ons update_settings() couldn't conflict with the add-ons config? could a spider still disable another add-on this way?), but if there aren't any issues (f8cdf1a?) let's implement it :+1:
Should we support config for spiders and what would it be?
If possible, I'd like to see spider arguments mapped into the config. A neat side effect of this would be that spider arguments could be configured like any other regular setting.
Should the Spider be added to the add-on manager? Or should the add-on manager update_settings() etc. method have an optional keyword argument where the Spider is passed to call its callbacks?
The second option is the one you implemented in f8cdf1a, right? I like it; I think this distinction should be helpful at some point.
I have refactored the check_dependency_clashes() method so that it now uses pkg_resources. But I wonder if we can make even more use of that package?
Not sure myself. I thought we could have a custom Addon class for add-ons defined inside packages and get version and requires from that package using pkg_resources, but I'm not sure that's even something that should be done with it.
Last, I realise that this PR is getting very large (+2,000 lines, phew) and therefore hard to review. However I don't see how it could be split up into smaller chunks. Any ideas how I could make it more clearly laid out?
Cleaning up the commit history should be a must to get the pr fully reviewed once it's done, but I don't see how its readability could be improved any further. Your code is really well documented so don't mind its extension too much :)
If possible, I'd like to see spider arguments mapped into the config. A neat side effect of this would be that spider arguments could be configured like any other regular setting.
(Comment moved to #1442)
I think as long as we can enable or disable add-ons using their configs (not sure if this is true for all add-ons though, maybe we could support a general enabled config in AddonManager?)
calling the spider update_settings() at the same time as the other add-ons update_settings() couldn't conflict with the add-ons config? could a spider still disable another add-on this way?
I added the possibility to disable any add-on by setting its _enabled config setting to False; that should allow disabling add-ons both from the Spider's update_addons() method as well as from the command line.
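The _enabled convention boils down to a filter like the following. This is a sketch under assumed names (the helper function and the config layout are illustrative, not the PR's actual code):

```python
def enabled_addons(configs):
    """Sketch of the '_enabled' convention: the manager skips the callbacks
    of any add-on whose config sets _enabled to False; absence of the key
    means the add-on stays enabled."""
    return [name for name, cfg in configs.items() if cfg.get('_enabled', True)]


configs = {
    'httpcache': {'expiration_secs': 60},
    'someaddon': {'_enabled': False},  # disabled, e.g. from the command line
}
```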
Once again, since there is currently no guarantee in which order add-on callbacks are called, setting an add-on's config['_enabled'] = False from a second add-on may not prevent the first add-on's update_addons() method from being called. However, it just struck me that, should we settle for the INSTALLED_ADDONS list setting, the user now has the possibility to order add-ons, so maybe I could replace the AddonManager._addons dictionary with an ordered dictionary. (Also, INSTALLED_ADDONS could be a dictionary-like {'addonpath': order} setting instead of a list, just like the component settings. That would also resolve the "not easy to append instead of overwrite" problem.)
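The dictionary-valued variant sketched above would fix the load order like the component-dict settings do. A minimal illustration (function name and setting layout assumed):

```python
def load_order(installed_addons):
    """Dict-valued INSTALLED_ADDONS sketch: the order values determine the
    load sequence, just like in DOWNLOADER_MIDDLEWARES and friends."""
    return sorted(installed_addons, key=installed_addons.get)


INSTALLED_ADDONS = {
    'path.to.myaddon': 500,
    'scrapy.addons.httpcache': 0,
}
```

Appending from custom_settings or the command line then means merging another dict entry rather than replacing a whole list.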
Note to self so I don't forget to remove some code later: 2039892 made the _source config superfluous
Codecov report: merging #1272 into master will increase coverage by +0.69% (83.42%) as of 48b077b.
hm, wait, calling the spider update_settings() at the same time as the other add-ons update_settings() couldn't conflict with the add-ons config? could a spider still disable another add-on this way?)
(Comment moved to #1442)
Rebased onto #1149 and squashed the earlier commits (up until a0816fa) into groups to ease reviewing :)
Ready for review! I know it's huge, sorry :)
Rebased.
While this is a 'ready' implementation, there are some design questions that should be reconsidered.
Bold alternatives mark design changes I have implemented in later commits.
Major:
What | Current Implementation | Alternatives
---|---|---
Add-on configuration entry point | Preferably, add-ons are enabled through a tuple setting, INSTALLED_ADDONS (inspired by Django), and configured through an ADDONNAME setting. Additionally (but not preferred), add-ons can also be enabled and configured with one section per add-on in scrapy.cfg. |
In between:
What | Current Implementation | Alternatives
---|---|---
Add-ons enabling/configuring other add-ons | In an update_addons() callback (and only there), add-ons can load/enable and configure other add-ons. They cannot use the INSTALLED_ADDONS setting but must go through the add-on manager methods. |
Callback signatures | The callback signatures are quite restrictive to avoid add-on authors misusing them, i.e. the update_addons() callback only receives the add-on manager but not the settings, and the update_settings() callback gets only the settings but not the crawler or the add-on manager. |
Add-ons for builtin extensions/middlewares | For every builtin extension/middleware, there is a corresponding add-on. This add-on simply proxies the configuration into the global settings. Users can now do most of the configuration in scrapy.cfg through these proxy add-ons if they wish. |
Dependency management | Add-ons can provide information about their dependencies through provides, modifies, and requires attributes. The check_dependency_clashes() function checks these for clashes with the help of pkg_resources. |
Minor:
What | Current Implementation | Alternatives
---|---|---
Add-on manager name | AddonManager | I totally see your point @kmike, I just can't think of a better name ;) the add-on manager is very multi-purpose (enable addons, find addons, provide configuration entry point, check dependencies, call callbacks, ...). Maybe it is too monolithic?
Finding add-ons | When passing httpcache to the AddonManager.get_addon() method, it will first look for a module/object httpcache on the Python path, then for projectname.addons.httpcache, then for scrapy.addons.httpcache. This allows slightly more user-friendly addressing of add-ons. |
Interface declaration | The scrapy.interfaces.IAddon zope interface only requires the existence of a name and version attribute. |
My humble opinion on these:
Since the summer, I've changed my mind about a few things quite a bit. Given that Scrapy is a framework rather than a do-it-all product with a shiny GUI, I think the focus really should lie on consistency and versatility instead of abandoning these for user-friendliness.
I reeeaally love the ease and syntax of managing add-ons from scrapy.cfg. But I fear that allowing two configuration points will bite us in the ass at some point. Right now I tend toward completely dropping the scrapy.cfg support (after all, how many users won't be able to handle the simple Python syntax in settings.py?)
Somewhat as a consequence, I don't really see a reason to keep the built-in add-ons without scrapy.cfg
support. While the cfg
syntax would indeed be much more beautiful, within settings.py
there really isn't much advantage of this:
HTTPCACHE = {
'dir': 'blah',
'expiration_secs': 99,
}
over this:
HTTPCACHE_DIR = 'blah'
HTTPCACHE_EXPIRATION_SECS = 99
With no built-in add-ons, there also isn't much reason to keep the "prefix scrapy.addons.
" logic in get_addon()
either.
For the sake of versatility, I think all callbacks should receive the crawler object (or maybe (crawler, settings, addons) as shortcuts for developers), and that the "add-ons configuring other add-ons" logic should stay in, with a fat note to developers that they cannot use the INSTALLED_ADDONS setting but must use crawler.addons.add().
For consistency with middlewares/extensions, I think it makes sense to change the INSTALLED_ADDONS tuple into an ADDONS dictionary, with the order existent but probably not really relevant (as with EXTENSIONS). This should also make it easier to disable add-ons from the command line.
Just making sure I haven't deviated too much from the original idea while working on this (yay for tunnel vision):
The main goals of add-ons are:
ADDONS
setting (and possibly one other setting per add-on where configuration for that add-on lives), and no longer need to touch DOWNLOADER_MIDDLEWARES
and friends.

Add-ons could then be defined as follows:
Add-ons are pluggable configuration units. An add-on contains a set of configuration settings and checks. All Add-ons are enabled at a single entry point.
Everyone d'accord? :)
I'm not super-familiar with the original idea, but this description sounds perfect, +1 to have an implementation in Scrapy.
A nice table, makes it much easier to understand the PR :+1:
check_configuration methods (?) and then add helpers for common cases.

Add-on manager name - What do you think about the following?
- rename the class to Addons - it represents a set of addons, not something which just manages a set of externally maintained addons;
- rename load_dict to from_dict and make it a classmethod (an alternative constructor);
- rename load_settings to from_settings and make it a classmethod (an alternative constructor);
- rename load_cfg to from_cfg and make it a classmethod (one more alternative constructor)
and make it a classmethod (one more alternative constructor)(2-5) is very minor, to decrease usage of mutable state.
myproject.addons.Foo is better than Foo because it clearly shows that the addon is a custom one defined in this project. With setuptools entry points there is no need for scrapy.addons, and pypi packages can provide shortened addon names. But maybe we don't even need entry points; plain Python paths can be fine.

Callback signatures - I like that addons don't know about a crawler until check_configuration is called. It makes it possible to build an almost final Settings object without a Crawler (?). Are there use cases for passing a Crawler earlier?
There might be some edge cases, like needing access to the signals really early, although I don't see a concrete example right now. Another edge case might be monkeypatching the crawler methods by overwriting them... Passing the add-on manager to update_settings()
might be useful when the settings depend on other add-ons' configurations.
I don't think these outweigh the pros of being able to construct an almost final (why almost?) settings object without a crawler though.
Add-ons for builtin extensions/middlewares - the implementation doesn't look DRY - priorities and default options are currently duplicated (?). I think it makes sense to convert extensions to addons only if they have a use of addon features. Maybe short addon names is enough to justify creating addons for built-in components though. But it should be DRY.
The default options still live in default_settings.py, but the priorities are indeed duplicated. That was a preparation for being able to replace the XY_BASE settings (or the defaults in XY, depending on how we proceed with #1149) with just a default ADDONS_BASE (or ADDONS) setting, but for now it should probably be removed (we can still do that later, since the internal structure of the built-in add-ons is not public API).
Upon further thought I think the built-in add-ons should stay. After all if we promote a single configuration entry point for "functionality enhancements" by using ADDONS
over DOWNLOADER_MIDDLEWARES
and similar, we should also promote using
ADDONS = {
# ...
'httpcache': 0,
}
over
HTTPCACHE_ENABLED = True
Dependency management - it may be nice to have, but my vote is to do it later, in another PR.
+1
Add-on manager name - What do you think about the following? rename the class to Addons - it represents a set of addons, not something wich just manages a set of externally maintained addons;
I like that. Maybe AddonSet? And load_dict and similar could then be renamed to add_from_dict; that would make it look more like Python's built-in set. For ii-v, I don't see an advantage that would offset the hassle of having to merge different Addons instances.
Finding add-ons - My vote is to support 1. setuptools entrypoints and 2. python paths
+1 for supporting entry points. The scrapy.addons
prefix is purely for the built-in addons (so users can write httpcache
instead of scrapy.addons.httpcache
).
+1 to use ADDONS name instead of INSTALLED_ADDONS. Regarding priorities - hm, why don't they matter?
It's not that they don't matter at all, but they probably won't matter much, since add-ons probably won't be too cross-dependent. Just like the priorities of EXTENSIONS often don't matter.
Rebased onto #1586 and implemented the following design changes:

- Replaced INSTALLED_ADDONS tuple with ADDONS dictionary as entry point for add-on configuration
- Completely dropped support for configuring add-ons in scrapy.cfg
- BaseSettings.get_addon(): Enabling the http cache now requires ADDONS = {'scrapy.addons.httpcache': 0} instead of ADDONS = {'httpcache': 0}. This makes it consistent with the other component settings and prevents name shadowing issues

Completely dropped support for configuring add-ons in scrapy.cfg
+1
Replaced INSTALLED_ADDONS tuple with ADDONS dictionary as entry point for add-on configuration
hm, -1 if the only reason of this change is "so it is consistent with the middleware/extension settings"
Can you point me to the part of doc which describes how this cross-configuration is going to work?
generally I like the idea of being more strict, but it should be justified - what could go wrong if we give
The callback signatures are quite restrictive to avoid add-on authors misusing them
what is understood by misuse? example?
agree with @kmike https://github.com/scrapy/scrapy/pull/1272#issuecomment-152682302
also one of the current implementation points is irrelevant now that scrapy.cfg support is dropped - "Users can now do most of the configuration in scrapy.cfg through these proxy add-ons if they wish"
+1 to separate PR, agree with @kmike https://github.com/scrapy/scrapy/pull/1272#issuecomment-152682302
Honestly, I don't see a big conceptual difference between AddonsManager and Addons, as far as it doesn't change how it works :)
(2-5) is very minor, to decrease usage of mutable state.
load_cfg is already removed, so there are only load_settings and load_dict to decide. load_settings is used once during Crawler.__init__. What's the expected usage of load_dict? I see the implementation; what is the expected use case?
Enabling the http cache now requires ADDONS = {'scrapy.addons.httpcache': 0} instead of ADDONS = {'httpcache': 0}
+1 to remove damn magic
@jdemaeyer can you explain what's the problem with current implementation?
Thanks a lot for the feedback, you guys!
Replaced INSTALLED_ADDONS tuple with ADDONS dictionary as entry point for add-on configuration
hm, -1 if the only reason of this change is "so it is consistent with the middleware/extension settings"
It's basically the same rationale with which ITEM_PIPELINES became a dictionary and is no longer a list. I have to admit that I still like the idea of ADDONS being a tuple/list though. As the "update-not-replace" behaviour has been revoked (#1586), lists/tuples are again no harder to update from custom_settings or the command line than dicts. On the other hand, being able to disable components by setting their dictionary value to None comes in handy at times.
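The None-disables-an-entry behaviour described above can be sketched like this (the helper is illustrative; the dict layout mirrors Scrapy's component settings):

```python
def enabled_addons(addons_setting):
    """ADDONS-as-dict sketch: a None value disables an entry (handy from
    custom_settings or the command line); other values are priorities,
    as with EXTENSIONS."""
    enabled = {path: prio for path, prio in addons_setting.items()
               if prio is not None}
    return sorted(enabled, key=enabled.get)


ADDONS = {
    'scrapy.addons.httpcache': 0,
    'path.to.otheraddon': None,  # disabled without removing the key
}
```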
Add-ons enabling/configuring other add-ons
Can you point me to the part of doc which describes how this cross-configuration is going to work?
Here's an example: https://github.com/jdemaeyer/scrapy/blob/enhancement/addons/docs/topics/addons.rst#enabling-and-configuring-add-ons-within-python-code
Callback signatures
generally I like the idea of being more strict, but it should be justified - what could go wrong if we give what is understood by misuse? example?
I don't think there's much that could really "go wrong"; it's more a notion of guiding people. E.g. in update_addons(), they should not make decisions based on setting values, as these setting values are still subject to change (e.g. in other add-ons' update_settings()).

Not putting the crawler into the callback signature has the advantage pointed out by Mike above (being able to construct a final settings object without needing a crawler).
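The restricted two-phase signatures discussed here can be sketched as follows. Everything is illustrative: the add-on class, the manager's add() API, and the stub are assumptions, not the PR's actual interfaces.

```python
class MyAddon:
    """Sketch: update_addons() sees only the manager (settings are still in
    flux), update_settings() sees only the settings (no crawler yet)."""
    name = 'myaddon'
    version = '1.0'

    def update_addons(self, config, addons):
        # Enable another add-on through the manager, never via settings
        addons.add('path.to.helperaddon', {})

    def update_settings(self, config, settings):
        settings['CONCURRENT_REQUESTS'] = config.get('concurrency', 16)


class StubManager:
    """Minimal stand-in for the add-on manager, for illustration only."""
    def __init__(self):
        self.added = {}

    def add(self, path, config=None):
        self.added[path] = config or {}


manager, settings = StubManager(), {}
addon = MyAddon()
addon.update_addons({}, manager)
addon.update_settings({'concurrency': 8}, settings)
```

Because update_settings() receives no crawler, the whole settings object can be assembled before a Crawler exists, which is the advantage mentioned above.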
Add-ons for builtin extensions/middlewares
I've removed the repetition of the default settings. Most of the built-in add-ons only provide a second way to configure the built-in extensions; I think we should drop these. There are some that will overwrite default settings when activated, e.g. enabling the httpcache add-on will automatically set HTTPCACHE_ENABLED = True. What do you guys think about these? I think they make sense when we want to push users towards enabling extra functionality (built-in or not) through the ADDONS setting.
what's expected usage of load_dict? I see implementation, what is expected usecase?
It's a low-maintenance convenience helper for working with the AddonManager. There's no use in Scrapy itself, but I've used it in some of the tests.
Interface declaration
can you explain what's the problem with current implementation?
It's a minor detail really, I just wanted the list to be complete ;) I don't want to enforce that add-ons have an update_settings() method (they might not need one after all). But when they have one, I want to enforce our call signature. However, I don't think that's possible with zope.interface right now.
Rebased onto master, this PR now once again contains no commits from other PRs. (Finally the green checkmark is back :))
this is so huge, is there some way to save pieces of this work by splitting this into more smaller PR-s? e.g. I'm looking forward to changes in load_object, is it ok if I steal it from here (cherry-pick) and create new PR with that? https://github.com/scrapy/scrapy/pull/1272/files#diff-2536f98aac2c474ec1f424c8a943b9d2
Based on ~~#1149 and #1423~~ ~~#1586~~ master

Closes #591 and #1215
Implementation of SEP-021 (updated version from this PR, original version)
Design decision to-do here
Old to-do:

- INSTALLED_ADDONS setting and additional support for scrapy.cfg (so far)
- AddonManager?
- Addon class
- scrapy.cfg
- get_project_path() robustness (done partially, still fails when settings module does not live in project's root)
- ~~Add version support~~ Use pkg_resources' extensive dependency management
- ~~Should pkg_resources be used throughout the manager and not just within the dependency checking?~~ Will go to separate PR if we decide to
- PIPELINES (check #1215)
- ~~Move settings for built-in components from global namespace to add-ons (is this possible without making them harder to configure?)~~ Allow setting global namespace settings via add-on framework (update_settings())
- Addon class
- cfg files
- check_configuration()
- AddonManager, then use autodoc in docs/topics/api.rst
- _addon attribute, more versatile examples (using _addon, module as add-on)