scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Add-ons #1272

Closed jdemaeyer closed 9 months ago

jdemaeyer commented 8 years ago

Based on #1149 and #1423 #1586 master

Closes #591 and #1215

Implementation of SEP-021 (updated version from this PR, original version)

Design decision to-do here

Old to-do:


jdemaeyer commented 8 years ago

I think one of the major open questions is where the add-on manager should live. Here's an excerpt from my proposal on that (I have opted for the inside of crawler option for the SEP so far):

Integration into existing start-up process

The question where the add-on manager should be instantiated and when the add-on callbacks should be called is not as obvious as it may seem at first glance. Outlined below are three options and their rationale:

Outside of crawler

Besides the different priority and level of direct user involvement, settings made by the add-on callbacks are not different from settings made in settings.py. The callbacks should therefore be called around the same time that the settings module is loaded. This puts the instantiation of an AddonManager instance outside of the Crawler instance.

Given that the project settings are read in the get_project_settings() function in scrapy.utils.project, this seems a reasonable place to call AddonManager.update_settings(). However, we cannot instantiate the add-on manager within this function, as the function is left (and the manager would therefore be lost) long before the crawler becomes ready (when we wish to call the second round of add-on callbacks).

There are two possible approaches to instantiating the add-on manager outside of get_project_settings():

  1. In a normal Scrapy start via the command-line tool, the call to get_project_settings() is embedded in the execute() function in scrapy.cmdline. In summary, the changes to this function would be (with analogous changes in scrapy/conf.py, where necessary, for backwards compatibility):
    1. Instantiate add-on manager before calling get_project_settings()
    2. Pass add-on manager to get_project_settings() when calling it (the function then calls update_settings()).
    3. Connect the manager's check_configuration method to the engine_started signal (this could also be done in the add-on manager's __init__() method)
  2. Alternatively, we could (grudgingly) introduce another singleton to Scrapy (e.g. scrapy.addonmanager.addonmanager). This would allow moving the add-on management code above into the more appropriate get_project_settings() function.

Integrating add-on management outside of the crawler ensures that settings management, except for spider-specific settings, remains within a rather small part of the start-up process (before instantiating a Crawler): the helper function get_project_settings() indeed keeps returning the full set of project (i.e. non-spider) settings. The downside is that it either introduces a new singleton or clutters up a function (execute()) that should be more about parsing command line options and starting a crawler process than about loading add-ons and settings.
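Roughly, option 1 could look like this (a sketch only; AddonManager and the helper function are illustrative, not the actual patch):

from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

def start_crawler_with_addons(addonmgr, spidercls):
    # 1. the add-on manager exists before the settings are compiled
    settings = get_project_settings()
    # 2. first round of callbacks, at roughly the same time settings.py is
    #    loaded (in the proposal, get_project_settings() would receive the
    #    manager and do this itself)
    addonmgr.update_settings(settings)
    crawler = Crawler(spidercls, settings)
    # 3. final checks once the engine has started
    crawler.signals.connect(
        lambda **kwargs: addonmgr.check_configuration(crawler),
        signal=signals.engine_started)
    return crawler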

Inside of crawler

The settings used to crawl are not complete until the spider-specific settings have been loaded in Crawler.__init__(). Add-on management could follow this approach and only start loading add-ons when the crawler is initialised.

Instantiation and the call to AddonManager.update_settings() would happen in Crawler.__init__(). The final checks (i.e. the callback to AddonManager.check_configuration()) could then again either be tied to the engine_started signal, or coded into the Crawler.crawl() method after creating the engine.
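A rough sketch of this variant (illustrative only, not the actual diff):

class Crawler(object):
    def __init__(self, spidercls, settings):
        self.settings = settings
        self.addons = AddonManager()            # hypothetical manager class
        self.addons.update_settings(settings)   # first round of callbacks
        # ... existing initialisation ...

    def crawl(self, *args, **kwargs):
        # ... spider creation ...
        self.engine = self._create_engine()
        # final checks now that the engine exists; alternatively this could
        # be connected to the engine_started signal instead
        self.addons.check_configuration(self)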

Integrating add-on management inside of the crawler avoids introducing a new singleton or cluttering up the execute() function, but splits the compilation of the complete configuration settings across several places. This is especially critical since many of the settings previously made in settings.py will move to scrapy.cfg with the implementation of this proposal, and it may prove backwards-incompatible since the get_project_settings() helper function no longer works as expected (and as its name suggests).

kmike commented 8 years ago

Hey @jdemaeyer,

I've read your proposal and the SEP for the first time, so please excuse me if I'm missing some context :)

  1. Is it a coincidence that scrapy.Spider implements update_settings method which addons will implement?
  2. Addons are utilities to auto-update and check Settings objects? Why do they need Crawler?
  3. How can user activate an addon when crawling is started by CrawlerRunner or CrawlerProcess?
  4. How can user enable an addon if there is no Scrapy project?
  5. How to enable an addon if scrapy runspider is used to start a spider?
  6. Why is scrapy.cfg needed? It is confusing to have two config files, and it is an unnecessary boilerplate if you don't have a Scrapy project.
  7. In the SEP it is said that MiddlewareManager.from_settings() will be changed to allow python objects instead of class paths. Why should it be specific to MiddlewareManager? (see discussion at https://github.com/scrapy/scrapy/pull/1215).
  8. Do you think it is possible to make the Scrapy add-ons feature less framework-ish? In the current proposal Scrapy calls various AddonManager methods during the standard scrapy crawl init sequence, and AddonManager calls various addon methods itself. Can we make it so that users can also use the AddonManager (or an individual addon) to update or check settings, without invoking all the scrapy crawl machinery?
  9. It looks like the proposed AddonManager does two unrelated things:

    • it loads addon classes from scrapy.cfg;
    • it uses loaded addons to update settings and check crawler.

    IMHO it is better to separate those. Scrapy users who don't use scrapy crawl command may prefer to import addon classes themselves, or these classes can be in the same module as CrawlerProcess call and a spider, all in the same self-contained Python script.

// a rant: I think it is usually a bad idea to name something ...Manager :) "Manager" tells us that a class "manages" something. It is quite broad, so 1) often such classes end up doing several unrelated things and 2) it is not clear what the class is doing just from the class name - it can do anything. In Scrapy we renamed SpiderManager to SpiderLoader because of that. It turns out this is not only my opinion.

jdemaeyer commented 8 years ago

Hey @kmike, thanks for the feedback!

1 . Is it a coincidence that scrapy.Spider implements update_settings method which addons will implement?

That is on purpose. Spider.update_settings() and the add-on update_settings() both do roughly the same, i.e. update settings outside of settings.py, albeit with a different level of "generality"

2 . Addons are utilities to auto-update and check Settings objects? Why do they need Crawler?

Add-ons should create a plug-and-play experience for the user. To check that everything "Works As Advertised™", and to provide help when it doesn't, they need to know the actual state of the crawler, not just what its state is supposed to be according to the configuration. E.g. an extension listed in settings['EXTENSIONS'] could have thrown NotConfigured. Add-ons can only detect this if they have access to the crawler.
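To make that concrete, here is a minimal add-on sketch (the callback signatures are assumptions, and MyCache is a made-up extension): update_settings() can only ask for the extension, while check_configuration() can verify it actually survived start-up:

class MyCacheAddon(object):
    name = 'mycache'
    version = '1.0'

    def update_settings(self, config, settings):
        # request the extension; whether it actually loads is decided later
        settings.set('EXTENSIONS', {'myproject.extensions.MyCache': 500})
        settings.set('MYCACHE_DIR', config.get('dir', '.cache'))

    def check_configuration(self, crawler):
        # only here can we see the crawler's real state, e.g. whether the
        # extension raised NotConfigured and silently dropped out
        loaded = [type(ext).__name__ for ext in crawler.extensions.middlewares]
        if 'MyCache' not in loaded:
            raise RuntimeError("MyCache extension was not loaded")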

3 . How can user activate an addon when crawling is started by CrawlerRunner or CrawlerProcess?

Hm, don't both of these instantiate Crawler objects, where the add-on manager would then be instantiated and load add-ons? Then again, I guess it comes down to whether a scrapy.cfg exists in these use cases (also see the answer to 6).

4 . How can user enable an addon if there is no Scrapy project? 5 . How to enable an addon if scrapy runspider is used to start a spider?

That would not be possible with the current suggested implementation. Just like you cannot have a settings.py file or use scrapyd-deploy in these cases. It could be made possible if add-on configuration were to live in settings using per-spider settings. I'll put the (current) rationale for scrapy.cfg below.

6 . Why is scrapy.cfg needed? It is confusing to have two config files, and it is an unnecessary boilerplate if you don't have a Scrapy project.

I opted for scrapy.cfg for two reasons:

  1. Its syntax is more user-friendly for uninitiated users
  2. It makes it easy to keep settings local through the ini sections. E.g. something like this:

     [mysql-pipe]
     server = some.server
     port = 1234
     user = a
     password = b
    
     [mongodb-pipe]
     server = some.other.server
     user = c
     password = d

    is much easier in scrapy.cfg than in settings.py, and would not have to expose anything into the settings object, avoiding possible name clashes.

I think if done properly, users won't have to touch settings.py for any of the shipped add-ons at all (instead they could set, say, HTTPCACHE_DIR in scrapy.cfg. We could even tidy up the settings namespace this way). Those users that then need to touch settings.py will most probably know what they're doing anyway.
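To illustrate the 'keep it local' point, a hypothetical mysql-pipe add-on could consume its section itself and never publish the credentials as global settings (MySQLPipeline and the callback signature are made up; registering an instance relies on the 'allow instances' change discussed below):

class MySQLPipeAddon(object):
    name = 'mysql-pipe'
    version = '1.0'

    def update_settings(self, config, settings):
        # build the pipeline from the local [mysql-pipe] section ...
        pipeline = MySQLPipeline(
            server=config['server'],
            port=int(config.get('port', 3306)),
            user=config['user'],
            password=config['password'],
        )
        # ... and register the instance directly, no MYSQL_* settings needed
        settings.set('ITEM_PIPELINES', {pipeline: 300})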

7 . In the SEP it is said that MiddlewareManager.from_settings() will be changed to allow python objects instead of class paths. Why should it be specific to MiddlewareManager? (see discussion at #1215).

Uh, I didn't see #1215 before, that looks good. The SEP goes a little further than the PR though: It suggests to allow not only classes, but instances. This will allow add-ons to instantiate components and keep settings as local as possible. Objects retrieved from X = load_object() are (so far) always assumed to be classes and then instantiated with y = X(). That should probably stay that way, and it is a good question where the "allow instances" code could live so it is as generic as possible.
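One generic shape that "allow instances" logic could take (a sketch; where it should live is exactly the open question):

from scrapy.utils.misc import load_object

def build_component(objref, *args, **kwargs):
    """Accept a dotted path, a class, or an already-built instance."""
    if isinstance(objref, str):
        objref = load_object(objref)      # path -> class (or other object)
    if isinstance(objref, type):
        return objref(*args, **kwargs)    # class -> instance
    return objref                         # already an instance, pass through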

8 . Do you think it is possible to make the Scrapy add-ons feature less framework-ish? In the current proposal Scrapy calls various AddonManager methods during the standard scrapy crawl init sequence, and AddonManager calls various addon methods itself. Can we make it so that users can also use the AddonManager (or an individual addon) to update or check settings, without invoking all the scrapy crawl machinery?

Hm, I'd say it is already quite modular. The user could in principle instantiate their own AddonManager and hand it the .cfg file. Or they could load an add-on object and call its update_settings() method directly.
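E.g., roughly (reusing the MyCacheAddon sketch from above, with its assumed signature), without touching scrapy.cfg or the crawl machinery at all:

from scrapy.settings import Settings

settings = Settings()
addon = MyCacheAddon()   # any object providing the add-on interface
addon.update_settings({'dir': '/tmp/cache'}, settings)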

9 . It looks like the proposed AddonManager does two unrelated things:

  • it loads addon classes from scrapy.cfg;
  • it uses loaded addons to update settings and check crawler. IMHO it is better to separate those. Scrapy users who don't use scrapy crawl command may prefer to import addon classes themselves, or these classes can be in the same module as CrawlerProcess call and a spider, all in the same self-contained Python script.

That is a good point. The Manager idea arose from the #591 discussion, but now that I come to think of it, splitting it up seems to make more sense. Also, it would avoid having a "...Manager"-named class :). The loaded add-ons need to live somewhere between loading <-> updating <-> checking, though. With the current SEP, that could be in an attribute of Crawler (much like extensions).

Maybe like this?

I'm still quite unhappy about the scrapy.cfg/settings.py issue. It is indeed annoying overhead to have two configuration files. But then, I think it makes sense that add-ons and their configuration are not present in any Settings instance, and the scrapy.cfg syntax is very much suited for user-friendliness and local settings.

kmike commented 8 years ago

Hi @jdemaeyer,

That is on purpose. Spider.update_settings() and the add-on update_settings() both do roughly the same, i.e. update settings outside of settings.py, albeit with a different level of "generality"

Hm, do you think we can use exactly the same code path for Spider then? I.e. maybe a spider can implement the addon interface itself? I'm not sure when it would be useful, but why not :) This way a spider will be able not only to update settings, but also to check them.
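Something like this, perhaps (whether and how the crawler would call check_configuration() on spiders is exactly what is undecided here):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    @classmethod
    def update_settings(cls, settings):
        # existing Spider hook: applies custom_settings, plus anything extra
        super(MySpider, cls).update_settings(settings)
        settings.set('HTTPCACHE_ENABLED', True, priority='spider')

    def check_configuration(self, crawler):
        # add-on style check of the final crawler state
        if not crawler.settings.getbool('HTTPCACHE_ENABLED'):
            raise RuntimeError('HTTP cache was disabled by another component')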

It makes it easy to keep settings local through the ini sections. E.g. something like this: (... snip ...) is much easier in scrapy.cfg than in settings.py, and would not have to expose anything into the settings object, avoiding possible name clashes.

I don't think it is a problem to separate sections in settings.py. Django-like approach works:

MYSQL_PIPE = dict(
    server='some-server',
    port=1234,
    user='foo',
    password='bar'
)

Maybe like this? ...

Yeah, your approach looks good to me.

Some questions:

Why are { 'addon name': (<addon object>, <addon config>) } dicts needed, what are keys ('addon names') for? Can it be a list of (<addon class>, <addon config>) pairs, or just a list of instantiated addon objects?

I'm still quite unhappy about the scrapy.cfg/settings.py issue. It is indeed annoying overhead to have two configuration files. But then, I think it makes sense that add-ons and their configuration are not present in any Settings instance, and the scrapy.cfg syntax is very much suited for user-friendliness and local settings.

What's the problem with having an ADDONS list in settings.py? It means one less config file and more flexibility - addons will be able to disable or enable or check other addons. This way you also avoid changing the Crawler API and other related APIs - no need to add an 'addons' argument. I think that two config files are less user-friendly than one config file. By default settings.py may contain only this ADDONS option.

jdemaeyer commented 8 years ago

I've written the add-on interface spec and two add-on loaders, one loading from scrapy.cfg and one loading from settings.py.

Is there a way I can retrieve the project module name? I could use the first part of os.environ['SCRAPY_SETTINGS_MODULE'], but it seems to me that there should be a cleaner way.

Again, thanks for the feedback @kmike!

Hm, do you think we can use exactly the same code path for Spider then?

I think that should be possible, but I'm not sure how to enforce that the update_settings() method is not called twice.

I don't think it is a problem to separate sections in settings.py. Django-like approach works:

One problem with the current SEP is that the add-on name (= variable name) is allowed to be a python path to the add-on, but python does not allow dots in variable names. I have therefore allowed a _name variable in the settings.py dicts for now.

Why are { 'addon name': (<addon object>, <addon config>) } dicts needed, what are keys ('addon names') for? Can it be a list of (<addon class>, <addon config>) pairs, or just a list of instantiated addon objects?

The keys are simply whatever the addon was called in scrapy.cfg/settings.py. I'm not yet fully sure whether we can drop it this early, so I'd prefer to leave it in. It cannot be a list of instantiated add-on objects with the current SEP because we also allow add-ons to be modules right now (i.e. they cannot be instantiated).

I don't really like the idea of exposing add-ons in the global settings, because they are essentially "one level above it": add-ons fiddle with settings and extensions, but never the other way around. That's why I've designed the settings module add-on loader to only load variables starting with addon_ (in lower case, so they're not picked up by Settings.setmodule()) so far. It would be cool to allow add-ons to load/configure other add-ons, though. In that case an add-on manager could become useful again...
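The settings-module loader sketched out (illustrative only; Settings.setmodule() only picks up upper-case names, which is why the lower-case addon_ prefix keeps these out of the global settings):

def load_addon_configs(settings_module):
    """Collect lower-case addon_* variables from a project settings module."""
    configs = {}
    for name, value in vars(settings_module).items():
        if name.startswith('addon_') and isinstance(value, dict):
            config = dict(value)
            # the add-on path may be spelled out via the special _name key
            addon_path = config.pop('_name', name[len('addon_'):])
            configs[addon_path] = config
    return configs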

jdemaeyer commented 8 years ago

After some more coding, I've decided to again merge the AddonLoader and what I called the AddonHolder into an AddonManager. There are just too many cross-dependencies: I wanted the holder to still be able to load add-ons last minute (allowing add-ons to load other add-ons, though I'm not exactly sure whether we want to allow this) and check dependency issues, and at the same time get rid of the confusing { used_key: (addon, configuration) } structure of the loader.

Now, the AddonManager provides two class methods that the user may use when she doesn't want to use the 'holding' part, and provides convenience methods to add add-ons in case she doesn't want to use the 'loading' part.

I have integrated the add-on system into Scrapy's start-up process (it currently reads from both scrapy.cfg and the settings module), but did no testing of a "fully integrated" system so far.

One issue currently in my head is: Do we allow add-ons to load/configure other add-ons? I have not fully thought through the implications of this. The main problem I see is that at the moment the first callback of an add-on is called (update_settings()), other add-ons might already have been called. I could see this work if we require update_settings() to not do any settings introspection/dependency checking/interaction with other add-ons whatsoever. Being able to load/configure an add-on at this point could be useful for "umbrella add-ons" that load other add-ons depending on their configuration.

jdemaeyer commented 8 years ago

I've put some thought into the "add-ons loading/configuring other add-ons" issue. The big problem with the current implementation is that the first time code of an add-on is executed is during the first round of callbacks to update_settings(). Should an add-on load or reconfigure another add-on here, other add-ons might already have been called. While it is possible to ensure that the update_settings() method of the newly added add-on is called, there is no guarantee (and in fact, it is quite unlikely) that all add-ons see the same add-on configuration in their update_settings().

I see three possible approaches to this:

  1. Forbid add-ons from loading or configuring other add-ons. In this case 'umbrella add-ons' would not be possible and all cross-configuration dependencies would again be burdened onto the user.
  2. Forbid add-ons to do any kind of settings introspection in update_settings(), and instead only allow them to make changes to the settings object or load other add-ons. In this case, configuring already enabled add-ons should be avoided, as there is no guarantee that their update_settings() method has not already been called.
  3. Add a third callback, update_addons(config, addonmgr), to the add-on interface. Only loading and configuring other add-ons should be done in this method. While it may be allowed, developers should be aware that depending on config (of their own add-on, i.e. the one whose update_addons() is currently called) is fragile, as once again, there is no guarantee in which order add-ons will be called back and whether some other add-on will want to touch config.

Currently I have implemented option 2. I have not quite made up my mind about it just yet, but I think option 3 is my favourite for now.
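For reference, an "umbrella add-on" under option 3 could look roughly like this (update_addons() and the addonmgr.add() signature are part of the proposal, not settled API):

class ShopStackAddon(object):
    name = 'shopstack'
    version = '1.0'

    def update_addons(self, config, addonmgr):
        # enable and configure the add-ons this bundle builds on
        addonmgr.add('path.to.mysqlpipe', {'server': config.get('server')})
        if config.get('with_cache'):
            addonmgr.add('httpcache', {'expiration_secs': 3600})

    def update_settings(self, config, settings):
        # per the rules above: plain settings changes only, no introspection
        settings.set('CONCURRENT_REQUESTS', 8)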

jdemaeyer commented 8 years ago

I'm trying to write a test that ensures the crawler breaks down when an add-on's check_configuration() callback raises an exception, but I don't know enough about Twisted to understand why it doesn't work :/

Here's the test, put into the CrawlTestCase in tests/test_crawl.py:

    @defer.inlineCallbacks
    def test_abort_on_addon_failed_check(self):
        class FailedCheckAddon(Addon):
            NAME = 'FailedCheckAddon'
            VERSION = '1.0'
            def check_configuration(self, crawler):
                raise ValueError
        addonmgr = AddonManager()
        addonmgr.add(FailedCheckAddon())
        crawler = get_crawler(SimpleSpider, addons=addonmgr)
        yield self.assertFailure(crawler.crawl(), ValueError)

And this is the output I get from tox:

====================================================== FAILURES ======================================================
___________________________________ CrawlTestCase.test_abort_on_addon_failed_check ___________________________________
NOTE: Incompatible Exception Representation, displaying natively:

DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x7ff8a481a950 [59.7946248055s] called=0 cancelled=0 LoopingCall<60>(Downloader._slot_gc, *(), **{})()>

I've also tried assertRaises, that tells me:

FailTest: ValueError not raised (<Deferred at 0x7fea3d2b3e60 current result: <twisted.python.failure.Failure <type 'exceptions.ValueError'>>> returned)
curita commented 8 years ago

The twisted error when testing check_configuration() is a little tricky, and it's actually something that wasn't handled before.

====================================================== FAILURES ======================================================
___________________________________ CrawlTestCase.test_abort_on_addon_failed_check ___________________________________
NOTE: Incompatible Exception Representation, displaying natively:

DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x7ff8a481a950 [59.7946248055s] called=0 cancelled=0 LoopingCall<60>(Downloader._slot_gc, *(), **{})()>

This output means there's a pending call that wasn't stopped properly. In this case it's a twisted LoopingCall started by the Downloader. The downloader is stored in the engine, so that call's reference can be accessed in a crawler as crawler.engine.downloader._slot_gc_loop.

The engine is instantiated in Crawler.crawl(). The problem is that the call is already running when check_configuration() is called and fails in FailedCheckAddon, but there is no place that addresses that and closes the downloader (which stops the looping call) properly. Not sure why that wasn't handled for errors happening after creating the engine; I guess it's not noticeable unless you run tests that tell you about it.

So, there are two ways of fixing this. The first one is to move the call to addons.check_configuration() above the instantiation of the engine. This would fix the test but doesn't help solve the underlying problem. The second approach would be to close the downloader if anything raises after the engine creation. Something like:

try:
    self.spider = self._create_spider(*args, **kwargs)
    self.engine = self._create_engine()
    self.addons.check_configuration(self)
    ...
except Exception:
    self.crawling = False
    if self.engine is not None:
        self.engine.downloader.close()
    raise

should be enough. If you want to, you can implement this fix in this PR (or in another one), or we can implement it at some other time and follow the first approach for now.

Last thing about the test: I'd refrain from using self.assertFailure() if possible; it's something present in trial but not in the builtin unittest, and it can be avoided. This should work too:

@defer.inlineCallbacks
def test_abort_on_addon_failed_check(self):
    class FailedCheckAddon(Addon):
        NAME = 'FailedCheckAddon'
        VERSION = '1.0'
        def check_configuration(self, crawler):
            raise ValueError
    crawler = get_crawler(SimpleSpider)
    crawler.addons.add(FailedCheckAddon())
    with self.assertRaises(ValueError):
        yield crawler.crawl()
curita commented 8 years ago

I think there are some things to discuss about the design; I took note of some questions and suggestions I had so we can talk them through.

There are some concerns about the usage of the scrapy.cfg file to set the addons configuration. I agree here with @kmike that we need to support using the current settings for a couple of reasons:

Both options, using the current settings and the scrapy.cfg file, could coexist, but if there is any complication in keeping both, we should favor the first one.

On a related note, I'm not sure it's a good idea to discard _ENABLED settings (read about discarding them in the SEP), especially if we're keeping the possibility of dynamic configuration. Users may want to disable a component temporarily without deleting or commenting out the whole .cfg section.

Some miscellaneous comments in the addons interface:

jdemaeyer commented 8 years ago

Phew, okay, improving the error handling in Crawler.crawl() took me an awful lot of time and I learned quite a bit about Deferreds and Scrapy's architectural details. I'm still really unhappy, so here's what I have, up for discussion. ;) (Sorry for the wall of text, I'm doing this also to sort out my thoughts.)

So I put in the

if self.engine is not None:
    self.engine.downloader.close()

in the Crawler.crawl() except clause as you suggested, and the test I had worked fine. But of course I wanted to introduce that code The Right Way™ so I wrote another test that checks whether the crawler shuts down gracefully when there's any error during the crawl startup process.

The test is in tests.test_crawl.CrawlTestCase.test_graceful_crawl_error_handling. I mocked ExecutionEngine.start to throw an exception and asserted that Crawler.crawling == False and that the downloader's close method has been called (although I couldn't figure out how to properly wrap the mock around the original method):

    @defer.inlineCallbacks
    def test_graceful_crawl_error_handling(self):
        crawler = get_crawler(SimpleSpider)
        class TestError(Exception):
            pass
        from scrapy.core.downloader import Downloader
        Downloader._orig_close = Downloader.close
        def close_downloader(*args, **kwargs):
            # Ugly work-around since I couldn't get supplying
            # wraps=Downloader.close to the mock to work...
            Downloader._orig_close(crawler.engine.downloader)

        with mock.patch('scrapy.crawler.ExecutionEngine.start') as mock_es, \
             mock.patch.object(Downloader, 'close') as mock_dc:
            mock_es.side_effect = TestError
            mock_dc.side_effect = close_downloader
            with self.assertRaises(TestError):
                yield crawler.crawl()
            self.assertFalse(crawler.crawling)
            self.assertTrue(mock_dc.called)

Now I got a different delayed call remaining after the test: <DelayedCall 0x7f069902cab8 [59.7916400433s] called=0 cancelled=0 LoopingCall<60.0>(LogStats.log, *(<SimpleSpider 'simple' at 0x7f06992bbd90>,), **{})(). I figured (after a while) that I needed to close the spider through the engine. However, as ExecutionEngine.close_spider() also closes the downloader, I need a case separation for the graceful shutdown. So here's what it looks like now:

        try:
            self.spider = self._create_spider(*args, **kwargs)
            self.engine = self._create_engine()
            self.addons.check_configuration(self)
            start_requests = iter(self.spider.start_requests())
            yield self.engine.open_spider(self.spider, start_requests)
            yield defer.maybeDeferred(self.engine.start)
        except Exception:
            self.crawling = False
            if self.engine is not None:
                if self.engine.spider is not None:
                    self.engine.close_spider(self.spider)
                else:
                    self.engine.downloader.close()
            raise

Both tests pass with this, but I think there are some issues. First, the except clause shouldn't have to figure out in which line the error occurred. This is probably better (exception-handling) style but quite ugly:

        try:
            self.spider = self._create_spider(*args, **kwargs)
            self.engine = self._create_engine()
            try:
                self.addons.check_configuration(self)
                start_requests = iter(self.spider.start_requests())
                yield self.engine.open_spider(self.spider, start_requests)
            except Exception:
                self.engine.downloader.close()
                raise
            try:
                yield defer.maybeDeferred(self.engine.start)
            except Exception:
                self.engine.close_spider(self.spider)
                raise
        except Exception:
            self.crawling = False
            raise

Maybe ExecutionEngine could get a new close() method that does the logic and decides whether it needs to stop the engine, close the spider, or close the downloader (or the stop() method could be updated)? Then, there's the test itself. It should test behaviour and not implementation: I don't really care whether Downloader.close() was called, I just want to know whether the reactor is clean. But the downloader doesn't have a status attribute or something. I could check whether the Downloader._slot_gc_loop was stopped, but again that would check implementation not behaviour.

Maybe it's best to remove the Downloader.close() mock and rely on twisted.trial to check whether the reactor is clean after the error shutdown?
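For the ExecutionEngine.close() idea above, the decision logic could be roughly this (method body only; a sketch of the idea, not necessarily what ends up merged):

from twisted.internet import defer

@defer.inlineCallbacks
def close(self):
    """Shut down gracefully, whatever the engine managed to start so far."""
    if self.running:
        yield self.stop()                      # fully started: normal stop
    elif self.open_spiders:
        yield self.close_spider(self.spider, reason='shutdown')
    else:
        self.downloader.close()                # only the downloader exists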

jdemaeyer commented 8 years ago

Last thing about the test: I'd refrain from using self.assertFailure() if possible; it's something present in trial but not in the builtin unittest, and it can be avoided. This should work too:

Hm, why do we subclass trial's TestCase but then refrain from using its methods? I have (temporarily?) switched back to assertFailure because the Twisted version in Precise does not support using assertRaises as a context manager, and replacing trial's TestCase with the built-in one lets the tests pass but warns me that LogCapture instances are not uninstalled by shutdown (and I wasn't sure how bad that is ;))

jdemaeyer commented 8 years ago

Hey everyone, thanks again for the feedback (especially for the full review @curita :)), I really appreciate it. Here's my two cents on some of the design issues (once again, caution, wall of text):

Add-on configuration entry point

After I had to discuss the add-on management in a blog post a couple of days ago, I have backpedaled a little on my preference. scrapy.cfg is a place for those settings that are not necessarily bound to any particular project (if I have multiple projects I probably still want to upload them to the same scrapyd server). Add-on configuration is (most commonly) bound to a project, much like other crawling settings, and I agree that it should therefore live inside the project. If we don't want to introduce a new settings file (which I think would be a bad idea), that leaves us with the project settings module.

I see two possible syntaxes for this:

  1. Each add-on has its own variable, like it is implemented right now, and is not exposed into the global settings:

    addon_MYPIPELINE = dict(
       _name = 'path.to.mypipeline',
       user = 'blah',
       ...)
    addon_MYEXT = dict(...)
  2. There is an ADDONS dictionary setting that lives among all other global settings:

    ADDONS = {
       'path.to.mypipeline': {
           'user': 'blah',
           ...
       },
       'path.to.myext': {...},
    }
    
    # or
    
    ADDONS = {
       'path.to.mypipeline': dict(
           user = 'blah',
           ...
       ),
       'path.to.myext': dict(...),
    }

The first approach does not mix settings and add-ons, following the 'add-ons are one level above settings' philosophy I had in mind earlier. However, the syntax is not very user-friendly, particularly having to use the _name key. The second approach puts add-ons onto the same level as the settings they are intended to fiddle with. I don't see an immediate drawback of this, but it strikes me as odd design. I also find the nested dictionaries, like those used for logging.config.dictConfig(), a little tedious. The user experience would be much more cumbersome than what we had initially hoped for. The big big plus is that it would integrate really well with custom Spider settings.

Now the user might want to enable an add-on system-wide, and I think we could keep the support for scrapy.cfg add-on configuration for this use case only. However, the docs should emphasise that this configuration does not belong to a project, and is therefore not deployed.

Add-ons without a project

I agree that there should be a way to enable and configure add-ons for stand-alone spiders. That would be no problem at all when using syntax 2 from above, so I guess that is what it will boil down to. We could keep syntax 1 if we introduced some new mechanism to enable/configure add-ons from within a spider, but that seems unnecessarily confusing. And frankly, I think syntax 1 is just as ugly as syntax 2, if not a little uglier. ;D

Local configuration

First I want to point out that the (local) add-on configuration is not inaccessible to other add-ons or components. All add-on configurations are available to all components via the crawler.addons.configs dictionary. They just no longer clog up the global settings namespace. That being said, for backwards compatibility we should definitely keep the global settings for all builtin components. The idea I have for the builtin add-ons right now is that they simply take their configuration and expose it into the global settings namespace, much like in scrapy.addons.Addon.update_settings(). The components themselves won't be touched at all. This would allow users to configure built-in add-ons/components through both the new (add-on config) and old (settings module) way, and keep it possible to overwrite component configuration ad hoc via Spider settings or the command line.
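A hedged sketch of such a proxy add-on (attribute names and the callback signature are illustrative): all it does is translate its local config back into the traditional, upper-cased settings:

class HttpCacheAddon(object):
    name = 'httpcache'
    version = '1.0'
    settings_prefix = 'HTTPCACHE'

    def update_settings(self, config, settings):
        # the wrapped component itself is untouched; we only map the local
        # add-on config onto the settings it already understands
        settings.set('%s_ENABLED' % self.settings_prefix, True)
        for key, value in config.items():
            settings.set('%s_%s' % (self.settings_prefix, key.upper()), value)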

Add-on callback placement

Related to what I mentioned before, if we lose settings locality, is it required to instantiate the add-ons so early in the loading process? Could update_settings() and check_configuration() be class methods?

I'm not quite sure what you mean, aren't the callbacks already class methods in the common use case (i.e. when the add-on interface is provided by a class instance, such as in the scrapy.addons.Addon class)? I have used the call to Spider.update_settings() as orientation where the add-on callbacks should be placed. An add-on might want to replace the stats class, so I guess one line below where it is right now is where we should call them at latest. Then again, I just realised there's already some settings that cannot be changed in add-ons right now, like the spider loader class. I'm curious what you had in mind?

Add-on interface

+1 for dropping the requirement of check_configuration(), it is indeed very conceivable that many add-ons won't want to use this callback.

I like the additional attributes for further describing the add-on, and reusing setuptools/distutils/packaging etc., in particular for replacing the clumsy version check function I coded (I wasn't finished, I swear ;), but of course I could've figured that there's got to be a much better implementation ready out there).

As to the additional field for load order, I'm not really sure that it helps us. Opposed to the current implementation of component orders, the user would have no (easy) influence on the add-on load order. That means there needs to be some kind of communication between devs, or a guide with recommended load order numbers for different use cases. While that might be doable, it seems to me that this will become hard to maintain if full-blown dependency management is introduced at some point.

Add-ons configuring other add-ons

I'm just keeping this in here as a reminder to myself to think about it. As I outlined earlier, I think add-on configuration should be fixed as soon as the first update_settings() callback is called (depending on whether we want to introduce the load order attribute). This only leaves the option of introducing a third (optional) callback, configure_addons(), that is called before update_settings().

jdemaeyer commented 8 years ago

7eb837d is a rough version of my idea for the builtin add-ons, and it'd be cool to know if this is in line with the maintainers' ideas before I start writing tests and documentation.

In principle, the add-ons don't do much. All they do is act as a gateway to the 'traditional' settings. Neither the components nor Scrapy's default settings have been touched, so configuring components the old way still works as expected. Additionally, it allows changing the settings via the add-on mechanism. That is, if I want to enable and configure the HTTP cache middleware, instead of adding this to my settings.py:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 60
HTTPCACHE_IGNORE_HTTP_CODES = [500, 503]

I can now alternatively use this in my scrapy.cfg (or the equivalent in the ADDONS setting as outlined in the comments above):

[addon:httpcache]
expiration_secs = 60
ignore_http_codes = 500,503
; enabled is already implicitly set through the existence of this section

I think this is close to the idea of the original SEP-021, with the addition that each add-on has its own section.

If there ever comes a new milestone release where backwards-incompatible cleanup is encouraged, the add-on configuration could be removed from the global settings and live within the add-ons/components instead, making them fully independent of any Scrapy/settings ecosystem. Components with interdependencies could still expose some settings into the global settings namespace, or access other add-ons settings through the add-on manager. Or, as they are interdependent anyways, they could be configured through the same add-on.

curita commented 8 years ago

Sorry I took so long to reply, I'm still unsure on the alternatives for the add-ons configuration in settings.py, but I'll comment some thoughts I have about this matter. Most of them are things you already said but I wanted to write them to get my head around this.

Add-on configuration entry point I agree that it seems an odd design decision having the add-ons config in the settings file, but it definitely simplifies things. Out of both presented options, the ADDONS variable seems the more flexible one. I'm not crazy about it but I don't mind configuring add-ons this way in settings.py; Django uses this approach with multiple nested dicts a lot.

What I'm unsure about is whether we're overcomplicating how current components are configured. We would be changing these:

scrapy crawl myspider -s HTTPCACHE_EXPIRATION_SECS=60
class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'HTTPCACHE_EXPIRATION_SECS': 60,
    }

to these:

scrapy crawl myspider -s ADDONS={'httcache':{'expiration_secs': 60}}
class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'ADDONS': {
            'httpcache': {
                'expiration_secs': 60,
            }
        }
    }

and I think that's sort of a usability step back.

I wouldn't mind doing this in external components that I probably won't be configuring too often, but for builtin components I'm not so sure; this could become too tedious too soon. I know we're still supporting the current global settings, but if we're going to favor a single config entry point, that one should be as carefully considered as possible.

For this reason I started to prefer your first design option, since it loses the outermost enclosing dictionary, but a) the add-on configurations should be loaded with the other settings so they can be configured using the command line and custom_settings, though I think adding them as ADDONS_* isn't much of a problem besides cluttering the settings a little, and b) the ADDONS_ prefix isn't really much of a difference from the other approach; it's a workaround for defining a namespace.

So, I somewhat dislike having to explicitly state that you're configuring an add-on if that can be deduced some other way. A third option could be to copy some other functionality from django here (I wonder how many times that has been said in scrapy's history :smile:), I'm thinking something like:

INSTALLED_ADDONS = (
    'scrapy.addons.builtins.depth',  # or 'depth' maybe
    'scrapy.addons.builtins.httperror',
    ... # all builtins, this could be added to the settings.py template
    'path.to.mypipeline',
)

MYPIPELINE = { # uppercased add-on name
    'user': 'blah',
    ...
}

Not sure if that's possible though, but I kind of like copying Django's INSTALLED_APPS setting, and the separated add-on config looks a lot like the current settings:

HTTPCACHE = {
    'enabled': True,
    'expiration_secs': 60,
    'ignore_http_codes': [500, 503],
}

Add-on callback placement Don't mind the comment I made before; I forgot how zope.interface.Interface worked and thought it was limiting update_settings and check_configuration to be instance methods only. Then I thought about having the possibility to call those methods without instantiating the class (if they were classmethods), but I don't see any advantage over instantiating it and calling them later. I like the current callback placement.

Builtin add-ons I really like the overall implementation minus a detail: I pictured adding components to their related component-dict settings slightly differently. I was thinking add-ons managing a single component could have type and order fields in the interface (maybe order can be a config too) that could be used for automatically inserting themselves (or their path) into the component-dict setting defined by type.

I realize these builtin add-ons sort of encapsulate the actual components (they have to add the components' paths to the settings, not themselves), but maybe the inserted path/obj could be parametrized for this particular case.
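Something along these lines, maybe (all the field names here are just suggestions to make the idea concrete):

class MyPipelineAddon(object):
    name = 'mypipeline'
    version = '1.0'
    component = 'myproject.pipelines.MySQLPipeline'  # what to insert
    component_type = 'ITEM_PIPELINES'                # which dict setting
    component_order = 300                            # default order

    def update_settings(self, config, settings):
        # generic base-class behaviour: merge the component into the dict
        # setting named by component_type
        components = dict(settings.getdict(self.component_type))
        components[self.component] = config.get('order', self.component_order)
        settings.set(self.component_type, components)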

curita commented 8 years ago

Forgot to comment on the error handling in Crawler.crawl() :sweat:

I don't have a strong preference here, I don't think your last implementation dealing with failures in every deferred looks that ugly. Maybe all those try/excepts could be flattened by handling callbacks manually (i.e. without inlineCallbacks), adding the corresponding errbacks for exception handling to the deferreds, but I think your version is more readable. I like either updating the engine.stop() method or adding a new engine.close() too.

For monitoring pending calls, I like relying on trial to check whether the reactor is clean, seems simpler and I think I read this was a good practice in the twisted testing guidelines. trial probably calls reactor.getDelayedCalls() for this, maybe we could call it explicitly but I think leaving this to the testing suite is probably the best option.

About assertFailure: we have tried to get rid of trial sometime in the past because it wasn't fully python3 compatible. Removing the trial.TestCase inheritance wasn't possible though, because the pytest plugin we're using to mimic trial behavior doesn't work on unittest.TestCase classes. But seeing how we're already using assertFailure in other tests, and how we probably won't be changing that inheritance now, I agree there isn't any problem with using it.

jdemaeyer commented 8 years ago

A few small updates/thoughts:

I opted for a new ExecutionEngine.close() method instead of updating the stop() method. The reason is that I didn't want to remove the assert statement made in ExecutionEngine.stop() (after all, the whole idea was to clean up even if the engine wasn't running yet). Code: close() method and usage in Crawler.crawl(). I've also cleaned up the graceful shutdown test.

I've enhanced the base Addon class: it is now much easier to configure single components through the component, component_type, and component_key attributes and the export_component method. Thank you for the idea @curita, does that look somewhat like what you had in mind? I've also updated the built-in add-ons to make use of that new feature, and they look much tidier now.

@curita makes a very valid point about how the ADDONS dictionary is a usability step back. I like the Django-like INSTALLED_ADDONS approach and have added a corresponding method and test alongside the old methods. A few thoughts:

@kmike mentioned in one of his first posts if it would be possible that Spiders also implement the add-on interface and have their methods called through the same code path. I very much like that idea but so far couldn't wrap my head around a couple of questions: Should we support config for spiders and what would it be? Should the Spider be added to the add-on manager? Or should the add-on manager update_settings() etc. method have an optional keyword argument where the Spider is passed to call its callbacks?

Now that I've coded around the add-on manager quite a bit, I think @kmike was indeed right that it's not very useful to save the used_keys; I'll probably drop them soon.

I have refactored the check_dependency_clashes() method so that it now uses pkg_resources. But I wonder if we can make even more use of that package?

Last, I realise that this PR is getting very large (+2,000 lines, phew) and therefore hard to review. However I don't see how it could be split up into smaller chunks. Any ideas how I could make it more clearly laid out?

curita commented 8 years ago

I opted for a new ExecutionEngine.close() method instead of updating the stop() method. [...]

Love how clean that solution turned out; if you want, you can open a separate pull request with it and its test and we can merge it right away.

I've enhanced the base Addon class: it is now much easier to configure single components through the component, component_type, and component_key attributes and the export_component method. Thank you for the idea @curita, does that look somewhat like what you had in mind?

That was pretty much what I had in mind :) Not sure if I'd have added the abbreviated versions of the component_types though; they seem rather cryptic for the sake of saving keystrokes, and I'd prefer users to stick with the full names. Maybe I'd have added the component order as a class attr to match both the component and component_key placements (in addition to being a config option), but this isn't necessary, and it could be taken the wrong way by thinking the order can't be changed.

A stumbling block with this approach [INSTALLED_ADDONS] is that it's not possible to append to (rather than overwrite) the INSTALLED_ADDONS setting from the command line; not quite sure how to overcome that...

That's a prevailing problem with any list-valued setting :( I think as long as we can enable or disable add-ons using their configs (not sure if this is true for all add-ons though, maybe we could support a general enabled config in AddonManager?) it isn't a big drawback to have a fixed INSTALLED_ADDONS in settings.py listing all add-ons used across the spiders in a project. This is far from ideal (dynamically adding/removing add-ons to the list will be hard) but I think it's a reasonable compromise.

@kmike mentioned in one of his first posts if it would be possible that Spiders also implement the add-on interface and have their methods called through the same code path.

I don't recall right now any constraint on spiders being add-ons (hm, wait, couldn't calling the spider's update_settings() at the same time as the other add-ons' update_settings() conflict with the add-ons config? could a spider still disable another add-on this way?), but if there aren't any issues (f8cdf1a?) let's implement it :+1:

Should we support config for spiders and what would it be?

If possible, I'd like to see spider arguments mapped into the config. A neat side effect of this would be that spider arguments could be configured like any other regular setting.

Should the Spider be added to the add-on manager? Or should the add-on manager update_settings() etc. method have an optional keyword argument where the Spider is passed to call its callbacks?

The second option is the one you implemented in f8cdf1a right? I like it, I think this distinction should be helpful at some point.

I have refactored the check_dependency_clashes() method so that it now uses pkg_resources. But I wonder if we can make even more use of that package?

Not sure myself, I thought we could have a custom Addon class for add-ons defined inside packages and get version and requires from that package using pkg_resources but I'm not sure if that's even something that should be done with it.

Last, I realise that this PR is getting very large (+2,000 lines, phew) and therefore hard to review. However I don't see how it could be split up into smaller chunks. Any ideas how I could make it more clearly laid out?

Cleaning up the commit history should be a must to get the PR fully reviewed once it's done, but I don't see how its readability could be improved any further. Your code is really well documented, so don't worry about its length too much :)

jdemaeyer commented 8 years ago

If possible, I'd like to see spider arguments mapped into the config. A neat side effect of this would be that spider arguments could be configured like any other regular setting.

(Comment moved to #1442)

jdemaeyer commented 8 years ago

I think as long as we can enable or disable add-ons using their configs (not sure if this is true for all add-ons though, maybe we could support a general enabled config in AddonManager?)

couldn't calling the spider's update_settings() at the same time as the other add-ons' update_settings() conflict with the add-ons config? could a spider still disable another add-on this way?

I added the possibility to disable any add-on by setting its _enabled config setting to False; that should allow disabling add-ons both from the Spider's update_addons() method as well as from the command line.
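E.g., a sketch of disabling an add-on from a spider (assuming spiders share the update_addons() code path, as discussed above):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def update_addons(self, config, addonmgr):
        # switch the httpcache add-on off before its callbacks run
        addonmgr.configs['httpcache']['_enabled'] = False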

Once again, since there is currently no guarantee in which order add-on callbacks are called, setting an add-on's config['_enabled'] = False from a second add-on may not prevent the first add-on's update_addons() method from being called. However, it just struck me that, should we settle for the INSTALLED_ADDONS list setting, the user now has the possibility to order add-ons, so maybe I could replace the AddonManager._addons dictionary with an ordered dictionary (also, INSTALLED_ADDONS could be a dictionary-like {'addonpath': order} setting instead of a list, just like the component settings. That would also resolve the "not easy to append instead of overwrite" problem).

Note to self so I don't forget to remove some code later: 2039892 made the _source config superfluous.

codecov-io commented 8 years ago

Current coverage is 83.42%

Merging #1272 into master will increase coverage by +0.69% as of 48b077b

Powered by Codecov. Updated on successful CI builds.

jdemaeyer commented 8 years ago

hm, wait, couldn't calling the spider's update_settings() at the same time as the other add-ons' update_settings() conflict with the add-ons config? could a spider still disable another add-on this way?)

(Comment moved to #1442)

jdemaeyer commented 8 years ago

Rebased onto #1149 and squashed the earlier commits (up until a0816fa) into groups to ease reviewing :)

jdemaeyer commented 8 years ago

Ready for review! I know it's huge, sorry :)

jdemaeyer commented 8 years ago

Rebased.

While this is a 'ready' implementation, there are some design questions that should be reconsidered.

Bold alternatives mark design changes I have implemented in later commits.

Major:

Add-on configuration entry point
Current implementation: Preferably, add-ons are enabled through a tuple setting, INSTALLED_ADDONS (inspired by Django), and configured through an ADDONNAME setting. Additionally (but not preferred), add-ons can also be enabled and configured with one section per add-on in scrapy.cfg.
Alternatives:
  • Make INSTALLED_ADDONS (maybe just ADDONS?) a dictionary, so it is consistent with the middleware/extension settings
  • Make configuration through scrapy.cfg the preferred way but keep configuration through settings to allow using add-ons outside of projects and ad-hoc configuration on the command line (emphasises the "add-ons are above settings" idea)
  • Completely drop support for scrapy.cfg to avoid splitting up the configuration entry point (somewhat opposes the "super-simple to configure for users" goal)
  • Drop support for configuration through settings in favour of scrapy.cfg (emphasises the "add-ons are above settings" idea but makes using add-ons outside of projects, and configuring add-ons from the command line, impossible)

In between:

Add-ons enabling/configuring other add-ons
Current implementation: In an update_addons() callback (and only there), add-ons can load/enable and configure other add-ons. They cannot use the INSTALLED_ADDONS setting but must go through the add-on manager methods.
Alternatives:
  • Drop support for add-ons configuring other add-ons (renders creating "umbrella add-ons" impossible, and makes add-ons slightly less user-friendly because they cannot automatically enable other add-ons they require)

Callback signatures
Current implementation: The callback signatures are quite restrictive to avoid add-on authors misusing them, i.e. the update_addons() callback only receives the add-on manager but not the settings, and the update_settings() callback gets only the settings but not the crawler or the add-on manager.
Alternatives:
  • Hand over the crawler to all callbacks to make them more versatile and rely on developers to know what they're doing

Add-ons for builtin extensions/middlewares
Current implementation: For every builtin extension/middleware, there is a corresponding add-on. This add-on simply proxies the configuration into the global settings. Users can now do most of the configuration in scrapy.cfg through these proxy add-ons if they wish.
Alternatives:
  • Drop built-in add-ons to avoid creating a new configuration entry point for the built-in components

Dependency management
Current implementation: Add-ons can provide information about their dependencies through provides, modifies and requires attributes. The check_dependency_clashes() function checks these for clashes with the help of pkg_resources.
Alternatives:
  • Make sure that these really are the three attributes we want to use
  • Outsource dependency management into a new PR to allow dedicated development/discussion there while moving forward with the add-on framework

Minor:

Add-on manager name
Current implementation: AddonManager. I totally see your point @kmike, I just can't think of a better name ;) The add-on manager is very multi-purpose (enable add-ons, find add-ons, provide a configuration entry point, check dependencies, call callbacks, ...). Maybe it is too monolithic?

Finding add-ons
Current implementation: When passing httpcache to the AddonManager.get_addon() method, it will first look for a module/object httpcache on the Python path, then for projectname.addons.httpcache, then for scrapy.addons.httpcache. This allows slightly more user-friendly addressing of add-ons.
Alternatives:
  • Drop looking for projectname.addons. because the project name cannot always be detected reliably, possibly leading to inconsistent behaviour
  • Drop looking for projectname.addons. as well as for scrapy.addons., forcing the user to always write out the full path, consistent with the middlewares and extensions

Interface declaration
Current implementation: The scrapy.interfaces.IAddon zope interface only requires the existence of a name and version attribute.
Alternatives:
  • Find out how to enforce the signature of a method without enforcing its existence (add-ons should be able to choose to implement only one or two callbacks)

jdemaeyer commented 8 years ago

My humble opinion on these:

Since the summer, I've changed my mind about a few things quite a bit. Given that Scrapy is a framework rather than a do-it-all product with a shiny GUI, I think the focus really should lie on consistency and versatility instead of abandoning these for user-friendliness.

I reeeaally love the ease and syntax of managing add-ons from scrapy.cfg. But I fear that allowing two configuration points will bite us in the ass at some point. Right now I tend toward completely dropping the scrapy.cfg support (after all, how many users won't be able to handle the simple Python syntax in settings.py?).

Somewhat as a consequence, I don't really see a reason to keep the built-in add-ons without scrapy.cfg support. While the cfg syntax would indeed be much more beautiful, within settings.py there really isn't much advantage to this:

HTTPCACHE = {
    'dir': 'blah',
    'expiration_secs': 99,
}

over this:

HTTPCACHE_DIR = 'blah'
HTTPCACHE_EXPIRATION_SECS = 99

With no built-in add-ons, there also isn't much reason to keep the "prefix scrapy.addons." logic in get_addon() either.

For the sake of versatility, I think all callbacks should receive the crawler object (or maybe (crawler, settings, addons) as shortcuts for developers), and that the "add-ons configuring other add-ons" logic should stay in, with a fat note to developers that they cannot use the INSTALLED_ADDONS setting but must use crawler.addons.add().

For consistency with middlewares/extensions, I think it makes sense to change the INSTALLED_ADDONS tuple into an ADDONS dictionary, with the order existent but probably not really relevant (as with EXTENSIONS). This should also make it easier to disable add-ons from the command line.

jdemaeyer commented 8 years ago

Just making sure I haven't deviated too much from the original idea while working on this (yay for tunnel vision):

The main goals of add-ons are:

Add-ons could then be defined as follows:

Add-ons are pluggable configuration units. An add-on contains a set of configuration settings and checks. All add-ons are enabled at a single entry point.

Everyone d'accord? :)

kmike commented 8 years ago

I'm not super-familiar with the original idea, but this description sounds perfect, +1 to have an implementation in Scrapy.

kmike commented 8 years ago

A nice table, makes it much easier to understand the PR :+1:

jdemaeyer commented 8 years ago

Callback signatures - I like that addons don't know about a crawler until check_configuration is called. It makes it possible to build an almost final Settings object without a Crawler (?). Are there use cases for passing a Crawler earlier?

There might be some edge cases, like needing access to the signals really early, although I don't see a concrete example right now. Another edge case might be monkeypatching the crawler methods by overwriting them... Passing the add-on manager to update_settings() might be useful when the settings depend on other add-ons' configurations.

I don't think these outweigh the pros of being able to construct an almost final (why almost?) settings object without a crawler though.

Add-ons for builtin extensions/middlewares - the implementation doesn't look DRY - priorities and default options are currently duplicated (?). I think it makes sense to convert extensions to addons only if they have a use of addon features. Maybe short addon names is enough to justify creating addons for built-in components though. But it should be DRY.

The default options still live in default_settings.py, but the priorities are indeed duplicated. That was a preparation for being able to replace the XY_BASE settings (or the defaults in XY depending on how we proceed with #1149) with just a default ADDONS_BASE (or ADDONS) setting, but for now it should probably be removed (we can still do that later since the internal structure of the built-in add-ons is not public API).

Upon further thought, I think the built-in add-ons should stay. After all, if we promote a single configuration entry point for "functionality enhancements" by using ADDONS over DOWNLOADER_MIDDLEWARES and similar, we should also promote using

ADDONS = {
    # ...
    'httpcache': 0,
}

over

HTTPCACHE_ENABLED = True

Dependency management - it may be nice to have, but my vote is to do it later, in another PR.

+1

Add-on manager name - What do you think about the following? Rename the class to Addons - it represents a set of addons, not something which just manages a set of externally maintained addons;

I like that. Maybe AddonSet? And load_dict and similar could then be renamed to add_from_dict, which would make it look more like Python's built-in set. For ii-v, I don't see an advantage that would offset the hassle of having to merge different Addons instances.

Finding add-ons - My vote is to support 1. setuptools entrypoints and 2. python paths

+1 for supporting entry points. The scrapy.addons prefix is purely for the built-in addons (so users can write httpcache instead of scrapy.addons.httpcache).
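For reference, packaging a third-party add-on with a setuptools entry point could look roughly like this (the entry-point group name is a placeholder, nothing decided in this PR):

```python
# setup.py of a hypothetical third-party add-on package (sketch only)
from setuptools import setup

setup(
    name='scrapy-myaddon',
    version='0.1',
    py_modules=['scrapy_myaddon'],
    entry_points={
        # the group name 'scrapy.addons' is an assumption for illustration
        'scrapy.addons': [
            'myaddon = scrapy_myaddon:MyAddon',
        ],
    },
)
```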

+1 to use ADDONS name instead of INSTALLED_ADDONS. Regarding priorities - hm, why don't they matter?

It's not that they don't matter at all, but that they probably won't matter much, since add-ons are unlikely to be heavily cross-dependent - just like the priorities of EXTENSIONS often don't matter.

jdemaeyer commented 8 years ago

Rebased onto #1586 and implemented the following design changes:

chekunkov commented 8 years ago

Add-on configuration entry point

Completely dropped support for configuring add-ons in scrapy.cfg

+1

Replaced INSTALLED_ADDONS tuple with ADDONS dictionary as entry point for add-on configuration

hm, -1 if the only reason for this change is "so it is consistent with the middleware/extension settings"

Add-ons enabling/configuring other add-ons

Can you point me to the part of doc which describes how this cross-configuration is going to work?

Callback signatures

generally I like the idea of being more strict, but it should be justified - what could go wrong if we give

The callback signatures are quite restrictive to avoid add-on authors misusing them

what is understood by misuse? example?

Add-ons for builtin extensions/middlewares

agree with @kmike https://github.com/scrapy/scrapy/pull/1272#issuecomment-152682302

also, one of the current implementation points is irrelevant now that scrapy.cfg support is dropped - "Users can now do most of the configuration in scrapy.cfg through these proxy add-ons if they wish"

Dependency management

+1 to separate PR, agree with @kmike https://github.com/scrapy/scrapy/pull/1272#issuecomment-152682302

Add-on manager name

Honestly, I don't see a big conceptual difference between AddonManager and Addons as long as it doesn't change how it works :)

(2-5) is very minor, to decrease usage of mutable state.

load_cfg is already removed, so there are only load_settings and load_dict to decide on. load_settings is used once during Crawler.__init__. What's the expected usage of load_dict? I see the implementation, but what is the expected use case?

Finding add-ons

Enabling the http cache now requires ADDONS = {'scrapy.addons.httpcache': 0} instead of ADDONS = {'httpcache': 0}

+1 to remove damn magic

Interface declaration

@jdemaeyer can you explain what the problem with the current implementation is?

jdemaeyer commented 8 years ago

Thanks a lot for the feedback, you guys!

Replaced INSTALLED_ADDONS tuple with ADDONS dictionary as entry point for add-on configuration

hm, -1 if the only reason of this change is "so it is consistent with the middleware/extension settings"

It's basically the same rationale with which ITEM_PIPELINES became a dictionary rather than a list. I have to admit that I still like the idea of ADDONS being a tuple/list, though. As the "update-not-replace" behaviour has been revoked (#1586), lists/tuples are again no harder to update from custom_settings or the command line than dicts. On the other hand, being able to disable components by setting their dictionary value to None comes in handy at times.

Add-ons enabling/configuring other add-ons

Can you point me to the part of doc which describes how this cross-configuration is going to work?

Here's an example: https://github.com/jdemaeyer/scrapy/blob/enhancement/addons/docs/topics/addons.rst#enabling-and-configuring-add-ons-within-python-code

Callback signatures

generally I like the idea of being more strict, but it should be justified - what could go wrong if we give [...] what is understood by misuse? example?

I don't think there's much that could really "go wrong"; it's more a notion of guiding people. E.g. in update_addons(), they should not make decisions based on setting values, as these values are still subject to change (e.g. in other add-ons' update_settings()). Not putting the crawler into the callback signatures has the advantage pointed out by Mike above (being able to construct a final settings object without needing a crawler).

Add-ons for builtin extensions/middlewares

I've removed the repetition of the default settings. Most of the built-in add-ons only provide a second way to configure the built-in extensions; I think we should drop these.

There are some that will overwrite default settings when activated, e.g. enabling the httpcache add-on will automatically set HTTPCACHE_ENABLED = True. What do you guys think about these? I think they make sense if we want to push users towards enabling extra functionality (built-in or not) through the ADDONS setting.
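For illustration, such a built-in proxy add-on could look roughly like this (a sketch; the callback signature and priority handling are simplified, not the actual implementation in this PR):

```python
# Sketch of a built-in 'httpcache' proxy add-on; signatures are assumptions.
class HttpCacheAddon(object):
    name = 'httpcache'
    version = '1.0'

    def update_settings(self, config, settings):
        # Enabling the add-on implies enabling the extension...
        settings.set('HTTPCACHE_ENABLED', True)
        # ...and the add-on config is proxied into the prefixed global
        # settings, e.g. {'expiration_secs': 99} -> HTTPCACHE_EXPIRATION_SECS.
        for key, value in config.items():
            settings.set('HTTPCACHE_%s' % key.upper(), value)
```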

what's expected usage of load_dict? I see implementation, what is expected usecase?

It's a low-maintenance convenience helper for working with the AddonManager. It's not used in Scrapy itself, but I've used it in some of the tests.

Interface declaration

can you explain what's the problem with current implementation?

It's a minor detail really, I just wanted the list to be complete ;) I don't want to enforce that add-ons have an update_settings() method (they might not need one, after all). But when they do have one, I want to enforce our call signature. However, I don't think that's possible with zope.interface right now.
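For reference, the interface in question is minimal, roughly along these lines (a sketch, not the exact code from the PR):

```python
from zope.interface import Attribute, Interface

class IAddon(Interface):
    """Minimal add-on interface: only name and version are required."""
    name = Attribute("Add-on name")
    version = Attribute("Add-on version string")
    # Callbacks such as update_settings()/update_addons()/check_configuration()
    # are deliberately not declared: declaring a method here would make it
    # mandatory, and zope.interface has no obvious way to express
    # "optional, but if implemented it must have this exact signature".
```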

jdemaeyer commented 8 years ago

Rebased onto master; this PR now once again contains no commits from other PRs. (Finally the green checkmark is back :))

pawelmhm commented 7 years ago

this is so huge - is there some way to save pieces of this work by splitting it into smaller PRs? e.g. I'm looking forward to the changes in load_object; is it ok if I steal them from here (cherry-pick) and create a new PR with that? https://github.com/scrapy/scrapy/pull/1272/files#diff-2536f98aac2c474ec1f424c8a943b9d2

coveralls commented 7 years ago


Changes Unknown when pulling a65fc0db7d7d728bc8abbb37eff358870ce7a9b5 on jdemaeyer:enhancement/addons into scrapy:master.