Add-ons - Githubissues

scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

https://scrapy.org

BSD 3-Clause "New" or "Revised" License

51.16k stars 10.35k forks source link

Add-ons #1272

Closed jdemaeyer closed 9 months ago

jdemaeyer commented 8 years ago

Based on ~~#1149 and #1423~~ ~~#1586~~ master

Closes #591 and #1215

Implementation of SEP-021 (updated version from this PR, original version)

Design decision to-do here

Old to-do:

Design decisions
- [x] Configuration entry point (settled on INSTALLED_ADDONS setting and additional support for scrapy.cfg so far)
- [x] Integration into start-up process
- [x] Add-on interactions (can add-ons configure other add-ons?)
- [x] Find better name for AddonManager?
- [x] Update SEP
Code
- [x] Add-on interface
- [x] Base Addon class
- [x] Load add-on configuration from scrapy.cfg
- [x] Load add-on configuration from settings module
- [x] Search add-ons by name
- [x] Improve get_project_path() robustness (done partially, still fails when settings module does not live in project's root)
- [x] Load & verify add-ons
- [x] Add-on "holding" (a place where loaded add-ons live)
- [x] Helper functions to call add-on callbacks
- [x] Dependency clash management
- [x] ~~Add version support~~ Use pkg_resources' extensive dependency management
- [x] ~~Should pkg_resources be used throughout the manager and not just within the dependency checking?~~ Will go to separate PR if we decide to
- [x] Integrate into standard start-up process
- [x] Integrate dependency clash checks
- [x] Allow putting objects (not just class paths) in component lists, e.g. PIPELINES (check #1215)
- [x] Write add-ons for built-in components
- [x] Fix add-ons for components which do not prepend their settings with their name
- [x] ~~Move settings for built-in components from global namespace to add-ons (is this possible without making them harder to configure?)~~ Allow setting global namespace settings via add-on framework
- [x] Allow spider to implement add-on callbacks (like it already does with update_settings())
Tests
- [x] Base Addon class
- [x] Loading configuration from cfg files
- [x] Loading configuration from Python modules
- [x] Search add-ons
- [x] Load and verify add-ons
- [x] Callback helper functions
- [x] Dependency clash management
- [x] Integration into start-up process
- [x] Fix test for check_configuration()
- [x] Putting objects in component lists
Documentation
- [x] Docstrings for AddonManager, then use autodoc in docs/topics/api.rst
- [x] New topic on add-ons
- [x] Mention _addon attribute, more versatile examples (using _addon, module as add-on)

jdemaeyer commented 8 years ago

I think one of the major open questions is where the add-on manager should live. Here's an excerpt from my proposal on that (I have opted for the inside of crawler option for the SEP so far):

Integration into existing start-up process

The question where the add-on manager should be instantiated and when the add-on callbacks should be called is not as obvious as it may seem at first glance. Outlined below are three options and their rationale:

Outside of crawler

Besides the different priority and level of direct user involvement, settings made by the add-on callbacks are not different from settings made in settings.py. The callbacks should therefore be called around the same time that the settings module is loaded. This puts the instantiation of an AddonManager instance outside of the Crawler instance.

Given that the project settings are read in the get_project_settings() function in scrapy.util.project, this seems a reasonable place to call AddonManager.update_settings(). However, we cannot instantiate the add-on manager within this function, as the function is left (and the manager would therefore be lost) long before the crawler becomes ready (when we wish to call the second round of add-on callbacks).

There are two possible approaches to instantiating the add-on manager outside of get_project_settings():

In a normal Scrapy start via the command-line tool, calling get_project_settings() is embedded into the execute() function in scrapy.cmdline. In summary, the changes to this function would be (with analogoue changes in scrapy/conf.py, where necessary, for backwards-compatibility):
1. Instantiate add-on manager before calling get_project_settings()
2. Pass add-on manager to get_project_settings() when calling it (the function then calls update_settings()).
3. Connect the manager's check_configuration method to the engine_started signal (this could also be done in the add-on managers 'init()' method)
Alternatively, we could (grudgingly) introduce another singleton to Scrapy (e.g. scrapy.addonmanager.addonmanager). This would allow moving the above code related to add-on management above into the more appropriate get_project_settings() function.

Integrating add-on management outside of the crawler ensures that settings management, except for spider-specific settings, remains within a rather small part of the start-up process (before instantiating a Crawler): the helper function get_project_settings() indeed keeps returning the full set of project (i.e. non-spider) settings. The downside is that it either introduces a new singleton or clutters up a function (execute()) that should be more about parsing command line options and starting a crawler process than about loading add-ons and settings.

Inside of crawler

The settings used to crawl are not complete until the spider-specific settings have been loaded in Crawler.__init__(). Add-on management could follow this approach and only start loading add-ons when the crawler is initialised.

Instantiation and the call to AddonManager.update_settings() would happen in Crawler.__init__(). The final checks (i.e. the callback to AddonManager.check_configuration()) could then again either be tied to the engine_started signal, or coded into the Crawler.crawl() method after creating the engine.

Integrating add-on management inside of the crawler avoids introducing a new singleton or cluttering up the execute() function, but rips apart compiling the complete configuration settings. This is especially critical since many of the settings previously made in settings.py will move to scrapy.cfg with the implementation of this proposal, and may prove backwards-incompatible since the get_project_settings() helper function no longer works as expected (and as its name suggests).

kmike commented 8 years ago

Hey @jdemaeyer,

I've read your proposal and the SEP for the first time, so please excuse me if I'm missing some context :)

Is it a coincidence that scrapy.Spider implements update_settings method which addons will implement?
Addons are utilities to auto-update and check Settings objects? Why do they need Crawler?
How can user activate an addon when crawling is started by CrawlerRunner or CrawlerProcess?
How can user enable an addon if there is no Scrapy project?
How to enable an addon if scrapy runspider is used to start a spider?
Why is scrapy.cfg needed? It is confusing to have two config files, and it is an unnecessary boilerplate if you don't have a Scrapy project.
In the SEP it is said that MiddlewareManager.from_settings() will be changed to allow python objects instead of class paths. Why should it be specific to MiddlewareManager? (see discussion at https://github.com/scrapy/scrapy/pull/1215).
Do you think it is possible to make Scrapy addons feature less framework-ish? In the current proposal Scrapy calls various AddonManager methods during the standard scrapy crawl init sequence, and AddonManager calls various addon methods itself. Can we made it so that user can also use the AddonManger (or individual addon) to update or check settings, without invoking all the scrapy crawl machinery?
It looks like the proposed AddonManager does two unrelated things:
- it loads addon classes from scrapy.cfg;
- it uses loaded addons to update settings and check crawler.
IMHO it is better to separate those. Scrapy users who don't use scrapy crawl command may prefer to import addon classes themselves, or these classes can be in the same module as CrawlerProcess call and a spider, all in the same self-contained Python script.

// a rant: I think it is usually a bad idea to name somathing ..Manager :) "Manager" tells us that a class "manages" something. It is quite broad, so 1) often such classes end up doing several unrelated things and 2) it is not clear what class is doing just from the class name - it can do anything. In Scrapy we renamed SpiderManager to SpiderLoader because of that. It turns out this is not only mine opinion.

jdemaeyer commented 8 years ago

Hey @kmike, thanks for the feedback!

1 . Is it a coincidence that scrapy.Spider implements update_settings method which addons will implement?

That is on purpose. Spider.update_settings() and the add-on update_settings() both do roughly the same, i.e. update settings outside of settings.py, albeit with a different level of "generality"

2 . Addons are utilities to auto-update and check Settings objects? Why do they need Crawler?

Add-ons should create a plug-and-play experience for the user. For checking that everything "Works As Advertised™", and provide help when it doesn't, they need to know the actual state of the crawler, and not just how its state is supposed to be from the configuration. E.g. an extension listed in settings['EXTENSIONS'] could have thrown NotConfigured. Add-ons can only detect this if they have access to the crawler.

3 . How can user activate an addon when crawling is started by CrawlerRunner or CrawlerProcess?

Hm, don't both of these instantiate Crawler objects where the add-on manager would then be instantiated and load add-ons? Then again I guess it again comes down to whether a scrapy.cfg exists in these use cases (also see answer to 6).

4 . How can user enable an addon if there is no Scrapy project? 5 . How to enable an addon if scrapy runspider is used to start a spider?

That would not be possible with the current suggested implementation. Just like you cannot have a settings.py file or use scrapyd-deploy in these cases. It could be made possible if add-on configuration were to live in settings using per-spider settings. I'll put the (current) rationale for scrapy.cfg below.

6 . Why is scrapy.cfg needed? It is confusing to have two config files, and it is an unnecessary boilerplate if you don't have a Scrapy project.

I opted for scrapy.cfg for two reasons:

Its syntax is more user-friendly for uninitiated users
It makes it easy to keep settings local through the ini sections. E.g. something like this:
```
 [mysql-pipe]
 server = some.server
 port = 1234
 user = a
 password = b

 [mongodb-pipe]
 server = some.other.server
 user = c
 password = d
```
is much easier in scrapy.cfg than in settings.py, and would not have to expose anything into the settings object, avoiding possible nameclashes.

I think if done properly, users won't have to touch settings.py for any of the shipped add-ons at all (instead they could set, say HTTPCACHE_DIR, in the scrapy.cfg. We could even tidy up the settings namespace this way). Those users that then need to touch settings.py will most probably know what they're doing anyways.

7 . In the SEP it is said that MiddlewareManager.from_settings() will be changed to allow python objects instead of class paths. Why should it be specific to MiddlewareManager? (see discussion at #1215).

Uh, I didn't see #1215 before, that looks good. The SEP goes a little further than the PR though: It suggests to allow not only classes, but instances. This will allow add-ons to instantiate components and keep settings as local as possible. Objects retrieved from X = load_object() are (so far) always assumed to be classes and then instantiated with y = X(). That should probably stay that way, and it is a good question where the "allow instances" code could live so it is as generic as possible.

8 . Do you think it is possible to make Scrapy addons feature less framework-ish? In the current proposal Scrapy calls various AddonManager methods during the standard scrapy crawl init sequence, and AddonManager calls various addon methods itself. Can we made it so that user can also use the AddonManger (or individual addon) to update or check settings, without invoking all the scrapy crawl machinery?

Hm, I'd say it is already quite modular. The user could in principle instantiate their own AddonManager and hand it the .cfg file. Or they could load an add-on object and call its update_settings() method directly.

9 . It looks like the proposed AddonManager does two unrelated things:

it loads addon classes from scrapy.cfg;

it uses loaded addons to update settings and check crawler. IMHO it is better to separate those. Scrapy users who don't use scrapy crawl command may prefer to import addon classes themselves, or these classes can be in the same module as CrawlerProcess call and a spider, all in the same self-contained Python script.

That is a good point. The Manager idea arised from the #591 discussion, but now that I come to think of it splitting it up seems to make more sense. Also, it would avoid having a "...Manager" named class :). The loaded add-ons need to live somewhere between loading <-> updating <-> checking, though. With the current SEP, that could be in an attribute of Crawler (much like extensions).

Maybe like this?

An AddonLoader class, or even only a function, is handed the scrapy.cfg file and provides/returns an { 'addon name': (<addon object>, <addon config>) } dict. The check_dependency_clashes() functionality also lives within this class/function.
The addons dict is given to the Crawler object upon instantiation, much like settings.
Crawler calls the add-ons update_settings() and check_configuration() functions at appropriate times.
If the user does not want the Crawler to call the callbacks (e.g. because she does it somewhere else), she hands over an empty addons dict on instantiation. In this case, it might be tricky that the check_configuration() callback requires the crawler object.

I'm still quite unhappy about the scrapy.cfg/settings.py issue. It is indeed annoying overhead to have two configuration files. But then, I think it makes sense that add-ons and their configuration are not present in any Settings instance, and the scrapy.cfg syntax is very much suited for user-friendliness and local settings.

kmike commented 8 years ago

Hi @jdemaeyer,

That is on purpose. Spider.update_settings() and the add-on update_settings() both do roughly the same, i.e. update settings outside of settings.py, albeit with a different level of "generality"

Hm, do you think we can use exactly the same code path for Spider then? I.e. maybe spider can implement addon interface itself? I'm not sure when can it be useful, but why not :) This way a spider will be able not only to update settings, but also to check them.

It makes it easy to keep settings local through the ini sections. E.g. something like this: (... snip ...) is much easier in scrapy.cfg than in settings.py, and would not have to expose anything into the settings object, avoiding possible nameclashes.

I don't think it is a problem to separate sections in settings.py. Django-like approach works:

MYSQL_PIPE = dict(
    server='some-server',
    port=1234,
    user='foo',
    password='bar'
)

Maybe like this? ...

Yeah, your approach looks good to me.

Some questions:

Why are { 'addon name': (<addon object>, <addon config>) } dicts needed, what are keys ('addon names') for? Can it be a list of (<addon class>, <addon config>) pairs, or just a list of instantiated addon objects?

I'm still quite unhappy about the scrapy.cfg/settings.py issue. It is indeed annoying overhead to have two configuration files. But then, I think it makes sense that add-ons and their configuration are not present in any Settings instance, and the scrapy.cfg syntax is very much suited for user-friendliness and local settings.

What's the problem with having ADDONS list in settings.py? It means one less config file and more flexibility - addons will be able to disable or enable or check other addons. This way you also avoid changing Crawler API and other realted APIs - no need to add 'addons' argument. I think that two config files is less user-friendly than one config file. By default settings.py may contain only this ADDONS option.

jdemaeyer commented 8 years ago

I've written the add-on interface spec and two add-on loaders, one loading from scrapy.cfg and one loading from settings.py.

Is there a way I can retrieve the project module name? I could use the first part of os.env['SCRAPY_SETTINGS_MODULE'] but it seems to me that there should be a cleaner way.

Again, thanks for the feedback @kmike!

Hm, do you think we can use exactly the same code path for Spider then?

I think that should be possible, but I'm not sure how enforce that the update_settings() method is not called twice.

I don't think it is a problem to separate sections in settings.py. Django-like approach works:

One problem with the current SEP is that the add-on name (= variable name) is allowed to be a python path to the add-on, but python does not allow dots in variable names. I have therefore allowed a _name variable in the settings.py dicts for now.

Why are { 'addon name': (, ) } dicts needed, what are keys ('addon names') for? Can it be a list of (, ) pairs, or just a list of instantiated addon objects?

The keys are simply whatever the addon was called in scrapy.cfg/settings.py. I'm not yet fully sure if we can drop it this early so I'd prefer to leave it in. It cannot be a list of instantiated add-on objects with the current SEP b/c we also allow add-ons to be modules right now (i.e. cannot be instantiated)

I don't really like the idea of exposing add-ons in the global settings, because they are essentially "one level above it": Add-ons fiddle with settings and extensions, but never the other way around. That's why I've designed the settings module add-on loader to only load variables starting with addon_ (in lower case, so they're not picked up by Settings.setmodule()) so far. It would be cool to allow add-ons to load/configure other add-ons, though. In that case an add-on manager could become useful again...

jdemaeyer commented 8 years ago

After some more coding, I've deciced to again merge the AddonLoader and what I called the AddonHolder into an AddonManager. There are just too many cross-dependencies: I wanted the holder to still be able to load add-ons last minute (allowing add-ons to load other add-ons, though I'm not exactly sure whether we want to allow this) and check dependency issues, at the same time get rid of the confusing { used_key: (addon, configuration) } structure if the loader.

Now, the AddonManager provides two class methods that the user may use when she doesn't want to use the 'holding' part, and provides convenience methods to add add-ons in case she doesn't want to use the 'loading' part.

I have integrated the add-on system into Scrapy's start-up process (it currently reads from both scrapy.cfg and the settings module), but did no testing of a "fully integrated" system so far.

One issue currently in my head is: Do we allow add-ons to load/configure other add-ons? I have not fully thought through the implications of this. The main problem I see is that at the moment the first callback of an add-on is called (update_settings()), other add-ons might already have been called. I could see this work if we require update_settings() to not do any settings introspection/dependency checking/interaction with other add-ons whatsoever. Being able to load/configure an add-on at this point could be useful for "umbrella add-ons" that load other add-ons depending on their configuration

jdemaeyer commented 8 years ago

I've put some thought into the "add-ons loading/configuring other add-ons" issue. The big problem with the current implementation is that the first time code of an add-on is executed is during the first round of callbacks to update_settings(). Should an add-on load or reconfigure another add-on here, other add-ons might already have been called. While it is possible to ensure that the update_settings() method of the newly added add-on is called, there is no guarantee (and in fact, it is quite unlikely) that all add-ons see the same add-on configuration in their update_settings().

I see three possible approaches to this:

Forbid add-ons from loading or configuring other add-ons. In this case 'umbrella add-ons' would not be possible and all cross-configuration dependencies would again be burdened onto the user.
Forbid add-ons to do any kind of settings introspection in update_settings(), instead only allow them to do changes to the settings object or load other add-ons. In this case, configuring already enabled add-ons should be avoided, as there is no guarantee that their update_settings() method has not already been called
Add a third callback, update_addons(config, addonmgr), to the add-on interface. Only loading and configuring other add-ons should be done in this method. While it may be allowed, developers should be aware that depending on config (of their own add-on, i.e. the one whose update_addons() is currently called) is fragile, as once again, there is no guarantee in which order add-ons will be called back and whether some other add-on will want to touch config.

Currently I have implemented option 2. I have not quite made up my mind about it just yet, but I think I option 3 is my favourite for now.

jdemaeyer commented 8 years ago

I'm trying to write a test that ensures the crawler breaks down when an add-on's check_configuration() callback raises an exception, but I don't know enough about Twisted to understand why it doesn't work :/

Here's the test, put into the CrawlTestCase in tests/test_crawl.py:

    @defer.inlineCallbacks
    def test_abort_on_addon_failed_check(self):
        class FailedCheckAddon(Addon):
            NAME = 'FailedCheckAddon'
            VERSION = '1.0'
            def check_configuration(self, crawler):
                raise ValueError
        addonmgr = AddonManager()
        addonmgr.add(FailedCheckAddon())
        crawler = get_crawler(SimpleSpider, addons=addonmgr)
        yield self.assertFailure(crawler.crawl(), ValueError)

And this is the output I get from tox:

====================================================== FAILURES ======================================================
___________________________________ CrawlTestCase.test_abort_on_addon_failed_check ___________________________________
NOTE: Incompatible Exception Representation, displaying natively:

DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x7ff8a481a950 [59.7946248055s] called=0 cancelled=0 LoopingCall<60>(Downloader._slot_gc, *(), **{})()>

I've also tried assertRaises, that tells me:

FailTest: ValueError not raised (<Deferred at 0x7fea3d2b3e60 current result: <twisted.python.failure.Failure <type 'exceptions.ValueError'>>> returned)

curita commented 8 years ago

The twisted error when testing check_configure() is a little tricky, and it's actually something that wasn't handled before.

====================================================== FAILURES ======================================================
___________________________________ CrawlTestCase.test_abort_on_addon_failed_check ___________________________________
NOTE: Incompatible Exception Representation, displaying natively:

DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x7ff8a481a950 [59.7946248055s] called=0 cancelled=0 LoopingCall<60>(Downloader._slot_gc, *(), **{})()>

This output means there's a pending call that wasn't stopped properly. In that case is a twisted LoopingCall started by the Downloader. The downloader is stored in the engine, so that call reference can be accessed in a crawler as crawler.engine.downloader._slot_gc_loop.

The engine is instantiated in Crawler.crawl(). Problem is, that call is already running when check_configure() is called and fails in FailedCheckAddon, but there is no place that addresses that and closes the downloader (which stops the looping call) properly. Not sure why that wasn't handled for errors happening after creating the engine, I guess it's not noticeable unless running tests that tell you about it.

So, there are two ways of fixing this. First one is to move the line calling addons.check_configuration() above the instantiation of the engine. This would fix the test but doesn't help solving the underlining problem. Second approach would be to close the downloader if any problem raises from the commands after the engine creation. Something like:

try:
    self.spider = self._create_spider(*args, **kwargs)
    self.engine = self._create_engine()
    self.addons.check_configuration(self)
    ...
   except Exception:
       self.crawling = False
       if self.engine is not None:
            self.engine.downloader.close()
       raise

should be enough. If you want to you can implement this fix in this pr (or in another one) or we can implement it at some other time and follow the first approach for now.

Last thing about the test: I'd refrain to use self.assertFailure() if possible, it's something present in trial and not in the builtin unittest that can be omitted. This should work too:

@defer.inlineCallbacks
def test_abort_on_addon_failed_check(self):
    class FailedCheckAddon(Addon):
        NAME = 'FailedCheckAddon'
        VERSION = '1.0'
        def check_configuration(self, crawler):
            raise ValueError
    crawler = get_crawler(SimpleSpider)
    crawler.addons.add(FailedCheckAddon())
    with self.assertRaises(ValueError):
        yield crawler.crawl()

curita commented 8 years ago

I think there are some things to discuss about the design, I took note on some questions and suggestions I had so we can talk them through.

There are some concerns about the usage of the scrapy.cfg file to set the addons configuration. I agree here with @kmike that we need to support using the current settings for a couple of reasons:

I'm not sure if we can maintain the locality of the addons configuration. This could be backward incompatible when scrapy components get ported, there could be addons that depend in other addons config. Not sure if I'm stretching the issue, but at least it was mentioned in #1334 (sharing a retry setting with redirect middlewares).
As mentioned earlier in the pr, scrapy.cfg can't be the only way of configuring addons because having a project is not required for running scrapy, setting core components is something that should be possible with single spiders.
We can't remove the already present flexibly of configuring components dynamically via command line or custom_settings.

Both options, using the current settings and the scrapy.cfg files could coexist, but if there is any complication keeping both we should favor the first one.

On a related note, I'm not sure it's a good idea to discard _ENABLED settings (read about discarding them in the SEP), specially if we're keeping the possibility of dynamic configuration. Users may want to disable a component temporary without deleting or commenting out the whole .cfg section.

Some miscellaneous comments in the addons interface:

Related with what I mentioned before, if we loose settings locality, is it required to instantiate the addons so early in the loading process? Could update_settings() and check_configuration() be class methods?
An idea :) There could be more non-required attributes in the zope.interface with general information about the addons, for example, some field with a brief description, a dictionary detailing the settings that the addon reads, maybe a list with some non-required dependencies. This sort of copies some utilities from python packages, I wonder if we can make a specific interface for python packages that grabs these settings (name, version, dependencies, etc.) using setuptools.
I think we should add a field in the addons (maybe not required?) where the devs could set a fixed number denoting the loading order position for that addon. I know that we have REQUIRES/MODIFIES/PROVIDES to try to address that, but I doubt this is going to be enough to define all cases.
A detail, is it ok to enforce having a check_configuration() method in the addons? I understand update_settings() being required, addons' only connection with the crawling engine is by updating the settings, there is no point in not having this method, but enforcing check_configuration() seems a little restrictive. We could check in AddonManager.check_configuration() if the method exists or not and call it in the first case.

jdemaeyer commented 8 years ago

Pheew okay, improving the error handling in Crawler.crawl() took me an awful lot of time and I learned quite a bit about Deferreds and Scrapy's architectural details. I'm still really unhappy, so here's what I have, up for discussion. ;) (Sorry for the wall of text, I'm doing this also to sort out my thoughts)

So I put in the

       if self.engine is not None:
            self.engine.downloader.close()

in the Crawler.crawl() except clause as you suggested, and the test I had worked fine. But of course I wanted to introduce that code The Right Way™ so I wrote another test that checks whether the crawler shuts down gracefully when there's any error during the crawl startup process.

The test is in tests.test_crawl.CrawlTestCase.test_graceful_crawl_error_handling. I mocked ExecutionEngine.start to throw an exception and asserted that Crawler.crawling == False and that the downloader's close method has been called (although I couldn't figure out how to properly wrap the mock around the original method):

    @defer.inlineCallbacks
    def test_graceful_crawl_error_handling(self):
        crawler = get_crawler(SimpleSpider)
        class TestError(Exception):
            pass
        from scrapy.core.downloader import Downloader
        Downloader._orig_close = Downloader.close
        def close_downloader(*args, **kwargs):
            # Ugly work-around since I couldn't get supplying
            # wraps=Downloader.close to the mock to work...
            Downloader._orig_close(crawler.engine.downloader)

        with mock.patch('scrapy.crawler.ExecutionEngine.start') as mock_es, \
             mock.patch.object(Downloader, 'close') as mock_dc:
            mock_es.side_effect = TestError
            mock_dc.side_effect = close_downloader
            with self.assertRaises(TestError):
                yield crawler.crawl()
            self.assertFalse(crawler.crawling)
            self.assertTrue(mock_dc.called)

Now I got a different delayed call remaining after the test: <DelayedCall 0x7f069902cab8 [59.7916400433s] called=0 cancelled=0 LoopingCall<60.0>(LogStats.log, *(<SimpleSpider 'simple' at 0x7f06992bbd90>,), **{})(). I figured (after a while) that I needed to close the spider through the engine. However, as ExecutionEngine.close_spider() also closes the downloader, I need a case separation for the graceful shutdown. So here's what it looks like now:

        try:
            self.spider = self._create_spider(*args, **kwargs)
            self.engine = self._create_engine()
            self.addons.check_configuration(self)
            start_requests = iter(self.spider.start_requests())
            yield self.engine.open_spider(self.spider, start_requests)
            yield defer.maybeDeferred(self.engine.start)
        except Exception:
            self.crawling = False
            if self.engine is not None:
                if self.engine.spider is not None:
                    self.engine.close_spider(self.spider)
                else:
                    self.engine.downloader.close()
            raise

Both tests pass with this, but I think there are some issues . First, the except clause shouldn't have to figure out in which line the error occured. This is probably better (exception handling) style but quite ugly:

        try:
            self.spider = self._create_spider(*args, **kwargs)
            self.engine = self._create_engine()
            try:
                self.addons.check_configuration(self)
                start_requests = iter(self.spider.start_requests())
                yield self.engine.open_spider(self.spider, start_requests)
            except Exception:
                self.engine.downloader.close()
                raise
            try:
                yield defer.maybeDeferred(self.engine.start)
            except Exception:
                self.engine.close_spider(self.spider)
                raise
        except Exception:
            self.crawling = False
            raise

Maybe ExecutionEngine could get a new close() method that does the logic and decides whether it needs to stop the engine, close the spider, or close the downloader (or the stop() method could be updated)? Then, there's the test itself. It should test behaviour and not implementation: I don't really care whether Downloader.close() was called, I just want to know whether the reactor is clean. But the downloader doesn't have a status attribute or something. I could check whether the Downloader._slot_gc_loop was stopped, but again that would check implementation not behaviour.

Maybe it's best to remove the Downloader.close() mock and rely on twisted.trial to check whether the reactor is clean after the error shutdown?

jdemaeyer commented 8 years ago

Last thing about the test: I'd refrain to use self.assertFailure() if possible, it's something present in trial and not in the builtin unittest that can be omitted. This should work too:

Hm, why do we subclass trial's TestCase but then refrain from using its methods? I have (temporarily?) switched back to assertFailure because the twisted version in precise does not support using assertRaises as a context manager, and replacing trials's TestCase with the built-in one lets the tests pass but warns me that LogCapture instances are not uninstalled by shutdown (and I wasn't sure how bad that is ;))

jdemaeyer commented 8 years ago

Hey everyone, thanks again for the feedback (especially for the full review @curita :)), I really appreciate it. Here's my two cents on some of the design issues (once again, caution, wall of text):

Add-on configuration entry point

After I had to discuss the add-on management in a blog post a couple of days ago, I have backpedaled a little on my preference. scrapy.cfg is a place for those settings that are not necessarily bound to any particular project (if I have multiple projects I probably still want to upload them to the same scrapyd server). Add-on configuration is (most commonly) bound to a project, much like other crawling settings, and I agree that it should therefore live inside the project. If we don't want to introduce a new settings file (which I think would be a bad idea), that leaves us with the project settings module.

I see two possible syntaxes for this:

Each add-on has its own variable, like it is implemented right now, and is not exposed into the global settings:
```
addon_MYPIPELINE = dict(
   _name = 'path.to.mypipeline',
   user = 'blah',
   ...)
addon_MYEXT = dict(...)
```

There is an ADDONS dictionary setting that lives among all other global settings:

ADDONS = {
   'path.to.mypipeline': {
       'user': 'blah',
       ...
   },
   'path.to.myext': {...},
}

# or

ADDONS = {
   'path.to.mypipeline': dict(
       user = 'blah',
       ...
   ),
   'path.to.myext': dict(...),
}

The first approach does not mix settings and add-ons, following the 'add-ons are one level above settings' philosophy I had in mind earlier. However the syntax is not very user-friendly, particularly having to use the _name key. The second approach puts add-ons onto the same level as the settings they are intended to fiddle with. I don't see an immediate drawback of this, but it strucks me as odd design. I also find the nested dictionaries, like these used for logging.config.dictConfig(), a little tedious. The user experience would be much more cumbersome than what we had initially hoped for. The big big plus is that it would integrate really well with custom Spider settings.

Now the user might want to enable an add-on system-wide, and I think we could keep the support for scrapy.cfg add-on configuration for this use case only. However, the docs should emphasise that this configuration does not belong to a project, and is therefore not deployed.

Add-ons without a project

I agree that there should be a way to enable and configure add-ons for stand-alone spiders. That would be no problem at all when using syntax 2 from above, so I guess that is what it will boil down to. We could keep syntax 1 if we introduced some new mechanism to enable/configure add-ons from within a spider, but that seems unnecessarily confusing. And frankly, I think syntax 1 is just as ugly, if not a little more, as syntax 2. ;D

Local configuration

First I want to point out that the (local) add-on configuration is not inaccessible for other add-ons or components. All add-on configurations are available to all components via the crawler.addons.configs dictionary. They just no longer clog up the global settings namespace. That being said, for backwards compatibility we should definitely keep the global settings for all builtin components. The idea I have for the builtin add-ons right now is that they simply take their configuration and expose it into the global settings namespace, much like in scrapy.addons.Addon.update_settings(). The components themselves won't be touched at all. This would allow users to configure built-in add-ons/component through both the new (add-on config) and old (settings module) way, and keep overwriting component configuration ad-hoc via Spider settings or the command line possible.

Add-on callback placement

Related with what I mentioned before, if we loose settings locality, is it required to instantiate the addons so early in the loading process? Could update_settings() and check_configuration() be class methods?

I'm not quite sure what you mean, aren't the callbacks already class methods in the common use case (i.e. when the add-on interface is provided by a class instance, such as in the scrapy.addons.Addon class)? I have used the call to Spider.update_settings() as orientation where the add-on callbacks should be placed. An add-on might want to replace the stats class, so I guess one line below where it is right now is where we should call them at latest. Then again, I just realised there's already some settings that cannot be changed in add-ons right now, like the spider loader class. I'm curious what you had in mind?

Add-on interface

+1 for dropping the requirement of check_configuration(), it is indeed very conceivable that many add-ons won't want to use this callback.

I like the additional attributes for further describing the add-on, and reusing setuptools/distutils/packaging etc., in particular for replacing the clumpy version check function I coded (I wasn't finished, I swear ;), but of course I could've figured that there's got to be a much better implementation ready out there).

As to the additional field for load order, i'm not really sure if that really helps us. Opposed to the current implementation of component orders, the user would have no (easy) influence on the add-on load order. That means there needs to be some kind of communication between devs, or a guide with recommended load order numbers for different use cases. While that might be doable, it seems to me that this will become hard to maintain if full-blown dependency management is introduced at some point.

Add-ons configuring other add-ons

I'm just keeping this in here as a reminder to myself to think about it. As I outlined earlier, I think as soon as the first update_settings() callback is touched, depending on whether we want to introduce the load order attribute, add-on configuration should be fixed. This only leaves the option of introducing a third (optional) callback, configure_addons(), that is called before update_settings().

jdemaeyer commented 8 years ago

7eb837d is a rough version of my idea for the builtin add-ons, and it'd be cool to know if this is in line with the maintainer's ideas before I start writing tests and documentation.

In principal, the add-ons don't do much. All they do is act as a gateway to the 'traditional' settings. Neither the components nor Scrapy's default settings have been touched, so configuring components via the old way still works as expected. Additionally, it allows changing the settings via the add-on mechanism. That is, if I want to enable and the configure the http cache middleware, instead of adding this to my settings.py:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 60
HTTPCACHE_IGNORE_HTTP_CODES = [500, 503]

I can now alternatively use this in my scrapy.cfg (or the equivalent in the ADDONS setting as outlined in the comments above):

[addon:httpcache]
expiration_secs = 60
ignore_http_codes = 500,503
; enabled is already implicitly set through the existence of this section

I think this is close to the idea of the original SEP-021, with the addition that each add-on has its own section.

If there ever comes a new milestone release where backwards-incompatible cleanup is encouraged, the add-on configuration could be removed from the global settings and live within the add-ons/components instead, making them fully independent of any Scrapy/settings ecosystem. Components with interdependencies could still expose some settings into the global settings namespace, or access other add-ons settings through the add-on manager. Or, as they are interdependent anyways, they could be configured through the same add-on.

curita commented 8 years ago

Sorry I took so long to reply, I'm still unsure on the alternatives for the add-ons configuration in settings.py, but I'll comment some thoughts I have about this matter. Most of them are things you already said but I wanted to write them to get my head around this.

Add-on configuration entry point I agree that it seems an odd design decision having the add-ons config in the settings file, but it definitely simplifies things. Out of both presented options, the ADDONS variable seems the more flexible one. I'm not crazy about it but I don't mind configuring add-ons this way in settings.py, django uses this approach with multiple nested dicts a lot.

What I'm unsure of is that I think we're overcomplicating how current components are configured. We would be changing these:

scrapy crawl myspider -s HTTPCACHE_EXPIRATION_SECS=60

class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        HTTPCACHE_EXPIRATION_SECS=60,
    }

to these:

scrapy crawl myspider -s ADDONS={'httcache':{'expiration_secs': 60}}

class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'ADDONS': {
            'httpcache': {
                'expiration_secs: 60,
            }
        }
    }

and I think that's sort of an usability step back.

I wouldn't mind doing this in external components that I probably won't be configuring too often, but for builtin components I'm not so sure, this could become too tedious too soon. I know we're still supporting the current global settings, but if we're going to favor a single config entry point that one should be as carefully considered as possible.

For this reason I started to prefer your first design option since it looses the furthermost enclosing dictionary, but a) the add-ons configurations should be loaded with the other settings so they can be configured using commandline and custom_settings, though I think adding them as ADDONS_* isn't much of a problem besides cluttering a little the settings, and b) the ADDONS_ prefix isn't really much of a difference from the other approach, it's a workaround for defining a namespace.

So, I dislike a little having to explicitly tell you're configuring an add-on if that can be deduced some other way. A third option could be to copy some other functionality from django here (I wonder how many times that has been said in scrapy's history :smile:), I'm thinking something like:

INSTALLED_ADDONS = (
    'scrapy.addons.builtins.depth',  # or 'depth' maybe
    'scrapy.addons.builtins.httperror',
    ... # all builtins, this could be added to the settings.py template
    'path.to.mypipeline',
)

MYPIPELINE = { # uppercased add-on name
    'user': 'blah',
    ...
)

Not sure if possible though, but I kind of like copying the INSTALLED_APPS django setting, and the separated add-on config looks a lot like current settings:

HTTPCACHE = {
    'enabled': True,
    'expiration_secs': 60,
    'ignore_http_codes': [500, 503],
}

Add-on callback placement Don't mind the comment I made before, forgot how zope.interface.Interface worked and thought it was limiting update_settings and check_configuration to be only instance methods. Then I thought about having the possibility to call those methods without instantiating the class (if those are classmethods) but I don't see any advantage over instantiating it and calling them later. I like the current callback placement.

Builtin add-ons I really like the overall implementation minus a detail: I pictured adding components to their related component-dict settings slightly differently. I was thinking add-ons managing a single component could have a type and order fields in the interface (maybe order can be a config too) that could be used for automatically inserting themselves (or their path) to the component-dict setting defined in type.

I realize these builtin add-ons are sort of encapsulating the actual components, they have to add the components' paths to the settings and not themselves, but maybe the inserted path/obj could be parametrized for this particular case.

curita commented 8 years ago

Forgot to comment the error handling in Crawler.crawl() :sweat:

I don't have a strong preference here, I don't think your last implementation dealing with failures in every deferred looks that ugly. Maybe all those try/excepts could be flattened by handling callbacks manually (i.e. without inlineCallbacks), adding the corresponding errbacks for exception handling to the deferreds, but I think your version is more readable. I like either updating the engine.stop() method or adding a new engine.close() too.

For monitoring pending calls, I like relying on trial to check whether the reactor is clean, seems simpler and I think I read this was a good practice in the twisted testing guidelines. trial probably calls reactor.getDelayedCalls() for this, maybe we could call it explicitly but I think leaving this to the testing suite is probably the best option.

About assertFailure, we have tried to get rid of trial sometime in the past because it wasn't fully python3 compatible. Removing the trial.TestCase inheritance wasn't possible though because the pytest plugin we're using to mimic trial behavior doesn't work on unittest.TestCase classes. But seeing how we're already using assertFailure in other tests, and how we probably won't be changing that inheritance now I agree there isn't any problem using it.

jdemaeyer commented 8 years ago

A few small updates/thoughts:

I opted for a new ExecutionEngine.close() method instead of updating the stop() method. The reason is that I didn't want to remove the assert statement made in ExecutionEngine.stop() (after all the whole idea was to clean up even if the engine wasn't running yet). Code: close() method and usage in Crawler.crawl() I've also cleaned up the graceful shutdown test.

I've enhanced base Addon class: It is now much easier to configure single components through the component, component_type, and component_key attributes and the export_component method. Thank you for the idea @curita, does that look somewhat like what you had in mind? I've also updated the built-in add-ons to make use of that new feature, and they look much tidier now.

@curita makes a very valid point with how the ADDONS dictionary is a usability stepback. I like the Django-like INSTALLED_ADDONS approach and have added a corresponding method and test alongside the old methods. A few thoughts:

A stumbling block with this approach is that it's not possible to append, not overwrite, to the INSTALLED_ADDONS setting from the command line; not quite sure how to overcome that...
Similarly, we somehow need to communicate to users that they should use append in their Spider's update_settings
A way out of this could be to allow using both ADDONS and INSTALLED_ADDONS, however that would result in three entry points to the add-on framework. Not very streamlined :/

@kmike mentioned in one of his first posts if it would be possible that Spiders also implement the add-on interface and have their methods called through the same code path. I very much like that idea but so far couldn't wrap my head around a couple of questions: Should we support config for spiders and what would it be? Should the Spider be added to the add-on manager? Or should the add-on manager update_settings() etc. method have an optional keyword argument where the Spider is passed to call its callbacks?

Now that I've coded around the add-on manager quite a bit, I think @kmike was indeed right in that it's not very useful to save the used_keys, I think I'll drop them soon.

I have refactored the check_dependency_clashes() method so that it now uses pkg_resources. But I wonder if we can make even more use of that package?

Last, I realise that this PR is getting very large (+2,000 lines, phew) and therefore hard to review. However I don't see how it could be split up into smaller chunks. Any ideas how I could make it more clearly laid out?

curita commented 8 years ago

I opted for a new ExecutionEngine.close() method instead of updating the stop() method. [...]

Love how clean that solution resulted, if you want you can open a separate pull request with it and its test and we can merge it right away.

I've enhanced base Addon class: It is now much easier to configure single components through the component, component_type, and component_key attributes and the export_component method. Thank you for the idea @curita, does that look somewhat like what you had in mind?

That was pretty much what I had in mind :) Not sure if I'd have added the abbreviated versions of the component_types though, they seem rather cryptic for the sake of saving keystrokes, I'd prefer users to stick with the fullnames. Maybe I'd have added the component order as a class attr to match both component and component_key placements (in addition of being a config option), but this isn't necessary, and it could be taken the wrong way by thinking the order can't be changed.

A stumbling block with this approach [INSTALLED_ADDONS] is that it's not possible to append, not overwrite, to the INSTALLED_ADDONS setting from the command line; not quite sure how to overcome that...

That's a prevailing problem with any list valued setting :( I think as long as we can enable or disable add-ons using their configs (not sure if this is true for all add-ons though, maybe we could support a general enabled config in AddonManager?) it isn't a big drawback to have a fixed INSTALLED_ADDONS in settings.py listing all add-ons used across the spiders in a project. This is far from ideal (dynamically adding/removing add-ons to the list will be hard) but I think it's a reasonable compromise.

@kmike mentioned in one of his first posts if it would be possible that Spiders also implement the add-on interface and have their methods called through the same code path.

I don't recall right now any constrain on spiders being add-ons (hm, wait, calling the spider update_settings() at the same time as the other add-ons update_settings() couldn't conflict with the add-ons config? could a spider still disable another add-on this way?), but if there aren't any issues (f8cdf1a?) let's implement it :+1:

Should we support config for spiders and what would it be?

If possible, I'd like to see spider arguments mapped into the config. A neat side effect of this would be that spider arguments could be configured like any other regular setting.

Should the Spider be added to the add-on manager? Or should the add-on manager update_settings() etc. method have an optional keyword argument where the Spider is passed to call its callbacks?

The second option is the one you implemented in f8cdf1a right? I like it, I think this distinction should be helpful at some point.

I have refactored the check_dependency_clashes() method so that it now uses pkg_resources. But I wonder if we can make even more use of that package?

Not sure myself, I thought we could have a custom Addon class for add-ons defined inside packages and get version and requires from that package using pkg_resources but I'm not sure if that's even something that should be done with it.

Last, I realise that this PR is getting very large (+2,000 lines, phew) and therefore hard to review. However I don't see how it could be split up into smaller chunks. Any ideas how I could make it more clearly laid out?

Cleaning up the commit history should be a must to get the pr fully reviewed once it's done, but I don't see how its readability could be improved any further. Your code is really well documented so don't mind its extension too much :)

jdemaeyer commented 8 years ago

If possible, I'd like to see spider arguments mapped into the config. A neat side effect of this would be that spider arguments could be configured like any other regular setting.

(Comment moved to #1442)

jdemaeyer commented 8 years ago

I think as long as we can enable or disable add-ons using their configs (not sure if this is true for all add-ons though, maybe we could support a general enabled config in AddonManager?)

calling the spider update_settings() at the same time as the other add-ons update_settings() couldn't conflict with the add-ons config? could a spider still disable another add-on this way?

I added the possibility to disable any add-on by setting its config _enabled setting to False, that should allow disabling add-ons both from the Spider's update_addons() method as well as from the command line.

Once again, since there is currently no guarantee in which order add-on callbacks are called, setting an add-ons config['_enabled'] = False from a second add-on may not prevent the first add-ons update_addons() method from being called. However, it just struck me that, should we settle for the INSTALLED_ADDONS list setting, the user now has the possibility to order add-ons, so maybe I could replace the AddonManager._addons dictionary with an ordered dictionary (also, INSTALLED_ADDONS could be a dictionary-like {'addonpath': order} setting instead of a list, just like the component settings. That would also resolve the "not easy to append instead of overwrite" problem).

Note to self so I don't forget to remove some code later: 2039892 made the _source config superfluos

codecov-io commented 8 years ago

Current coverage is `83.42%`

Merging #1272 into master will increase coverage by +0.69% as of 48b077b

Powered by Codecov. Updated on successful CI builds.

jdemaeyer commented 8 years ago

hm, wait, calling the spider update_settings() at the same time as the other add-ons update_settings() couldn't conflict with the add-ons config? could a spider still disable another add-on this way?)

(Comment moved to #1442)

jdemaeyer commented 8 years ago

Rebased onto #1149 and squashed the earlier commits (up until a0816fa) into groups to ease reviewing :)

jdemaeyer commented 8 years ago

Ready for review! I know it's huge, sorry :)

jdemaeyer commented 8 years ago

Rebased.

While this is a 'ready' implementation, there are some design questions that should be reconsidered.

Bold alternatives mark design changes I have implemented in later commits.

Major:

What	Current Implementation	Alternatives
Add-on configuration entry point	Preferredly, add-ons are enabled through a tuple setting, `INSTALLED_ADDONS` (inspired by Django), and configured through an `ADDONNAME` setting. Additionally (but not preferred), add-ons can also be enabled and configured with one section per add-on in `scrapy.cfg`.	Make `INSTALLED_ADDONS` (maybe just `ADDONS`?) a dictionary, so it is consistent with the middleware/extension settings Make configuration through `scrapy.cfg` the preferred way but keep configuration through settings to allow using add-ons outside of projects and ad-hoc configuration in the command line (emphasises the "add-ons are above settings" idea) Completely drop support for `scrapy.cfg` to avoid splitting up the configuration entry point (somewhat opposes the "supersimple to configure for users" goal) Drop support for configuration through settings in favour of `scrapy.cfg` (emphasises the 'add-ons are above settings' idea but makes using add-ons outside of projects, and configuring add-ons from the command line, impossible)

Inbetween:

What	Current Implementation	Alternatives
Add-ons enabling/configuring other add-ons	In a `update_addons()` callback (and only there), add-ons can load/enable and configure other add-ons. They cannot use the `INSTALLED_ADDONS` setting but must go through the add-on manager methods.	Drop support for add-ons configuring other add-ons (renders creating "umbrella add-ons" impossible, and makes add-ons slightly less user-friendly because add-ons cannot automatically enable required other add-ons)
Callback signatures	The callback signature are quite restrictive to avoid add-on authors misusing them, i.e. the `update_addons()` callback only receives the add-on manager but not the settings, the `update_settings()` callback gets only the settings but not the crawler or the add-on manager.	Hand over the crawler to all callbacks to make them more versatile and rely on developers to know what they're doing
Add-ons for builtin extensions/middlewares	For every builtin extension/middleware, there is a corresponding add-on. This add-on simply proxies the configuration into the global settings. Users can now do most of the configuration in `scrapy.cfg` through these proxy add-ons if they wish	Drop built-in addons to avoid creating a new configuration entry point for the built-in components
Dependency management	Add-ons can provide information about their dependencies through `provides`, `modifies` and `requires` attributes. The `check_dependency_clashes()` function checks these for clashes with the help of `pkg_resources`.	Make sure that these really are the three attributes we want to use Outsource dependency management into a new PR to allow dedicated development/discussion there while moving forward with the add-on framework

Minor:

What	Current Implementation	Alternatives
Add-on manager name	`AddonManager`	I totally see your point @kmike, I just can't think of a better name ;) the add-on manager is very multi-purpose (enable addons, find addons, provide configuration entry point, check dependencies, call callbacks, ...). Maybe it is too monolithic?
Finding add-ons	When passing `httpcache` to the `AddonManager.get_addon()` method, it will first look for a module/object `httpcache` on the Python path, then for `projectname.addons.httpcache`, then for `scrapy.addons.httpcache`. This allows slightly more user-friendly addressing of add-ons.	Drop looking for `projectname.addons.` because the project name can not always be detected reliably, possibly leading to inconsistent behaviour Drop looking for `projectname.addons.` as well as for `scrapy.addons.`, forcing the user to always write out the full path, consistent with the middlewares and extensions
Interface declaration	The `scrapy.interfaces.IAddon` zope interface only requires the existence of a `name` and `version` attribute.	Find out how to enforce the signature of a method without enforcing its existence (add-ons should be able to choose to implement only one or two callbacks)

jdemaeyer commented 8 years ago

My humble opinion on these:

Since the summer, I've changed my mind about a few things quite a bit. Given that Scrapy is a framework rather than a do-it-all product with a shiny GUI, I think the focus really should lie on consistency and versatility instead of abandoning these for user-friendliness.

I reeeaally love the ease and syntax of managing add-ons from scrapy.cfg. But I fear that allowing two configuration points will bite us in the ass at some point. Right now I tend toward completely dropping the scrapy.cfg support (after all, how many users won't be able to handle the simple Python syntax in settings.py)?

Somewhat as a consequence, I don't really see a reason to keep the built-in add-ons without scrapy.cfg support. While the cfg syntax would indeed be much more beautiful, within settings.py there really isn't much advantage of this:

HTTPCACHE = {
    'dir': 'blah',
    'expiration_secs': 99,
}

over this:

HTTPCACHE_DIR = 'blah'
HTTPCACHE_EXPIRATION_SECS = 99

With no built-in add-ons, there also isn't much reason to keep the "prefix scrapy.addons." logic in get_addon() either.

For the sake of versality, I think all callbacks should receive the crawler object (or maybe (crawler, settings, addons) as shortcuts for developers), and that the "add-ons configuring other add-ons" logic should stay in, with a fat note to developers that they cannot use the INSTALLED_ADDONS settings but must use crawler.addons.add().

For consistency with middlewares/extensions, I think it makes sense to change the INSTALLED_ADDONS tuple into a ADDONS dictionary, with the order existent but probably not really relevant (as with EXTENSIONS). This should also make it easier to disable add-ons from the command line.

jdemaeyer commented 8 years ago

Just making sure I haven't deviated too much from the original idea while working on this (yay for tunnel vision):

The main goals of add-ons are:

As far as possible, free the users from having to configure Scrapy's internals to enable specific functionality. Optimally, to enable a desired functionality extension, they only have to touch the ADDONS setting (and possibly one other setting per add-on where configuration for that add-on lives), and no longer need to touch DOWNLOADER_MIDDLEWARES and friends.
To make this possible, add-ons provide a mechanism for developers to take care of Scrapy's internal configuration, i.e. a place to bundle configuration requirements and checks.

Add-ons could then be defined as follows:

Add-ons are pluggable configuration units. An add-on contains a set of configuration settings and checks. All Add-ons are enabled at a single entry point.

Everyone d'accord? :)

kmike commented 8 years ago

I'm not super-familiar with the original idea, but this description sounds perfect, +1 to have an implementation in Scrapy.

kmike commented 8 years ago

A nice table, makes it much easier to understand the PR :+1:

Add-on configuration entry point - I think we should go API-first; I don't see a point of having two files with options. INI syntax is a tiny bit nicer, but that's all - the syntax doesn't transfer to Spider.custom_settings or Crawler/CrawlerRunner/CrawlerProcess settings, and it is not clear how to define and access your own settings in scrapy.cfg. So my vote is to drop addons support in scrapy.cfg - as @jdemaeyer said, Scrapy is a framework, not an application. This way we'll have one less concept to learn for users - there are already too many of them in Scrapy.
Add-ons enabling/configuring other add-ons - an option for addons to configure other addons looks nice, +1 to keep it.
Callback signatures - I like that addons don't know about a crawler until check_configuration is called. It makes it possible to build an almost final Settings object without a Crawler (?). Are there use cases for passing a Crawler earlier?
Add-ons for builtin extensions/middlewares - the implementation doesn't look DRY - priorities and default options are currently duplicated (?). I think it makes sense to convert extensions to addons only if they have a use of addon features. Maybe short addon names is enough to justify creating addons for built-in components though. But it should be DRY.
Dependency management - it may be nice to have, but my vote is to do it later, in another PR. Even better if this feature would be justified by real addons, i.e. we may release without it, then check what's going on in users' check_configuration methods (?) and then add helpers for common cases.
Add-on manager name - What do you think about the following?
1. rename the class to Addons - it represents a set of addons, not something wich just manages a set of externally maintained addons;
2. rename load_dict to from_dict and make it a classmethod (an alternative constructor);
3. rename load_settings to from_settings and make it a classmethod (an alternative constructor);
4. rename load_cfg to from_cfg and make it a classmethod (one more alternative constructor)
5. make it possible to construct a merged Addons instance from other Addons instances, to allow populating addons from multiple sources.
(2-5) is very minor, to decrease usage of mutable state.
Finding add-ons - My vote is to support 1. setuptools entrypoints and 2. python paths. 3. nothing more - no magic prefixes, no local/relative file paths. myproject.addons.Foo is better than Foo because it clearly shows that addon is a custom one defined in this project. With setuptools entrypoints there is no need for scrapy.addons, and pypi packages can provide shortened addon names. But maybe we don't even need entry points, just Python paths can be fine.
Interface declaration - sounds fine; I don't care about formal interfaces much though.
+1 to use ADDONS name instead of INSTALLED_ADDONS. Regarding priorities - hm, why don't they matter?

jdemaeyer commented 8 years ago

Callback signatures - I like that addons don't know about a crawler until check_configuration is called. It makes it possible to build an almost final Settings object without a Crawler (?). Are there use cases for passing a Crawler earlier?

There might be some edge cases, like needing access to the signals really early, although I don't see a concrete example right now. Another edge case might be monkeypatching the crawler methods by overwriting them... Passing the add-on manager to update_settings() might be useful when the settings depend on other add-ons' configurations.

I don't think these outweigh the pros of being able to construct an almost final (why almost?) settings object without a crawler though.

Add-ons for builtin extensions/middlewares - the implementation doesn't look DRY - priorities and default options are currently duplicated (?). I think it makes sense to convert extensions to addons only if they have a use of addon features. Maybe short addon names is enough to justify creating addons for built-in components though. But it should be DRY.

The default options still live in default_settings.py, but the priorities are indeed duplicated. That was a preparation for being able to replace the XY_BASE settings (or the defaults in XY depending on how we proceed with #1149) with just a default ADDONS_BASE (or ADDONS) setting, but for now it should probably be removed (we can still do that later since the internal structure of the built-in add-ons is not public API).

Upon further thought I think the built-in add-ons should stay. After all if we promote a single configuration entry point for "functionality enhancements" by using ADDONS over DOWNLOADER_MIDDLEWARES and similar, we should also promote using

ADDONS = {
    # ...
    'httpcache': 0,
}

over

HTTPCACHE_ENABLED = True

Dependency management - it may be nice to have, but my vote is to do it later, in another PR.

Add-on manager name - What do you think about the following? rename the class to Addons - it represents a set of addons, not something wich just manages a set of externally maintained addons;

I like that. Maybe AddonSet? And load_dict and similar could then be renamed to add_from_dict, that would make it look more like Python's built-in set. For ii-v, I don't see the advantage that would offset the hassle of having to merge different Addons instances.

Finding add-ons - My vote is to support 1. setuptools entrypoints and 2. python paths

+1 for supporting entry points. The scrapy.addons prefix is purely for the built-in addons (so users can write httpcache instead of scrapy.addons.httpcache).

+1 to use ADDONS name instead of INSTALLED_ADDONS. Regarding priorities - hm, why don't they matter?

It's not that they don't matter at all, but that they probably don't matter as add-ons probably won't be cross-dependent too much. Just like the priorities of EXTENSIONS often don't matter.

jdemaeyer commented 8 years ago

Rebased onto #1586 and implemented the following design changes:

Replaced INSTALLED_ADDONS tuple with ADDONS dictionary as entry point for add-on configuration
Completely dropped support for configuring add-ons in scrapy.cfg
Dropped the prefix magic from BaseSettings.get_addon(). Enabling the http cache now requires ADDONS = {'scrapy.addons.httpcache': 0} instead of ADDONS = {'httpcache': 0}. This makes it consistent with the other component settings and prevents name shadowing issues
Removed replication of default settings for the builtin add-ons

chekunkov commented 8 years ago

Add-on configuration entry point

Completely dropped support for configuring add-ons in scrapy.cfg

Replaced INSTALLED_ADDONS tuple with ADDONS dictionary as entry point for add-on configuration

hm, -1 if the only reason of this change is "so it is consistent with the middleware/extension settings"

Add-ons enabling/configuring other add-ons

Can you point me to the part of doc which describes how this cross-configuration is going to work?

Callback signatures

generally I like the idea of being more strict, but it should be justified - what could go wrong if we give

The callback signature are quite restrictive to avoid add-on authors misusing them

what is understood by misuse? example?

Add-ons for builtin extensions/middlewares

agree with @kmike https://github.com/scrapy/scrapy/pull/1272#issuecomment-152682302

also one of the current implementation point is irrelevant now as scrapy.cfg support is dropped - "Users can now do most of the configuration in scrapy.cfg through these proxy add-ons if they wish"

Dependency management

+1 to separate PR, agree with @kmike https://github.com/scrapy/scrapy/pull/1272#issuecomment-152682302

Add-on manager name

Honestly don't see a big conceptual difference between AddonsManager and Addons as far as it doesn't change how it works :)

(2-5) is very minor, to decrease usage of mutable state.

load_cfg is already removed, so there are only load_settings and load_dict to decide. load_settings is used once during Crawler.__init__. what's expected usage of load_dict? I see implementation, what is expected usecase?

Finding add-ons

Enabling the http cache now requires ADDONS = {'scrapy.addons.httpcache': 0} instead of ADDONS = {'httpcache': 0}

+1 to remove damn magic

Interface declaration

@jdemaeyer can you explain what's the problem with current implementation?

jdemaeyer commented 8 years ago

Thanks alot for the feedback you guys!

Replaced INSTALLED_ADDONS tuple with ADDONS dictionary as entry point for add-on configuration

hm, -1 if the only reason of this change is "so it is consistent with the middleware/extension settings"

It's basically the same rationale with which ITEM_PIPELINES became a dictionary and no longer a list. I have to admit that I still like the idea of ADDONS being a tuple/list though. As the "update-not-replace" is behaviour has been revoked (#1586), lists/tuples are again not harder to update from custom_settings or the command line than dicts. On the other hand, being able to disable components by setting their dictionary value to None comes in handy at times.

Add-ons enabling/configuring other add-ons

Can you point me to the part of doc which describes how this cross-configuration is going to work?

Here's an example: https://github.com/jdemaeyer/scrapy/blob/enhancement/addons/docs/topics/addons.rst#enabling-and-configuring-add-ons-within-python-code

Callback signatures

generally I like the idea of being more strict, but it should be justified - what could go wrong if we give what is understood by misuse? example?

I don't think there's much that could really "go wrong", it's more an notion of guiding people. E.g. in update_addons(), they should not make decisions based on setting values, as these setting values are still subject to change (e.g. in other add-ons' update_settings()). Not putting the crawler into the callback signature has the advantage pointed out by Mike above (being able to construct a final settings object without needing a crawler).

Add-ons for builtin extensions/middlewares

I've removed the repetition of the default settings. Most of the built-in add-ons only provide a second way to configure the built-in extensions, I think we should drop these.

There are some that will overwrite default settings when activated. E.g. enabling the httpcache add-on will automatically set HTTPCACHE_ENABLED = True. What do you guys think about these? I think they make sense when we want to push users towards enabling extra functionality (built-in or not) through the ADDONS setting.

what's expected usage of load_dict? I see implementation, what is expected usecase?

It's a low-maintenance convenience helper for working with the AddonManager. There's no use in Scrapy itself but I've used it in some of the tests.

Interface declaration

can you explain what's the problem with current implementation?

It's a minor detail really, I just wanted the list to be complete ;) I don't want to enforce that add-ons have an update_settings() method (they might not need one after all). But when they have one, I want to enforce our call signature. However I don't think that's possible with zope.interface right now.

jdemaeyer commented 8 years ago

Rebased onto master, this PR now once again contains no commits from other PRs. (Finally the green checkmark is back :))

pawelmhm commented 7 years ago

this is so huge, is there some way to save pieces of this work by splitting this into more smaller PR-s? e.g. I'm looking forward to changes in load_object, is it ok if I steal it from here (cherry-pick) and create new PR with that? https://github.com/scrapy/scrapy/pull/1272/files#diff-2536f98aac2c474ec1f424c8a943b9d2

coveralls commented 7 years ago

Changes Unknown when pulling a65fc0db7d7d728bc8abbb37eff358870ce7a9b5 on jdemaeyer:enhancement/addons into on scrapy:master.

scrapy / scrapy

Add-ons #1272

Integration into existing start-up process

Outside of crawler

Inside of crawler

Add-on configuration entry point

Add-ons without a project

Local configuration

Add-on callback placement

Add-on interface

Add-ons configuring other add-ons

Current coverage is 83.42%

Add-on configuration entry point

Add-ons enabling/configuring other add-ons

Callback signatures

Add-ons for builtin extensions/middlewares

Dependency management

Add-on manager name

Finding add-ons

Interface declaration

Current coverage is `83.42%`