Design auto-registration of pipelines

antonymilne commented 2 years ago

Following a discussion in backlog grooming, the idea of auto-registering pipelines met with general approval so this is a ticket to design how to do it. See https://github.com/kedro-org/kedro/issues/1078 for original context and motivation.

The end goal

When I do kedro pipeline create it creates the following structure:

pipelines
├── __init__.py
├── a
│   ├── README.md
│   ├── __init__.py      # exposes __all__ = ["create_pipeline"]
│   ├── nodes.py
│   └── pipeline.py      # contains def create_pipeline
├── b
│   ├── ...
└── c
    ├── ...

Assuming they're following the above structure, a user should be able to run kedro run --pipeline=a _without needing to edit pipeline_registry.py at all_. kedro run should run all pipelines, i.e. we have __default__ = a + b + c. It should be possible for a user to overwrite these automatic registrations if they want to by editing pipeline_registry.py as they can now.

Ultimately the above structure should result in a pipeline_registry.py that acts like the following (but does not actually have this code):

from spaceflights.pipelines import a, b, c

def register_pipelines(self) -> Dict[str, Pipeline]:
    a = a.create_pipeline()
    b = b.create_pipeline()
    c = c.create_pipeline()

    return {
        "__default__": a + b + c,
        "a": a,
        "b": b,
        "c": c,
    }

Proposed implementation

Something that is very roughly like this:

def get_default_registered_pipelines(): 
    for pipeline in Path("pipelines").iterdir():
        importlib.import_module(pipeline)
        if hasattr(pipeline, "create_pipeline"):
            registered_pipelines[pipeline] = pipeline.create_pipeline()
    registered_pipelines["__default__"] = sum(registered_pipelines.values())    # it's cool we can do this now
    return registered_pipelines

# pipeline_registry.py
def register_pipelines() -> Dict[str, Pipeline]:
    return get_default_registered_pipelines()

Then, if wanted, a user could change the default behaviour like this:

def register_pipelines():
    defaults = get_default_registered_pipelines()
    defaults["a"] = a1.create_pipeline() + a2.create_pipeline()
    my_other_pipeline_definitions = {"d": a.create_pipeline() + b.create_pipeline()}
    return {**defaults, **my_other_pipeline_definitions}

Questions: Where should get_default_registered_pipelines go? The Zen of Kedro says A sprinkle of magic is better than a spoonful of it, which suggests maybe it goes in pipeline_registry.py itself. But maybe it's confusing for a user to have this weird looking code in such a core user-facing file (like hooks.py seemed to me when I first saw it)? So maybe better to have it defined on framework-side and then done as import kedro.pipeline... instead?

Alternative implementations

Something clever in kedro.project._ProjectPipelines that automatically registers pipelines. Sounds a bit too magical to me though - I prefer the explicitness of the above.
Actually edit the code in pipeline_registry.py to rewrite the Python dictionary when you do kedro pipeline create. Sounds totally horrible though.

antonymilne commented 2 years ago

On second thoughts I think it's clear that the get_default_registered_pipelines function should go on the framework-side. This would be consistent with moving the _find_run_command etc. functions to the framework side as discussed in #1423.

One other thing I'm wondering though is a more extreme Part 2 of this where we remove the pipeline_registry.py file altogether. This would be consistent with how we have removed the hooks.py and cli.py files from the default template. A user would still be able to create the file themselves if they want to modify the default behaviour (get_default_registered_pipelines). I think there are two possible models here:

if pipeline_registry.py doesn't exist, just use get_default_registered_pipelines. If it exists and has register_pipelines function then use that instead. This is analogous to the behaviour of cli.py
put something in settings.py like REGISTER_PIPELINES_FUNCTION which defaults to framework-side get_default_registered_pipelines. A user can then override this as they please with a path to their own custom register_pipelines function which could in theory live anywhere. This is analogous to behaviour of hooks.py, config loader and others

antonymilne commented 2 years ago

Support for this idea in https://github.com/kedro-org/kedro/discussions/1436 to enable a plugin that does yaml pipeline definitions.

idanov commented 2 years ago

Suggestion number 2 in the previous comment seems most useful, although I would make the default function be the current one in pipeline_registry.py, but you could turn on/off autoregistering the pipleines by changing it to a built-in function for autoregistry. I don't think it is a good idea to remove pipeline_registry.py by default, since it's the entrypoint to the application and most users will look for it, unlike cli.py which only advanced users change. So my preferred behaviour would be:

Add a PIPELINE_REGISTRY_FUNCTION, which is <package>.pipeline_registry.register_pipelines by default (i.e. the same as the current behaviour)
Provide a helper function, called autoregister_pipelines, which could be imported and used in settings.py to be set as PIPELINE_REGISTRY_FUNCTION, and which will do what your get_default_registered_pipelines is doing and call <package>.pipeline_registry.register_pipelines at the end or...
Alternatively, PIPELINE_REGISTRY_FUNCTION can take an array of functions and will merge their result at the end (with clear overriding order), e.g. people can set it to PIPELINE_REGISTRY_FUNCTION = [ autoregister_pipelines, register_pipelines ]

Number 3 seems very powerful and very simple to implement.

antonymilne commented 2 years ago

Thanks for the comments @idanov. I like your idea 3 a lot.

Building on it, the only thing I wonder is what the default value of PIPELINE_REGISTRY_FUNCTION should be:

PIPELINE_REGISTRY_FUNCTION = register_pipelines (effectively same as now)
PIPELINE_REGISTRY_FUNCTION = autoregister_pipelines (would be breaking change 👎 )
PIPELINE_REGISTRY_FUNCTION = [autoregister_pipelines, register_pipelines] (non-breaking since register_pipelines would overwrite autoregister_pipelines 🎉 )

Following our current model in which a user doesn't need to touch settings.py unless they're trying to do something relatively advanced/customised, I would say that ultimately the default value should be the one which is most commonly useful for beginner users. This would be option 2 or 3, since then a beginner user doesn't need to touch settings.py or pipeline_register.py in order to run a simple kedro project (e.g. I could do the whole spaceflights tutorial without needing to touch those files at all).

However, although option 3 is non-breaking, it would be a bit of a departure from current behaviour. So my feeling is probably option 1 is right for now, and we give option 2 and/or 3 as commented-out suggestions in settings.py (like we do with TemplatedConfigLoader) etc. Then we can always revisit in the future, depending on user feedback about pipeline autoregistration.

antonymilne commented 2 years ago

On second thoughts, I'm not sure how much I like idea 3... I'm guessing that a common pattern would be:

use autoregister_pipelines to setup pipelines dictionary
customise that dictionary in some way, e.g. to overwrite pipelines["existing_key"] or create a new pipelines["new_key"] that uses already-registered pipelines

On point 2, as in my original example, what I would like to do is something like this:

pipelines["existing_key"] = pipelines["existing_key"].filter(...)
pipelines["new_key"] = pipelines["existing_key_1"] + pipelines["existing_key_2"]

The sequential nature of idea 3 means that this wouldn't be possible unless we let register_pipelines take an input argument of pipelines that it can somehow mutate. I would instead need to call autoregister_pipelines from inside my register_pipelines function to compose the two functions together and then give a single function in PIPELINE_REGISTRY_FUNCTION. So I don't now see how the extra power (and complexity) of idea 3 would actually be helpful in most cases - do you have some particular ideas of when the extra power would be helpful?

antonymilne commented 2 years ago

Notes from technical design on 29 June:

concerns that if we didn't make autoregister_pipelines the default then beginner users wouldn't be aware of its existence and would still have to alter code in settings.py to activate it
this is in tension with setting PIPELINE_REGISTRY_FUNCTION = autoregister_pipelines by default, which would be a breaking change
@noklam questioned how people would be be able to do creative things in their register_pipelines function, e.g. dynamically generate them. I think we should consider whether register_pipelines takes some argument like extra_params (as it used to with hooks) or whatever happens with "metaparameters" as part of config reworking. I have several examples where this would be useful.
@MerelTheisenQB thought removing pipeline_registry.py file might be a good idea

Conclusion:

forget about settings.py for now and just implement this through a framework-side autoregister_pipelines function + editing pipeline_registry.py
hence current behaviour of kedro using pipeline_registry.register_pipeline remains, no new options in settings.py
what changes is that the default project template register_pipeline function will, instead of being empty, import and use autoregister_pipelines
this support my idea of composition above (i.e. call autoregister_pipelines and then modify the dictionary it returns)
new kedro projects will then use autoregister_pipelines by default (modify all starters to use it); existing kedro projects will remain unchanged but people can modify their pipeline_registry to use autoregister_pipelines if they like
this is completely non-breaking and we get the feature in users' hands more quickly
nice and explicit, and neither beginner nor advanced users need to touch settings.py for now

Questions: in the future would we still add the settings.py option and/or remove pipeline_registry.py?

antonymilne commented 2 years ago

To be implemented in #1664. Following discussion with Ivan, we decided there's no need to add an option for settings.py any more.

kedro-org / kedro