kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.82k stars 895 forks source link

Design auto-registration of pipelines #1284

Closed antonymilne closed 2 years ago

antonymilne commented 2 years ago

Following a discussion in backlog grooming, the idea of auto-registering pipelines met with general approval so this is a ticket to design how to do it. See https://github.com/kedro-org/kedro/issues/1078 for original context and motivation.

The end goal

When I do kedro pipeline create it creates the following structure:

pipelines
├── __init__.py
├── a
│   ├── README.md
│   ├── __init__.py      # exposes __all__ = ["create_pipeline"]
│   ├── nodes.py
│   └── pipeline.py      # contains def create_pipeline
├── b
│   ├── ...
└── c
    ├── ...

Assuming they're following the above structure, a user should be able to run kedro run --pipeline=a _without needing to edit pipeline_registry.py at all_. kedro run should run all pipelines, i.e. we have __default__ = a + b + c. It should be possible for a user to overwrite these automatic registrations if they want to by editing pipeline_registry.py as they can now.

Ultimately the above structure should result in a pipeline_registry.py that acts like the following (but does not actually have this code):

from spaceflights.pipelines import a, b, c

def register_pipelines(self) -> Dict[str, Pipeline]:
    a = a.create_pipeline()
    b = b.create_pipeline()
    c = c.create_pipeline()

    return {
        "__default__": a + b + c,
        "a": a,
        "b": b,
        "c": c,
    }

Proposed implementation

Something that is very roughly like this:

def get_default_registered_pipelines(): 
    for pipeline in Path("pipelines").iterdir():
        importlib.import_module(pipeline)
        if hasattr(pipeline, "create_pipeline"):
            registered_pipelines[pipeline] = pipeline.create_pipeline()
    registered_pipelines["__default__"] = sum(registered_pipelines.values())    # it's cool we can do this now
    return registered_pipelines

# pipeline_registry.py
def register_pipelines() -> Dict[str, Pipeline]:
    return get_default_registered_pipelines()

Then, if wanted, a user could change the default behaviour like this:

def register_pipelines():
    defaults = get_default_registered_pipelines()
    defaults["a"] = a1.create_pipeline() + a2.create_pipeline()
    my_other_pipeline_definitions = {"d": a.create_pipeline() + b.create_pipeline()}
    return {**defaults, **my_other_pipeline_definitions}

Questions: Where should get_default_registered_pipelines go? The Zen of Kedro says A sprinkle of magic is better than a spoonful of it, which suggests maybe it goes in pipeline_registry.py itself. But maybe it's confusing for a user to have this weird looking code in such a core user-facing file (like hooks.py seemed to me when I first saw it)? So maybe better to have it defined on framework-side and then done as import kedro.pipeline... instead?

Alternative implementations

antonymilne commented 2 years ago

On second thoughts I think it's clear that the get_default_registered_pipelines function should go on the framework-side. This would be consistent with moving the _find_run_command etc. functions to the framework side as discussed in #1423.

One other thing I'm wondering though is a more extreme Part 2 of this where we remove the pipeline_registry.py file altogether. This would be consistent with how we have removed the hooks.py and cli.py files from the default template. A user would still be able to create the file themselves if they want to modify the default behaviour (get_default_registered_pipelines). I think there are two possible models here:

  1. if pipeline_registry.py doesn't exist, just use get_default_registered_pipelines. If it exists and has register_pipelines function then use that instead. This is analogous to the behaviour of cli.py
  2. put something in settings.py like REGISTER_PIPELINES_FUNCTION which defaults to framework-side get_default_registered_pipelines. A user can then override this as they please with a path to their own custom register_pipelines function which could in theory live anywhere. This is analogous to behaviour of hooks.py, config loader and others
antonymilne commented 2 years ago

Support for this idea in https://github.com/kedro-org/kedro/discussions/1436 to enable a plugin that does yaml pipeline definitions.

idanov commented 2 years ago

Suggestion number 2 in the previous comment seems most useful, although I would make the default function be the current one in pipeline_registry.py, but you could turn on/off autoregistering the pipleines by changing it to a built-in function for autoregistry. I don't think it is a good idea to remove pipeline_registry.py by default, since it's the entrypoint to the application and most users will look for it, unlike cli.py which only advanced users change. So my preferred behaviour would be:

  1. Add a PIPELINE_REGISTRY_FUNCTION, which is <package>.pipeline_registry.register_pipelines by default (i.e. the same as the current behaviour)
  2. Provide a helper function, called autoregister_pipelines, which could be imported and used in settings.py to be set as PIPELINE_REGISTRY_FUNCTION, and which will do what your get_default_registered_pipelines is doing and call <package>.pipeline_registry.register_pipelines at the end or...
  3. Alternatively, PIPELINE_REGISTRY_FUNCTION can take an array of functions and will merge their result at the end (with clear overriding order), e.g. people can set it to PIPELINE_REGISTRY_FUNCTION = [ autoregister_pipelines, register_pipelines ]

Number 3 seems very powerful and very simple to implement.

antonymilne commented 2 years ago

Thanks for the comments @idanov. I like your idea 3 a lot.

Building on it, the only thing I wonder is what the default value of PIPELINE_REGISTRY_FUNCTION should be:

  1. PIPELINE_REGISTRY_FUNCTION = register_pipelines (effectively same as now)
  2. PIPELINE_REGISTRY_FUNCTION = autoregister_pipelines (would be breaking change 👎 )
  3. PIPELINE_REGISTRY_FUNCTION = [autoregister_pipelines, register_pipelines] (non-breaking since register_pipelines would overwrite autoregister_pipelines 🎉 )

Following our current model in which a user doesn't need to touch settings.py unless they're trying to do something relatively advanced/customised, I would say that ultimately the default value should be the one which is most commonly useful for beginner users. This would be option 2 or 3, since then a beginner user doesn't need to touch settings.py or pipeline_register.py in order to run a simple kedro project (e.g. I could do the whole spaceflights tutorial without needing to touch those files at all).

However, although option 3 is non-breaking, it would be a bit of a departure from current behaviour. So my feeling is probably option 1 is right for now, and we give option 2 and/or 3 as commented-out suggestions in settings.py (like we do with TemplatedConfigLoader) etc. Then we can always revisit in the future, depending on user feedback about pipeline autoregistration.

antonymilne commented 2 years ago

On second thoughts, I'm not sure how much I like idea 3... I'm guessing that a common pattern would be:

  1. use autoregister_pipelines to setup pipelines dictionary
  2. customise that dictionary in some way, e.g. to overwrite pipelines["existing_key"] or create a new pipelines["new_key"] that uses already-registered pipelines

On point 2, as in my original example, what I would like to do is something like this:

pipelines["existing_key"] = pipelines["existing_key"].filter(...)
pipelines["new_key"] = pipelines["existing_key_1"] + pipelines["existing_key_2"]

The sequential nature of idea 3 means that this wouldn't be possible unless we let register_pipelines take an input argument of pipelines that it can somehow mutate. I would instead need to call autoregister_pipelines from inside my register_pipelines function to compose the two functions together and then give a single function in PIPELINE_REGISTRY_FUNCTION. So I don't now see how the extra power (and complexity) of idea 3 would actually be helpful in most cases - do you have some particular ideas of when the extra power would be helpful?

antonymilne commented 2 years ago

Notes from technical design on 29 June:

Conclusion:

Questions: in the future would we still add the settings.py option and/or remove pipeline_registry.py?

antonymilne commented 2 years ago

To be implemented in #1664. Following discussion with Ivan, we decided there's no need to add an option for settings.py any more.