kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
Apache License 2.0

Design auto-registration of pipelines #1284

Closed antonymilne closed 2 years ago

antonymilne commented 2 years ago

Following a discussion in backlog grooming, the idea of auto-registering pipelines met with general approval so this is a ticket to design how to do it. See for original context and motivation.

The end goal

When I do kedro pipeline create it creates the following structure:

├── a
│   ├──
│   ├──      # exposes __all__ = ["create_pipeline"]
│   ├──
│   └──      # contains def create_pipeline
├── b
│   ├── ...
└── c
    ├── ...

Assuming they're following the above structure, a user should be able to run kedro run --pipeline=a _without needing to edit at all_. kedro run should run all pipelines, i.e. we have __default__ = a + b + c. It should be possible for a user to overwrite these automatic registrations if they want to by editing as they can now.

Ultimately the above structure should result in a that acts like the following (but does not actually have this code):

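from typing import Dict

from kedro.pipeline import Pipeline

from spaceflights.pipelines import a, b, c

def register_pipelines() -> Dict[str, Pipeline]:
    # Use distinct local names so the imported modules a, b, c are not shadowed
    pipeline_a = a.create_pipeline()
    pipeline_b = b.create_pipeline()
    pipeline_c = c.create_pipeline()

    return {
        "__default__": pipeline_a + pipeline_b + pipeline_c,
        "a": pipeline_a,
        "b": pipeline_b,
        "c": pipeline_c,
    }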

Proposed implementation

Something that is very roughly like this:

def get_default_registered_pipelines():
    registered_pipelines = {}
    for pipeline in Path("pipelines").iterdir():
        # Pseudocode: `pipeline` stands for the imported pipeline module, not the bare Path
        if hasattr(pipeline, "create_pipeline"):
            registered_pipelines[pipeline] = pipeline.create_pipeline()
    registered_pipelines["__default__"] = sum(registered_pipelines.values())    # it's cool we can do this now
    return registered_pipelines

def register_pipelines() -> Dict[str, Pipeline]:
    return get_default_registered_pipelines()
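
A minimal, runnable sketch of the discovery step above, assuming the pipelines live in an importable package whose submodules may each expose create_pipeline. The names discover_pipelines and demo_pipelines are hypothetical, and each stand-in "pipeline" is a plain list of node names so the example runs without Kedro installed. It is not the framework's implementation.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path
from typing import Any, Dict


def discover_pipelines(package_name: str) -> Dict[str, Any]:
    """Call create_pipeline() on every submodule of package_name that defines it."""
    package = importlib.import_module(package_name)
    registered: Dict[str, Any] = {}
    for module_info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{module_info.name}")
        if hasattr(module, "create_pipeline"):
            registered[module_info.name] = module.create_pipeline()
    return registered


# Build a throwaway package on disk (hypothetical layout, for illustration only).
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pipelines"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
for name in ("a", "b"):
    sub = pkg / name
    sub.mkdir()
    # Each stand-in pipeline is just a list with one node name.
    (sub / "__init__.py").write_text(
        f"def create_pipeline():\n    return ['{name}_node']\n"
    )
# A module without create_pipeline is imported but not registered.
(pkg / "utils.py").write_text("HELPER = 1\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()
found = discover_pipelines("demo_pipelines")
print(sorted(found))
```

With real Pipeline objects, the "__default__" entry would then be built by summing the discovered values, as in the rough version above.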

Then, if wanted, a user could change the default behaviour like this:

def register_pipelines():
    defaults = get_default_registered_pipelines()
    defaults["a"] = a1.create_pipeline() + a2.create_pipeline()
    my_other_pipeline_definitions = {"d": a.create_pipeline() + b.create_pipeline()}
    return {**defaults, **my_other_pipeline_definitions}

Questions: Where should get_default_registered_pipelines go? The Zen of Kedro says "A sprinkle of magic is better than a spoonful of it", which suggests maybe it goes in itself. But it may be confusing for a user to have this weird-looking code in such a core user-facing file (like seemed to me when I first saw it). So maybe it is better to define it framework-side and have users bring it in with import kedro.pipeline... instead?

Alternative implementations

antonymilne commented 2 years ago

On second thoughts I think it's clear that the get_default_registered_pipelines function should go on the framework-side. This would be consistent with moving the _find_run_command etc. functions to the framework side as discussed in #1423.

One other thing I'm wondering though is a more extreme Part 2 of this where we remove the file altogether. This would be consistent with how we have removed the and files from the default template. A user would still be able to create the file themselves if they want to modify the default behaviour (get_default_registered_pipelines). I think there are two possible models here:

  1. if doesn't exist, just use get_default_registered_pipelines. If it exists and has register_pipelines function then use that instead. This is analogous to the behaviour of
  2. put something in like REGISTER_PIPELINES_FUNCTION which defaults to framework-side get_default_registered_pipelines. A user can then override this as they please with a path to their own custom register_pipelines function which could in theory live anywhere. This is analogous to behaviour of, config loader and others
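
Model 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: load_register_function is a hypothetical helper, the module name is passed in explicitly because the real file location is project-specific, and get_default_registered_pipelines is a placeholder for the framework-side default.

```python
import importlib
from typing import Any, Callable, Dict


def get_default_registered_pipelines() -> Dict[str, Any]:
    # Placeholder for the framework-side default registration.
    return {"__default__": ()}


def load_register_function(module_name: str) -> Callable[[], Dict[str, Any]]:
    """Return the user's register_pipelines if it exists, else the default."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        # The registry module does not exist: fall back to the default.
        return get_default_registered_pipelines
    # The module exists but may not define register_pipelines.
    return getattr(module, "register_pipelines", get_default_registered_pipelines)


fallback = load_register_function("no_such_registry_module_xyz")
print(fallback is get_default_registered_pipelines)
```

One caveat of this approach: ModuleNotFoundError is also raised when the user's module exists but imports something that is missing, so a real implementation would probably need to tell the two cases apart.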

antonymilne commented 2 years ago

Support for this idea in to enable a plugin that does yaml pipeline definitions.

idanov commented 2 years ago

Suggestion number 2 in the previous comment seems most useful, although I would make the default function be the current one in, but you could turn on/off autoregistering the pipelines by changing it to a built-in function for autoregistry. I don't think it is a good idea to remove by default, since it's the entrypoint to the application and most users will look for it, unlike which only advanced users change. So my preferred behaviour would be:

  1. Add a PIPELINE_REGISTRY_FUNCTION, which is <package>.pipeline_registry.register_pipelines by default (i.e. the same as the current behaviour)
  2. Provide a helper function, called autoregister_pipelines, which could be imported and used in to be set as PIPELINE_REGISTRY_FUNCTION, and which will do what your get_default_registered_pipelines is doing and call <package>.pipeline_registry.register_pipelines at the end or...
  3. Alternatively, PIPELINE_REGISTRY_FUNCTION can take an array of functions and will merge their result at the end (with clear overriding order), e.g. people can set it to PIPELINE_REGISTRY_FUNCTION = [ autoregister_pipelines, register_pipelines ]

Number 3 seems very powerful and very simple to implement.
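
To make the "clear overriding order" of number 3 concrete, here is a small sketch. The resolve_pipelines helper is hypothetical, and tuples of node names stand in for Pipeline objects. It assumes that later functions in the list win on key clashes.

```python
from typing import Any, Callable, Dict, List, Union

RegistryFunction = Callable[[], Dict[str, Any]]


def resolve_pipelines(
    setting: Union[RegistryFunction, List[RegistryFunction]],
) -> Dict[str, Any]:
    """Call each registry function in order; later results override earlier ones."""
    functions = setting if isinstance(setting, list) else [setting]
    merged: Dict[str, Any] = {}
    for function in functions:
        merged.update(function())
    return merged


# Stand-ins: tuples of node names instead of Pipeline objects.
def autoregister_pipelines() -> Dict[str, Any]:
    return {"a": ("a1",), "b": ("b1",), "__default__": ("a1", "b1")}


def register_pipelines() -> Dict[str, Any]:
    return {"a": ("a1", "a2")}  # overrides the auto-registered "a"


PIPELINE_REGISTRY_FUNCTION = [autoregister_pipelines, register_pipelines]
pipelines = resolve_pipelines(PIPELINE_REGISTRY_FUNCTION)
print(pipelines["a"])
```

Note that in this sketch "__default__" still reflects the auto-registered "a", because a plain dictionary merge does not recompute it after the override.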

antonymilne commented 2 years ago

Thanks for the comments @idanov. I like your idea 3 a lot.

Building on it, the only thing I wonder is what the default value of PIPELINE_REGISTRY_FUNCTION should be:

  1. PIPELINE_REGISTRY_FUNCTION = register_pipelines (effectively same as now)
  2. PIPELINE_REGISTRY_FUNCTION = autoregister_pipelines (would be breaking change 👎 )
  3. PIPELINE_REGISTRY_FUNCTION = [autoregister_pipelines, register_pipelines] (non-breaking since register_pipelines would overwrite autoregister_pipelines 🎉 )

Following our current model in which a user doesn't need to touch unless they're trying to do something relatively advanced/customised, I would say that ultimately the default value should be the one which is most commonly useful for beginner users. This would be option 2 or 3, since then a beginner user doesn't need to touch or in order to run a simple kedro project (e.g. I could do the whole spaceflights tutorial without needing to touch those files at all).

However, although option 3 is non-breaking, it would be a bit of a departure from current behaviour. So my feeling is probably option 1 is right for now, and we give option 2 and/or 3 as commented-out suggestions in (like we do with TemplatedConfigLoader) etc. Then we can always revisit in the future, depending on user feedback about pipeline autoregistration.

antonymilne commented 2 years ago

On second thoughts, I'm not sure how much I like idea 3... I'm guessing that a common pattern would be:

  1. use autoregister_pipelines to set up pipelines dictionary
  2. customise that dictionary in some way, e.g. to overwrite pipelines["existing_key"] or create a new pipelines["new_key"] that uses already-registered pipelines

On point 2, as in my original example, what I would like to do is something like this:

pipelines["existing_key"] = pipelines["existing_key"].filter(...)
pipelines["new_key"] = pipelines["existing_key_1"] + pipelines["existing_key_2"]

The sequential nature of idea 3 means that this wouldn't be possible unless we let register_pipelines take an input argument of pipelines that it can somehow mutate. I would instead need to call autoregister_pipelines from inside my register_pipelines function to compose the two functions together and then give a single function in PIPELINE_REGISTRY_FUNCTION. So I don't now see how the extra power (and complexity) of idea 3 would actually be helpful in most cases - do you have some particular ideas of when the extra power would be helpful?
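
The composition described above, where register_pipelines calls autoregister_pipelines itself and a single function is given to PIPELINE_REGISTRY_FUNCTION, might look like this. It is a sketch only: tuples of node names stand in for Pipeline objects, a tuple comprehension stands in for .filter(...), and the pipeline names are made up.

```python
from typing import Dict, Tuple

Nodes = Tuple[str, ...]  # stand-in for a Pipeline


def autoregister_pipelines() -> Dict[str, Nodes]:
    # Stand-in for the framework helper that discovers pipelines.
    return {"ingest": ("load", "clean"), "train": ("fit",)}


def register_pipelines() -> Dict[str, Nodes]:
    pipelines = autoregister_pipelines()
    # Overwrite an existing key based on its current value (stand-in for .filter).
    pipelines["ingest"] = tuple(n for n in pipelines["ingest"] if n != "clean")
    # Create a new key from already-registered pipelines.
    pipelines["full"] = pipelines["ingest"] + pipelines["train"]
    return pipelines


# A single function is enough; no list of functions is needed.
PIPELINE_REGISTRY_FUNCTION = register_pipelines
result = PIPELINE_REGISTRY_FUNCTION()
print(result["full"])
```

Because the customisation reads the auto-registered dictionary before changing it, it works without register_pipelines having to accept the dictionary as an input argument.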

antonymilne commented 2 years ago

Notes from technical design on 29 June:


Questions: in the future would we still add the option and/or remove

antonymilne commented 2 years ago

To be implemented in #1664. Following discussion with Ivan, we decided there's no need to add an option for any more.