Closed antonymilne closed 2 years ago
On second thoughts I think it's clear that the get_default_registered_pipelines
function should go on the framework-side. This would be consistent with moving the _find_run_command
etc. functions to the framework side as discussed in #1423.
One other thing I'm wondering though is a more extreme Part 2 of this where we remove the pipeline_registry.py
file altogether. This would be consistent with how we have removed the hooks.py and cli.py files from the default template. A user would still be able to create the file themselves if they want to modify the default behaviour (get_default_registered_pipelines
). I think there are two possible models here:
pipeline_registry.py
doesn't exist, just use get_default_registered_pipelines
. If it exists and has register_pipelines
function then use that instead. This is analogous to the behaviour of cli.pyREGISTER_PIPELINES_FUNCTION
which defaults to framework-side get_default_registered_pipelines
. A user can then override this as they please with a path to their own custom register_pipelines
function which could in theory live anywhere. This is analogous to behaviour of hooks.py, config loader and othersSupport for this idea in https://github.com/kedro-org/kedro/discussions/1436 to enable a plugin that does yaml pipeline definitions.
Suggestion number 2 in the previous comment seems most useful, although I would make the default function be the current one in pipeline_registry.py
, but you could turn on/off autoregistering the pipleines by changing it to a built-in function for autoregistry. I don't think it is a good idea to remove pipeline_registry.py
by default, since it's the entrypoint to the application and most users will look for it, unlike cli.py
which only advanced users change. So my preferred behaviour would be:
PIPELINE_REGISTRY_FUNCTION
, which is <package>.pipeline_registry.register_pipelines
by default (i.e. the same as the current behaviour)autoregister_pipelines
, which could be imported and used in settings.py
to be set as PIPELINE_REGISTRY_FUNCTION
, and which will do what your get_default_registered_pipelines
is doing and call <package>.pipeline_registry.register_pipelines
at the end or...PIPELINE_REGISTRY_FUNCTION
can take an array of functions and will merge their result at the end (with clear overriding order), e.g. people can set it to PIPELINE_REGISTRY_FUNCTION = [ autoregister_pipelines, register_pipelines ]
Number 3 seems very powerful and very simple to implement.
Thanks for the comments @idanov. I like your idea 3 a lot.
Building on it, the only thing I wonder is what the default value of PIPELINE_REGISTRY_FUNCTION
should be:
PIPELINE_REGISTRY_FUNCTION = register_pipelines
(effectively same as now) PIPELINE_REGISTRY_FUNCTION = autoregister_pipelines
(would be breaking change 👎 )PIPELINE_REGISTRY_FUNCTION = [autoregister_pipelines, register_pipelines]
(non-breaking since register_pipelines
would overwrite autoregister_pipelines
🎉 )Following our current model in which a user doesn't need to touch settings.py unless they're trying to do something relatively advanced/customised, I would say that ultimately the default value should be the one which is most commonly useful for beginner users. This would be option 2 or 3, since then a beginner user doesn't need to touch settings.py or pipeline_register.py in order to run a simple kedro project (e.g. I could do the whole spaceflights tutorial without needing to touch those files at all).
However, although option 3 is non-breaking, it would be a bit of a departure from current behaviour. So my feeling is probably option 1 is right for now, and we give option 2 and/or 3 as commented-out suggestions in settings.py (like we do with TemplatedConfigLoader
) etc. Then we can always revisit in the future, depending on user feedback about pipeline autoregistration.
On second thoughts, I'm not sure how much I like idea 3... I'm guessing that a common pattern would be:
autoregister_pipelines
to setup pipelines
dictionarypipelines["existing_key"]
or create a new pipelines["new_key"]
that uses already-registered pipelinesOn point 2, as in my original example, what I would like to do is something like this:
pipelines["existing_key"] = pipelines["existing_key"].filter(...)
pipelines["new_key"] = pipelines["existing_key_1"] + pipelines["existing_key_2"]
The sequential nature of idea 3 means that this wouldn't be possible unless we let register_pipelines
take an input argument of pipelines
that it can somehow mutate. I would instead need to call autoregister_pipelines
from inside my register_pipelines
function to compose the two functions together and then give a single function in PIPELINE_REGISTRY_FUNCTION
. So I don't now see how the extra power (and complexity) of idea 3 would actually be helpful in most cases - do you have some particular ideas of when the extra power would be helpful?
Notes from technical design on 29 June:
autoregister_pipelines
the default then beginner users wouldn't be aware of its existence and would still have to alter code in settings.py
to activate itPIPELINE_REGISTRY_FUNCTION = autoregister_pipelines
by default, which would be a breaking changeregister_pipelines
function, e.g. dynamically generate them. I think we should consider whether register_pipelines
takes some argument like extra_params
(as it used to with hooks) or whatever happens with "metaparameters" as part of config reworking. I have several examples where this would be useful.Conclusion:
autoregister_pipelines
function + editing pipeline_registry.pypipeline_registry.register_pipeline
remains, no new options in settings.pyregister_pipeline
function will, instead of being empty, import and use autoregister_pipelines
autoregister_pipelines
and then modify the dictionary it returns)autoregister_pipelines
by default (modify all starters to use it); existing kedro projects will remain unchanged but people can modify their pipeline_registry
to use autoregister_pipelines
if they likeQuestions: in the future would we still add the settings.py option and/or remove pipeline_registry.py?
To be implemented in #1664. Following discussion with Ivan, we decided there's no need to add an option for settings.py any more.
Following a discussion in backlog grooming, the idea of auto-registering pipelines met with general approval so this is a ticket to design how to do it. See https://github.com/kedro-org/kedro/issues/1078 for original context and motivation.
The end goal
When I do
kedro pipeline create
it creates the following structure:Assuming they're following the above structure, a user should be able to run
kedro run --pipeline=a
_without needing to editpipeline_registry.py
at all_.kedro run
should run all pipelines, i.e. we have__default__ = a + b + c
. It should be possible for a user to overwrite these automatic registrations if they want to by editingpipeline_registry.py
as they can now.Ultimately the above structure should result in a
pipeline_registry.py
that acts like the following (but does not actually have this code):Proposed implementation
Something that is very roughly like this:
Then, if wanted, a user could change the default behaviour like this:
Questions: Where should
get_default_registered_pipelines
go? The Zen of Kedro says A sprinkle of magic is better than a spoonful of it, which suggests maybe it goes in pipeline_registry.py itself. But maybe it's confusing for a user to have this weird looking code in such a core user-facing file (likehooks.py
seemed to me when I first saw it)? So maybe better to have it defined on framework-side and then done asimport kedro.pipeline...
instead?Alternative implementations
kedro.project._ProjectPipelines
that automatically registers pipelines. Sounds a bit too magical to me though - I prefer the explicitness of the above.kedro pipeline create
. Sounds totally horrible though.