kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.96k stars 903 forks source link

[KED-3023] Enable plugins to extend starter-aliases default list #1040

Closed Galileo-Galilei closed 2 years ago

Galileo-Galilei commented 2 years ago

Description

Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."

We have at work a few kedro starters templates which enable to create kedro projects with a bunch of internal configuration (pre-commit, gitlab-ci, custom ConfigLoader...). We also have a custom plugin which modifies kedro's internal (especially the CONF_ROOT constant and the CLI).

When we distribute our template insisde the organisation, we have to give the entire VCS path to the project to our user, and it is not very convenient. I wish we were able to decalre a new starter-alias in the plugin, and when the plugin is installed and auto-discovered by kedro, users can user:

kedro new --starter=my_starter where my_starter VSC path was declared in the plugin, maybe like this:

_PLUGIN_STARTER_ALIAS: ​{my_starter: path/to/VCS}

Context

Why is this change important to you? How would you use it? How can it benefit other users?

The 2 main benefits I expect are:

Possible Implementation

Add a _PLUGIN_STARTER_ALIAS constant in plugins which will be user to extend the https://github.com/quantumblacklabs/kedro/blob/bce62c824a4786e7cf9cc355b31840879ec851bf/kedro/framework/cli/starters.py#L31-L38 set.

It likely needs a bit more refactoring to make it a dict accepting different _STARTERS_REPO path which is currently a hardcoded path to a single one github repo.

datajoely commented 2 years ago

I like this idea! Let me bring it up with the team

antonymilne commented 2 years ago

I also like this idea but am a little bit unsure about implementation. Thinking out loud here...

As you point out, we need to make it a bit more general than just allowing new starter names to be added, since you will need to accept different _STARTER_REPO paths as well. Since all of the logic to do with different starters is really just working out the right values of directory, template_path and checkout, let's make this as general as possible so that users can specify their own starters however they like rather than just a single custom repo URL. e.g. you could have one starter with alias my_starter-master and another with alias my_starter-develop that points to a different branch or even a different repo.

The default _STARTER_ALIASES would then become something like this. Maybe there's also space for a description or docs that appears when you do kedro starter list?

_STARTER_ALIASES = { 
     "astro-iris": {"template_path": "git+https://github.com/quantumblacklabs/kedro-starters.git", "directory": "astro-iris"}
     "mini-kedro": {"template_path": "git+https://github.com/quantumblacklabs/kedro-starters.git", "directory": "mini-kedro"}
     ...
}

So far so good, but what I don't understand and hopefully @Galileo-Galilei can explain is how the user extends the _STARTERS_ALIASES dictionary through a plugin? Effectively you want to do:

_STARTERS_ALIASES.update({"my_starter": {"template_path": "blahblahblah", directory="blah"}})

... but I don't see how that change propagates through for when you do kedro new. Would you need to overwrite the whole new command on the CLI to take in your modified _STARTERS_ALIASES or is there a nicer way? How are you changing CONF_ROOT from a plugin? Feel like I'm missing something here!

Galileo-Galilei commented 2 years ago

TBH, I haven't thought in details about the exact implementation. Actually I want to do exactly what you describe above (updating the _STARTER_ALIASES dictionary, provded you change the actual implementation to a dictionary).

My naive idea was that you might have a kedro.constants entrypoint: setup(entry_points={"kedro.constants": ["plugin_name = plugin_name.plugin_constants:KEDRO_CONSTANTS"]})

and in a plugin_constants.py:

KEDRO_CONSTANTS={
_STARTER_ALIASES: {"my_starter": {"template_path": "blahblahblah", directory="blah"},
CONF_ROOT=[path/to/conf]
}

Which would enable you to access these constants when loading the entrypoints and override some of the constants.

Honestly, this feels quite hardcoded and not very flexible. Maybe dynaconf has the possibility to override settings.py (in a way similar to kedro environments) and we could create a settings.py file in the plugin with these constants? I am not very familiar with the mechanism so I can't quite tell if it is easy to do.

P.S: Specifically for CONF_ROOT, i think that in general it is more flexible to inject it at runtime through the CLI as suggestedin your user research on configuration, something like kedro run --conf-root=/home/ivan/my_new_conf/

antonymilne commented 2 years ago

My feeling is that things like CONF_ROOT that are in settings.py probably don't belong in an entrypoint. As I understand it, we are deliberately moving from the system of registration hooks (like register_catalog) to instead specify library components in settings.py so that you can't modify these things through a pip install. So I'm not sure it would make sense for a plugin to be able to modify variables like CONF_ROOT.

_STARTER_ALIASES is obviously a bit different though, given that it needs to be specified outside a kedro project. For this I don't immediately see another way to achieve it other than via entrypoints like you suggest.

Galileo-Galilei commented 2 years ago

Hi @AntonyMilneQB, just to clarify, I don't think CONF_ROOT belongs to an entrypoint neither (actually, all project-related constants should be declared in settings.py: plugins can already access/supercharge them if necessary juste by importing them if they want).

The hacky entrypoint solution is only necessery for global constants (like _STARTER_ALIASES) which need to be modified before the project is created. To be honest, I don't see any other use case for now apart from updating this starter dictionnary.

P.S: I really like the settings.py refactoring, I think it is both more secure (not executing random code on install) and more modular (basically, it seems that all the AbstractDatasets and the ConfigLoader could be in separated librairies since they are now completely independant from the rest of the framework)

antonymilne commented 2 years ago

@Galileo-Galilei Cool, thanks very much for your comments. Completely agree with what you say, and pleased you like the new settings.py system 👍 Let's see if we can make extendible starters happen.

lorenabalan commented 2 years ago

Linking here some alternative thinking on starter discoverability before we get tied up to a single solution/idea: https://github.com/quantumblacklabs/kedro-starters/pull/40#pullrequestreview-712884148

Galileo-Galilei commented 2 years ago

@lorenabalan I've read the discussion and I think that browsing github to find kedro-starters among community-developped examples may be a good idea, but comes with some caveats to keep in mind: