kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Why does kedro seem to avoid accessing run arguments or context in higher level functions? #4104

Closed MarcelBeining closed 2 days ago

MarcelBeining commented 2 months ago

Description

We use kedro pipelines a lot for our AI projects, and we stumble over this problem so often that it is time to open an issue about it. We regularly pass arguments such as the desired environment as a run argument to kedro run. We also need custom functionality that we implement in settings.py (e.g. custom hooks) and pipeline_registry.py (e.g. custom pipeline combinations). For this functionality we sometimes need extra information, such as the environment we are running in.

There is no simple and robust way to access run arguments in these functions! Possible solutions that have been suggested and tested by us so far:

  1. Parse sys.argv ourselves: that seems rather error-prone if the env is passed some other way (e.g. via KEDRO_ENV).
  2. Get the env info from the session object: that worked until get_current_session() was deprecated in 0.18, and it now seems impossible to access the session object in deeper functions.
  3. Put an "env" entry in the globals of each environment and use it: that works for the catalog, but in the files mentioned above we would need to re-initialize an OmegaConfigLoader, which requires... guess what: defining the env :-D
  4. Use a hook to intercept after_context_created, save the env information from there in a global class/variable, and use it (see the sketch after this list): that seems very hacky and only works for pipeline_registry.py, not for settings.py, since settings.py is evaluated before after_context_created fires.
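
For reference, option 4 looks roughly like the sketch below (a hedged sketch only; EnvCaptureHook and RUN_ENV are our own names, not Kedro API). The hook still has to be registered via HOOKS in settings.py, and the captured value is only usable by code that runs after the context exists, e.g. register_pipelines(), not by settings.py itself.

```python
# hooks.py -- minimal sketch of approach 4; EnvCaptureHook and RUN_ENV are
# our own names, not part of Kedro's API.
from kedro.framework.hooks import hook_impl

# Module-level holder, filled once the Kedro context has been created.
RUN_ENV = {"env": None}


class EnvCaptureHook:
    @hook_impl
    def after_context_created(self, context):
        # context.env is the environment resolved for this run
        # (--env, KEDRO_ENV or the project default).
        RUN_ENV["env"] = context.env
```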

The same problem of course arises when trying to access any parameter from parameters.yml in these higher-level files.

Context

This should be important for anyone who extends kedro pipeline functionality beyond its standard use.

Possible Implementation

Simply make it possible to import and access the kedro context or session object (at least in some frozen, read-only state) from anywhere!
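
Purely as an illustration of what is being asked for (nothing below exists in Kedro today; get_current_context is a hypothetical name), usage could look like:

```python
# Hypothetical API -- this does NOT exist in current Kedro; it only sketches
# the kind of frozen, read-only access requested above.
from kedro.framework.context import get_current_context  # hypothetical import


def register_pipelines():
    context = get_current_context()  # read-only view of the current run
    if context.env == "prod":
        ...  # e.g. drop pipelines that must not run in production
```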

DimedS commented 2 months ago

Hi @MarcelBeining. Thanks for raising this! I think it makes a lot of sense. Would you be interested in submitting a pull request for this? If not, the Kedro maintainers could consider adding it.

MarcelBeining commented 2 months ago

Hi @DimedS. I guess there were reasons why the kedro maintainers designed it the way it is now. So rather than trial-and-erroring different implementations until one suits the (to me unknown) design principles, I'd suggest the kedro maintainers add it :-)

noklam commented 2 months ago

Can you explain what arguments you need? Maybe I don't understand the question, but isn't this available in hooks?

https://docs.kedro.org/en/stable/api/kedro.framework.hooks.specs.PipelineSpecs.html#kedro.framework.hooks.specs.PipelineSpecs
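
For reference, the linked hook specs expose the run arguments roughly like this (a minimal sketch; the class name is illustrative):

```python
# Minimal sketch of a hook using the PipelineSpecs linked above.
from kedro.framework.hooks import hook_impl


class RunArgsHook:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # run_params is a dict of the run arguments, including "env",
        # "pipeline_name" and "extra_params".
        print(f"Running with env={run_params['env']}")
```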

MarcelBeining commented 2 months ago

Sure, there are two use cases:

  1. We need parameters from the correct parameters.yml in settings.py to fill configurable email details (sender, recipient, etc.) into an EmailNotifier hook (https://gitlab.com/anacision/kedro-expectations#notification). But in settings.py it is currently not possible to get the correct kedro parameters, as it is not possible to find out which environment argument is used for the current run.
  2. Depending on the environment, some pipelines should not be available (i.e. assembled in pipeline_registry.py) to avoid executing critical code in production. Here we would also need to know in pipeline_registry.py which env argument kedro is run with. This happens even before "before_pipeline_run", so there is no way to get the env argument from there. And even if there were, it would be kind of hacky, as one would have to use a global variable plus an extra hook that fills it (roughly sketched below).
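
To make the "hacky" workaround for case 2 concrete, it would look roughly like this (a sketch only, building on a hook that stores the env in a module-level RUN_ENV holder; my_project and the pipeline modules are hypothetical names):

```python
# pipeline_registry.py -- rough sketch of the global-variable workaround.
from kedro.pipeline import Pipeline

from my_project.hooks import RUN_ENV  # hypothetical holder filled by an after_context_created hook
from my_project.pipelines import critical_admin, training  # hypothetical pipelines


def register_pipelines() -> dict[str, Pipeline]:
    pipelines = {"training": training.create_pipeline()}
    # Only expose the critical pipeline outside of production.
    if RUN_ENV["env"] != "prod":
        pipelines["critical_admin"] = critical_admin.create_pipeline()
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines
```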
merelcht commented 1 month ago

> Sure, there are two use cases:
>
> 1. We need parameters from the correct parameters.yml in settings.py to fill configurable email details (sender, recipient, etc.) into an EmailNotifier hook (https://gitlab.com/anacision/kedro-expectations#notification). But in settings.py it is currently not possible to get the correct kedro parameters, as it is not possible to find out which environment argument is used for the current run.

I don't have a clear-cut answer on how to fix this, but what you're trying to do here does go against Kedro's flow of execution. settings.py is used to instantiate all the components needed for a functioning Kedro project before running it; it's not meant to contain knowledge about runtime variables. The architecture diagram might help illustrate how the components are designed to interact with each other: https://docs.kedro.org/en/stable/extend_kedro/architecture_overview.html

> 2. Depending on the environment, some pipelines should not be available (i.e. assembled in pipeline_registry.py) to avoid executing critical code in production. Here we would also need to know in pipeline_registry.py which env argument kedro is run with. This happens even before "before_pipeline_run", so there is no way to get the env argument from there. And even if there were, it would be kind of hacky, as one would have to use a global variable plus an extra hook that fills it.

For this second case, can't you use namespaces to filter what pipelines should be executed? https://docs.kedro.org/en/stable/nodes_and_pipelines/namespaces.html
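
A minimal sketch of what that could look like (pipeline modules and names are illustrative, not from the issue):

```python
# pipeline_registry.py -- sketch of grouping sensitive nodes under a namespace
# and keeping them out of the default pipeline.
from kedro.pipeline import Pipeline, pipeline

from my_project.pipelines import critical_admin, training  # hypothetical pipelines


def register_pipelines() -> dict[str, Pipeline]:
    training_pipe = training.create_pipeline()
    # Namespacing prefixes node names (and, unless remapped, dataset and
    # parameter names) with "admin.", so these nodes form an addressable group.
    admin_pipe = pipeline(critical_admin.create_pipeline(), namespace="admin")

    return {
        "__default__": training_pipe,  # default run never includes admin nodes
        "training": training_pipe,
        "admin": admin_pipe,  # run explicitly, e.g. kedro run --pipeline=admin
    }
```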

merelcht commented 2 days ago

Closing this due to inactivity. Feel free to re-open this to continue the conversation!