kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.91k stars 900 forks source link

Remove logic to activate and deactivate `KedroSession` #1431

Closed merelcht closed 2 years ago

merelcht commented 2 years ago

KedroSession is still keeping a global for the currently active session, which causes problems when trying to create a session in a notebook with the Kedro IPython extension and makes it unable to create/run the session. Here's the functions we need to drop: https://github.com/kedro-org/kedro/blob/main/kedro/framework/session/session.py#L34-L47

And here is where they are used: https://github.com/kedro-org/kedro/blob/main/kedro/framework/session/session.py#L274-L287

Need to add a regression test:

antonymilne commented 2 years ago

I think we should take care here to understand exactly why this code exists in the first place and what would be the consequences of making this change, i.e. what will behave differently after removing this code. Presumably it was originally put there for a good reason to handle some particular cases. Maybe such cases are no longer possible now we have one run session, or maybe they're so marginal that we don't care about them. But it would be good to understand the context of why this logic is here currently.

noklam commented 2 years ago

@AntonyMilneQB

I checked the repository the _activate_session() and the global _active_session are only used by ipython.py and the KedroSession's context manager. It was also used by get_current_session() but it no longer exists now.

The activate/deactivate session mechanism doesn't look necessary to me, in fact, if you just do session.run() today it runs perfectly fine. This whole activate/deactivate thing only stops you if you are using context manager.

This will run

# Assume activate session exists but not closed
session.run() # this will run

with KedroSession.create('dummy') as new_session:
    new_session.run() # this will fail due to existence of unclosed session.

However, there is a test_nested_sessions test expecting this to fail. https://github.com/kedro-org/kedro/blob/307783b7f48a6d986d0f9981077f71ebd1775429/tests/framework/session/test_session.py#L521-L532

It was created since Ipython extension exists. [KED-1996] Create an IPython extension for Kedro (#853) ยท kedro-org/kedro@80e2202 ยท GitHub](https://github.com/kedro-org/kedro/commit/80e220242b434af7c42c1c8b19e54675b3091013)

Galileo-Galilei commented 2 years ago

I beg you to not make this change (i.e. not merge #1434) this before you manage to provide a solution for #1411.

Many plugins (including steel-toes, kedro-wings, kedro-neptune and kedro-mlflow(e.g. here, here and here) needs to access the config_loader inside hooks (either to access credentials or a proprietary config file). This is described with many details in issue #506.

The key idea is that we theoretically should be able to do the following :

import settings
config_loader=settings.CONFIG_LOADER_CLASS(*settings.CONFIG_LOADER_ARGS)

inside a hook to retrieve some configuration. However, we have no way to retrieve the environment (--env) used, which is a key argument to retrieve the right configuration.

Notice that none of the workaround described in #506 works with 0.18. The discussion in #1411 shows that we could be able to create a custom context which would export the variable as an environment variable, and then get it back when needed but this looks that an horrible solution and I would hate having to force kedro-mlflow users use a complicated and unintuitive setup for doing simple things which currently works automatically. It really feels like a regression since it currently works "as is".

I hope I was convincing :)

P.S: Obviously, when #1411 will be solved natively in the core library, this would make sense to get rid of this global variable.

noklam commented 2 years ago

@Galileo-Galilei A quick question before I finished reading all the issues. The problem is you no longer have access to session/context/config loader, which is probably already a problem since the removal of get_current_session(). What are the current workaround you are using with kedro-mlflow since I can see it supports kedro==0.18.x.

Galileo-Galilei commented 2 years ago

Hi, I currently recreate the get_current_session function at the plugin level by importing the global variable. This is a terrible workaround but it has a low migration impact. If you have a better solution, I'd be glad to use it.

antonymilne commented 2 years ago

@Galileo-Galilei Thanks very much for your comments here. I for one am convinced by your plea! I thought I had a workaround here since env is available in run_params in the before_pipeline_run hook, and so can be used to register an instance of the config_loader that can be used in any subsequent hooks:

class Hooks:
    @hook_impl
    def before_pipeline_run(run_params: Dict[str, Any]):
        config_loader_class = settings.CONFIG_LOADER_CLASS
        return config_loader_class(
            conf_source=str(PROJECT_PATH / settings.CONF_SOURCE),
            env=run_params["env"],
            runtime_params=run_params,
            **settings.CONFIG_LOADER_ARGS,
        )   

... but I'm not immediately sure where we can get PROJECT_PATH.

I've reopened https://github.com/kedro-org/kedro/issues/506. Sorry your comments on this were not noticed before (for future reference, I think many people don't notice Github notifications - if you really need something then probably best to send a message on discord so it doesn't get overlooked).

Ultimately, as I understand it, the fundamental problem here is that there has never been a good way to access the config_loader instance. Is that right? Currently it's possible to hack something together through accessing the current session.

Just to add some further context here, let me copy a discussion we had when the code around config_loader was refactored prior to 0.18. Unfortunately this was on an internal repo before we had joined the LF (link for people who do have access).

What we should deprecate though, would be context.config_loader as the component should be loaded from settings directly (in the KedroSession code and anywhere else we might be using it). A question I had around that was for the case when that property was a useful thing to work with in a jupyter notebook. settings.CONFIG_LOADER_CLASS won't give them the same as context.config_loader, so idk if there should be: a) a method on the KedroSession object (and call session.get_config_loader() or b) somewhere else (which they can import) c) some magic that makes it importable as an instantiated class from kedro.framework.project d) nothing, people should build it manually e) something else I'm missing

There was some user research conducted around whether anyone used context.config_loader and it was found that no one did and instead, if required, users could make a new instance themselves through calling config_loader = settings.CONFIG_LOADER_CLASS(*settings.CONFIG_LOADER_ARGS) like you suggest. i.e. option d from the above list.

However, it seems that the very important case of plugin authors was not considered here. I wish we had seen https://github.com/kedro-org/kedro/issues/506 at the time, because that makes it clear that it would be valuable to have the config_loader instance available (rather than just the class and args given in settings.py, which is clearly not sufficient since it doesn't give env).

Do you think one of the other options (a-c) makes sense here? If access to the session is removed, option a would of course not be possible.

noklam commented 2 years ago

In kedro-mlflow, there are 2 places where session is needed currently. Case 2 will be addressed by providing config_loader, Case 1 will probably need the after_catalog_created hook to retrieve the conf_cred instead.

PROJECT_PATH is only stored inside session's store I am afraid.

1.

def _export_credentials(self, session: KedroSession = None):
        session = session or _get_current_session()
        context = session.load_context()
        conf_creds = context._get_config_credentials()
        mlflow_creds = conf_creds.get(self.server.credentials, {})
        for key, value in mlflow_creds.items():
            os.environ[key] = value

2.

def get_mlflow_config(session: Optional[KedroSession] = None):
    session = session or _get_current_session()
    context = session.load_context()
    try:
        conf_mlflow_yml = context._config_loader.get("mlflow*", "mlflow*/**")
    except MissingConfigException:
        raise KedroMlflowConfigError(
            "No 'mlflow.yml' config file found in environment. Use ``kedro mlflow init`` command in CLI to create a default config file."
        )
    conf_mlflow_yml["project_path"] = context.project_path
    mlflow_config = KedroMlflowConfig.parse_obj(conf_mlflow_yml)
    return mlflow_config
Galileo-Galilei commented 2 years ago

Thank you both for your answers and the time you dedicated to find out a solution for this.

Comments on your answers

@AntonyMilneQB wrote: I thought I had a workaround here since env is available in run_params in the ``before_pipeline_run hook, and so can be used to register an instance of the config_loader that can be used in any subsequent hooks:

I actually found the exact same things and I was about to write that it it seems to work, since project_path is available in run_params

This solves the problem for before_pipeline_run and after_pipeline_run hooks, but not for the other hooks.

@AntonyMilneQB wrote: Ultimately, as I understand it, the fundamental problem here is that there has never been a good way to access the config_loader instance. Is that right? Currently it's possible to hack something together through accessing the current session.

I almost agree. Accessing the ConfigLoader instance is the only use case I am aware of which justifies retrieving the KedroSession after its instantiation. Actually it is the original title of #506 ๐Ÿ˜… That said, and given noklam comments below yours, we may want to retrieve credentials too. there is currently a possibility to do it with the context._get_config_credentials() utility, but we can do this directly through the config loader so this is not a big deal to not have a specific management for credentials.

Functional use cases

The 2 identified use cases where the context is needed are summarized by @noklam above:

I think the problem depends more on where we want to access these informations rather than what we want to do with them:

  1. It seems quite easy to propagate the env and project_path variables to all hooks, but it requires to change their signature which is a breaking change :boom:
  2. If we want 1, maybe it is juste easier to propagate the config loader in all hooks specs. I don't really what is the benefit of recreating it in the hooks. Unfortunately it requires to change hooks signature which is a breaking change :boom:
  3. If we want to avoid breaking changes, a workaround would be to make user store the config_loader instance in the before_pipeline_run hook as an attribute (provided we can access the config loader in before_pipeline_run) to reuse it later:

    class MyHook():
    @hook_impl
    def before_pipeline_run(
        self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
    ) -> None:
    
        self._config_loader= settings.CONFIG_LOADER_CLASS(
            conf_source=(Path(run_params["project_path"]) / settings.CONF_SOURCE).as_posix(),
            env=run_params["env"],
            runtime_params=run_params,
            **settings.CONFIG_LOADER_ARGS,
        )   
    @hook_impl
    def before_node_run(
        self, node: Node, catalog: DataCatalog, inputs: Dict[str, Any], is_async: bool
    ) -> None:
       self._config_loader.get("whatever") # self._config_loader is accessible

What we should deprecate though, would be context.config_loader as the component should be loaded from settings directly (in the KedroSession code and anywhere else we might be using it). A question I had around that was for the case when that property was a useful thing to work with in a jupyter notebook. settings.CONFIG_LOADER_CLASS won't give them the same as context.config_loader, so idk if there should be: a) a method on the KedroSession object (and call session.get_config_loader() or b) somewhere else (which they can import) c) some magic that makes it importable as an instantiated class from kedro.framework.project d) nothing, people should build it manually e) something else I'm missing

Finally, I think that:

What to do

I will give a try in kedro-mlflow to solution 3. While writing this answer I just found out that it may already works as is ๐Ÿ˜…

antonymilne commented 2 years ago

Thanks very much @Galileo-Galilei, that all makes a lot of sense and is very useful. What you're trying to do in MyHooks by saving the self._config_loader so that it's accessible in all subsequent hooks is actually exactly what I meant to write in my suggestion above, just I accidentally put return instead ๐Ÿ˜…

I think this should work well actually (especially because you found project_path is in fact available also). This general pattern is something I've often suggested when the something you want isn't available in the hook specification but was available in a previously executed hook:

class Hooks:
    @hook_impl
    def before_pipeline_run(self, something):
        self._something = something

    @hook_impl
    def before_node_run(self): 
        do_stuff(self._something)

I'm not sure how much I like this, but it does work well. On the one hand, doing it a lot probably indicates that we're not making the right things available in the right hooks; but it does mean that we don't have to pass something to every single hook, so the hook specs don't become too bloated.

I still regard this as a bit of a workaround because requiring the user/plugin author to do settings.CONFIG_LOADER_CLASS(...) just doesn't look that nice to me compared to just making the config_loader instance available.

I left a couple of questions in https://github.com/kedro-org/kedro/issues/506 about what we should actually make available in the hooks ๐Ÿ™‚

P.S. unless I'm missing something, adding an argument to a hook specification shouldn't be a breaking change since I believe you need to always write hook_impl using keyword rather than positional arguments?