kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.91k stars 900 forks source link

Remove kedro.line_magic entrypoint #878

Closed Galileo-Galilei closed 6 months ago

Galileo-Galilei commented 3 years ago

This issue has changed quite a bit since it was originally written. Look at the final post here to see what we now need to do.

Original issue here for posterity.


Description

Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."

I am the author of several kedro plugins either open souce (kedro-mlflow) or at work. It turns out that most plugins either perform one (or several) of the following tasks:

Since Kedro is a data science framework, and since Jupyter is a standard for the exploration phase of a data science project, it seems natural to make Kedro compatible with Jupyter. For now, Kedro offers the following functionnality:

This works well for the core library, but it would be a more pleasant user experience to improve plugin compatibilities with notebooks with the following actions:

Context

Why is this change important to you? How would you use it? How can it benefit other users?

It is a common pattern to "setup" your plugin, e.g. instantiate a connection to an external service (kedro-mlflow creates a connection to the mlflow server, I do have a sas plugin which creates a connection to the database...), especially in ther "before pipeline_run" hook.

If I create a custom line_magic (say %reload_kedro_mlflow), it is discovered by kedro and it is accessible inside the notebook or the ipython session. However, I'd like to run it automatically when the notebook is opened via the Kedro CLI instead of forcing the user to make it manually.

Possible Implementation

(Optional) Suggest an idea for implementing the addition or change.

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

It is currently possible to bypass Kedro configuration by creating a load_ipython_extension function to execute code when opening the notebook (like here: https://github.com/quantumblacklabs/kedro/blob/fc9d1c5d35981c51a651af467e79e93df065d354/kedro/extras/extensions/ipython.py#L121). However, we cannot access the objects created by reload_kedro (especially the session or the catalog if we want to modify /use them). We need to duplicate almost all the code to make it works, and it prevents combining several plugins.

antonymilne commented 3 years ago

Thanks for this very clear write up! I'm wondering how this would fit into current ideas around improving the jupyter/ipython workflow (as per 0.17.5 release notes, we have removed .ipython/profile_default/startup/ from the Kedro project template in favour of .ipython/profile_default/ipython_config.py and kedro.extras.extensions.ipython). @idanov

add documentation on how to create a custom line magic in your plugin, because this feature is undocumented yet. I had to read the code to discover it and I struggled a little to make it work². I can make a PR, maybe by adding a sub-section in the plugin section?

This is an excellent idea I think. As far as I can see the kedro.line_magic entry point is indeed completely undocumented and would fit in well on that page. I've actually been meaning to make that page a bit clearer for a while, but if you did want to add a subsection for how to write custom line magic then that would be awesome.

add the possibility to execute some functions / push global variables to the notebook session when it is openened via a Kedro CLI command

Just to understand, this would effectively give users the ability to add their own custom code that would be executed upon running %reload_kedro?

Galileo-Galilei commented 3 years ago

Yes, I've seen how you refactored the ipython part for the next release and it seems much simpler like this.

Regarding the documentation, I agree that apart from the line_magic, the plugin section is a bit sparse and may need more details for beginners (I often find myself digging in the code to see how things are implement under the hood, or look at kedro-viz as an example). I'd try to suggest some improvements when I have time but this is not a top priority for me to be honest.

Just to understand, this would effectively give users the ability to add their own custom code that would be executed upon running %reload_kedro?

Yes. The current workflow for a user is:

  1. Open the notebook with kedro jupyter notebook
  2. Call %reload_kedro_mlflow inside a cell
  3. (Optional) If modifications are made to the kedro objects / configuration (catalog, pipelines, mlflow.yml config) and the user needs to reload them, call:
    %reload_kedro
    %reload_kedro_mlflow

I wish it were:

  1. Open the notebook with kedro jupyter notebook
  2. (Optional) If modifications are made to the kedro objects / configuration (catalog, pipelines, mlflow.yml config) and the user needs to reload them, call %reload_kedro inside a cell

It is honestly not a big deal and the first workflow is quite satisfying, I just think that it may be more user friendly to hide the "kedro-mlflow" part from the user.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

antonymilne commented 2 years ago

Somehow it's nearly a year now since this was posted, but I have some fresh thoughts on it now! For context, we've been working on the kedro-viz %run_viz line magic recently (https://github.com/kedro-org/kedro-viz/pull/1012). Now that I understand more about how these line magics work I've realised that our current setup is not great.

tl;dr: let's go for the proposal under Possible Alternatives above and remove the kedro.line_magic entrypoint altogether. It's very weak and doesn't add anything useful since we can achieve everything we want just through %load_ext already.

Background

So let's remove the entrypoint!

If we can agree a good plan with @Galileo-Galilei then I would even be happy to do this as a non-breaking release, after doing a user survey to double check that no one else is using it.

What would replace it?

Here's the requirements:

  1. essential: ability to add custom behaviour into a notebook, e.g. to register line magics, push variables to scope, etc.
  2. nice to have: such code would automatically run when %load_ext kedro is run (or maybe when reload_kedro is run, not sure)

Essential (requirement 1)

As far as I can tell, satisfying requirement 1 is trivial because this is exactly what's already provided by IPython extensions! If you write an IPython extension then it can then do whatever you like - no need to be constrained like you currently are to just writing line magics with local_ns in their signature.

The user flow in a managed jupyter instance would then be:

%load_ext kedro
only needed if you're not in the project directory or want to change env etc.: %reload_kedro <path>
%load_ext kedro_mlflow
%load_ext kedro_viz

This is basically the Possible Alternatives proposal from @Galileo-Galilei above. The crucial thing is that I believe this statement is not correct:

However, we cannot access the objects created by reload_kedro (especially the session or the catalog if we want to modify /use them). We need to duplicate almost all the code to make it works, and it prevents combining several plugins.

So long as %reload_kedro has run (which is called by %load_ext kedro) then these variables are exactly what's available inside local_ns. i.e. you can do local_ns["catalog"] inside your kedro-mlflow line magic. Probably you don't even need local_ns and could get these out of get_ipython() somewhere.

Nice to have (requirement 2)

If we wanted to, this could be achieved by introducing a new, more general IPython entrypoint that runs when %load_ext or %reload_kedro is run. Or probably better, as in the original proposal, a hook. Technically this doesn't need to take any arguments since you can always do get_ipython, but probably for convenience we would expose catalog etc.

However, introducing this hook feels like a big addition for what would be a small increase in convenience for a small and infrequently used feature (basically just @Galileo-Galilei + kedro-viz). Given that all it does it remove the need for a user to type one extra line of code (%load_ext ...) I think it's not worth doing. The above solution of each plugin handling its own IPython extension feels clean and not having kedro automatically loading them also feels clean to me. If there's more call for adding this as a convenience in future then we could always add it, but honestly it feels very unlikely that many people would want it.

Please comment!

From the kedro side, I think the only possible objection would be that the flow for doing %run_viz would now require you to do %load_ext kedro_viz first. Personally I think this is totally fine. Maybe %load_ext kedro_viz would even execute %run_viz automatically (like %load_ext kedro runs %reload_kedro).

We should also check that this doesn't break everything. i.e. if someone upgraded to a version of kedro without this entrypoint but still had things registered to it, would everything crash horribly? What happens if you pip install kedro-viz and the entrypoint isn't defined?

@Galileo-Galilei what do you think of this all? As above, if it's a plan you are happy with then I think we should just do it after checking that no one else cares in a user survey.

antonymilne commented 2 years ago

I should also clarify that the above local_ns trick does work in terms of hot-reloading the variables, i.e. it's not stuck on the initial values of these variables populated when %reload_kedro first executes.

Galileo-Galilei commented 2 years ago

Thank you for the very detailed answer!

If we can agree a good plan with @Galileo-Galilei then I would even be happy to do this as a non-breaking release, after doing a user survey to double check that no one else is using it.

Completely happy with it, and not considering it as a breaking change:

This is basically the Possible Alternatives proposal from @Galileo-Galilei above

Glad to go with this option!

So long as %reload_kedro has run (which is called by %load_ext kedro) then these variables are exactly what's available inside local_ns. i.e. you can do local_ns["catalog"] inside your kedro-mlflow line magic. Probably you don't even need local_ns and could get these out of get_ipython() somewhere.

I figured this out muuuuuuuuch longer after I wrote the original post :sweat_smile:, but thank you for the indication!

However, introducing this hook feels like a big addition for what would be a small increase in convenience for a small and infrequently used feature

Completly agree on this. The feature seems too little used to be worth the engineering effort, especially since your "Requirement 1" would already be a very convenient setup.

We should also check that this doesn't break everything. i.e. if someone upgraded to a version of kedro without this entrypoint but still had things registered to it, would everything crash horribly? What happens if you pip install kedro-viz and the entrypoint isn't defined?

Indeed, it would be a pity to have a lot of users complaining because you removed something unused !

@Galileo-Galilei what do you think of this all? As above, if it's a plan you are happy with then I think we should just do it after checking that no one else cares in a user survey.

Let's go! :rocket:

antonymilne commented 2 years ago

Awesome, thanks for the confirmation @Galileo-Galilei. Looking at the code again actually there's no way this can break anything since Kedro actually doesn't expose particular entrypoint that are available for use; it just tries to load things up using kedro.framework.cli.utils.load_entry_point. Hence if we don't look for the line_magic entrypoint at all then it doesn't matter if there's things registered there. They'll just be ignored and not break anything.

The actions here are:

Initial checks

Implement it!

antonymilne commented 2 years ago

Notes from technical design session on 31 August:

noklam commented 2 years ago

I am slightly in favor of loading %run_viz automatically.

Existing workflows

  1. kedro ipython/jupyter -> %reload_kedro (If you are in local)
  2. %load_ext kedro -> %reload_kedro (If you are in JupyterHub/Databricks)

I would expect 1 is a more common case, so I think most people would only interact with %reload_kedro, and therefore they should interact with %run_viz equivalently. We could also add more line magic in the future (maybe a line magic to save a static output or something related to experiment tracking?), this makes me less prefer overloading the load_ext too much.

Overall I agree the existing implementation is incorrect (arguably it's a bug, just that no one is using it), but I would go for a lower effort option which may just be fixing the local_scope, fixing the line magic to accept the correct arguments (if we haven't already fixed it?)

antonymilne commented 2 years ago

@noklam I definitely see the argument for %load_ext kedro_viz.ipython not running kedro viz automatically. Note this idea is not so different from how the kedro IPython extension works though:

Compare to possible kedro-viz IPython workflow:

I don't feel strongly about %load_ext kedro_viz.ipython automatically running kedro-viz though - it does also feel a little unusual to me also.

What I think is more important is the existence of the line_magic entrypoint, and the more I think about this the more I still think we should remove it. Fixing this line_magic entrypoint needs a bit more than just moving the needs_local_scope I think:

None of these are hard things to do but I think the effort is comparable to just removing it. Adding the on_ipython_loaded hook would also be very easy and a much better solution if we do still want %run_viz to be registered automatically (i.e. without needing to do %load_ext kedro_viz.ipython).

One advantage of an explicit %load_ext kedro_viz.ipython that I forgot to mention before is that it would enable %unload_ext kedro_viz.ipython, which could be used to kill kedro-viz instances. This is something we don't currently have a way to do and would otherwise require a new line magic %kill_viz or something. The load/unload mechanism feels much nicer though.


Let me try and paint the picture in a different way: if this line_magic entrypoint didn't already exist in kedro, would we be trying to add it? I think very unlikely. It would be a non-trivial increase in the API surface for a small convenience for one particular use case that is not common (one less line to type if you want to launch kedro-viz from Jupyter). Given that the effort to remove it is about the same as that required to fix it, and removing it would be de-facto non-breaking, I still don't really see any point in keeping it.

It's such a minor thing though that I'm also happy just to leave it as it is rather than removing or fixing it... 😀