kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.53k stars 877 forks

Provide a lightweight solution to speed up session reload or create new session #2879

Open noklam opened 11 months ago

noklam commented 11 months ago

Quotes

Carlos Barreto We are using Kedro as part of an event stream + Amazon ECS solution. What they want to check is if there is a way to always have the Kedro context up and running, with an API call to execute the pipeline only when necessary. I was thinking that this is possible by programmatically generating the KedroContext, making it a global service, and only using specific pipeline calls. But I don’t know if we have any similar use cases implemented already, and I wanted to get some opinions on it. Today, we run something like a kedro run inside the container, every time, and this ends up spending important warm-up seconds loading the context/dependencies into memory.

Description

As I do a lot of development work with IPython or Jupyter, I often want to make a small change and test whether it works. %reload_kedro can be quite slow, and the development experience is frustrating because every change requires a reload.

This is also potentially related to #1853, #2134, #2182.

kedro ipython takes > 20s to start and %reload_kedro takes

Context

After this PR, a session can only be run once. The easiest way to create a new session is %reload_kedro. While %reload_kedro works, it is considerably slow with big projects for a few reasons:

INFO Registered line magic 'run_viz' __init__.py:115

What's the minimal effort to recreate a session?

If we look into the code, there is a self._run_called attribute, and every time we call session.run it checks whether that attribute is True. https://github.com/kedro-org/kedro/blob/6913acdfd55898f956b6d91fc4602fbdb011a5d1/kedro/framework/session/session.py#L434-L438

https://github.com/kedro-org/kedro/blob/6913acdfd55898f956b6d91fc4602fbdb011a5d1/kedro/framework/session/session.py#L366-L371

Why do we need this check? Mainly because session_id needs to be unique; otherwise it can cause errors in experiment tracking (kedro-viz), which relies on it as a unique id. If we simply set session._run_called = False and call session.run() again, almost everything will work.
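To make the check and the workaround concrete, here is a minimal sketch. ToySession and SessionAlreadyRunError are hypothetical stand-ins written for this comment, not Kedro's actual classes; only the _run_called flag and the session_id attribute are taken from the discussion above.

```python
class SessionAlreadyRunError(Exception):
    """Hypothetical stand-in for the error raised on a second run."""


class ToySession:
    """Simplified model of a run-once session; not Kedro's KedroSession."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self._run_called = False

    def run(self) -> str:
        # Refuse a second run so that one session_id maps to one run.
        if self._run_called:
            raise SessionAlreadyRunError("This session has already been run.")
        self._run_called = True
        return self.session_id


session = ToySession("2023-08-01T12.00.00.000Z")
first = session.run()

# Workaround described above: clear the flag by hand, then run again.
session._run_called = False
second = session.run()

# Both runs share one session_id, which is exactly the uniqueness concern.
assert first == second
```

The second run succeeds, but anything keyed on session_id can no longer tell the two runs apart.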

Experiment tracking is not a core feature of kedro (it belongs to kedro-viz). Is there any other obvious reason that we need to protect a session_id from being run twice?

(edited) It could be related to the timestamp used for saving versioned data. However, it's unclear to me: the catalog gets save_version from session_id, but there is another function that you can find in most dataset implementations.

save_version = self.resolve_save_version()
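My reading of the two sources of a save version can be sketched as follows. This is an assumption-laden toy, not Kedro's dataset code: ToyVersionedDataset is hypothetical, and the timestamp format in generate_timestamp is assumed rather than copied from the source.

```python
from datetime import datetime, timezone
from typing import Optional


def generate_timestamp() -> str:
    """Return a UTC timestamp with millisecond precision (assumed format)."""
    now = datetime.now(tz=timezone.utc)
    return now.strftime("%Y-%m-%dT%H.%M.%S.%f")[:-3] + "Z"


class ToyVersionedDataset:
    """Hypothetical dataset used only to illustrate the fallback."""

    def __init__(self, save_version: Optional[str] = None) -> None:
        self._save_version = save_version

    def resolve_save_version(self) -> str:
        # Prefer a version handed down from outside (for example one
        # derived from session_id); otherwise generate a new timestamp.
        return self._save_version or generate_timestamp()


from_session = ToyVersionedDataset("2023-08-01T12.00.00.000Z")
standalone = ToyVersionedDataset()

assert from_session.resolve_save_version() == "2023-08-01T12.00.00.000Z"
assert standalone.resolve_save_version().endswith("Z")
```

If this reading is right, a dataset that receives its version from the session would reuse the same version on a second run, while a standalone dataset would get a new one each time.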

Possible Implementation

Source: https://github.com/kedro-org/kedro/issues/1551#issue-1239180609

(Bonus) - KedroSession.reset() to create a new session easily? - this can potentially make the Jupyter workflow nicer. Instead of asking users to create their session with lots of details, they can just take the global session and call session.reset() https://github.com/kedro-org/kedro/pull/1571

Maybe implement a session.clear() or session.reset() method.
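A rough sketch of what such a reset() could look like, on a toy class. ResettableSession and its reset() method are hypothetical and do not exist in Kedro; the uuid-based id is an assumption made to keep the example short, and a real session_id may need a different scheme.

```python
import uuid


class ResettableSession:
    """Hypothetical run-once session with a reset(); not Kedro's API."""

    def __init__(self) -> None:
        self.session_id = self._new_session_id()
        self._run_called = False

    @staticmethod
    def _new_session_id() -> str:
        # Assumption: any unique string works as an id in this sketch.
        return uuid.uuid4().hex

    def run(self) -> str:
        if self._run_called:
            raise RuntimeError("This session has already been run.")
        self._run_called = True
        return self.session_id

    def reset(self) -> None:
        # Issue a fresh id and clear the flag so the next run is distinct.
        self.session_id = self._new_session_id()
        self._run_called = False


session = ResettableSession()
first_id = session.run()
session.reset()
second_id = session.run()

assert first_id != second_id
```

Unlike flipping _run_called by hand, this keeps each run tied to its own id while avoiding a full reload.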

Possible Alternatives

noklam commented 10 months ago

Muhammed Afnas 12:03 PM Hi everyone, can we initiate multiple sessions in kedro? If yes, could anyone help me with it? Kedro version - 0.18. I am building a web application in which I have to trigger different pipelines of a kedro project based on button clicks in the Dash UI. As of now, each one works individually, but when one session is running and I try to trigger another session, it gives a runtime error.

astrojuanlu commented 10 months ago

Experiment tracking is not a core feature of kedro (it belongs to kedro-viz). Is there any other obvious reason that we need to protect a session_id from being run twice?

I recall there's some issue about session_id that @datajoely identified in his research. Maybe it's related?

noklam commented 10 months ago

That's more related to orchestration, and it requires a way to pass a unique identifier when the run is spread across multiple KedroSession instances.

datajoely commented 10 months ago

session_id is used for versioning too, which is why it needs to be alphabetically sortable.

Arguably if we kept a private session_id and exposed a parameterisable one that would be sufficient
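As a small illustration of the sortability point: a zero-padded timestamp string sorts alphabetically in the same order as the datetimes it encodes. The format string below is an assumption about what such an id looks like, and to_version is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumed id format: fields ordered from most to least significant.
VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%f"


def to_version(moment: datetime) -> str:
    """Format a datetime as a millisecond-precision id string."""
    return moment.strftime(VERSION_FORMAT)[:-3] + "Z"


start = datetime(2023, 8, 1, 9, 30, tzinfo=timezone.utc)
moments = [start + timedelta(hours=h) for h in (26, 0, 3)]
versions = [to_version(m) for m in moments]

# Plain string sorting agrees with chronological sorting.
assert sorted(versions) == [to_version(m) for m in sorted(moments)]
```

A free-form, user-supplied id would not carry this guarantee, which is one argument for keeping a private sortable id next to a parameterisable one.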

astrojuanlu commented 10 months ago

Uh, we're sorting by session_id? Maybe we should store the datetime instead, but this might be a bit of a digression.

datajoely commented 10 months ago

The session_id was the Versioning ID way back when - @merelcht @idanov can provide more context here

astrojuanlu commented 5 months ago

Moving this to the Session milestone.