Open Galileo-Galilei opened 6 months ago
This issue, among others, was mentioned in the 2024-02-14 tech design session, in which @Galileo-Galilei and @takikadiri showed kedro-boot (https://github.com/takikadiri/kedro-boot/). There was agreement that implementing this, plus making the `Session` re-entrant and more lightweight (#2182, #2879), would be good.
Originally posted by @astrojuanlu in https://github.com/kedro-org/kedro/issues/2169#issuecomment-1945946338
Introduction
Overview
:memo: Document type : The document is mostly "User Research" which focuses on which functional needs should be added to kedro, but it also sometimes slips into a "Design document" which suggests how they could be integrated into the core kedro framework. The 2 parts are clearly separated, because I think the user research (mainly a compilation of existing kedro issues) is worth sharing, even if we do not end up with the proposed implementation.
:busts_in_silhouette: Target Audience : Kedro core team / plugin developers / MLOps engineers who put kedro in production. I try to keep the issue as self-contained as possible, but I still assume the reader knows the default kedro objects (runner, pipeline, catalog...), and how `KedroSession.run` works under the hood.
:pray: Credits : The mind behind part / most of the design and the thoughts described hereafter is @takikadiri. I mostly reformulate, clarify and try to give a comprehensive overview of the issues we are trying to solve and how we solve them.
:books: Additional resources : Most of the features described hereafter are implemented in the kedro-boot plugin; however, the technical implementation sometimes differs for subtle reasons. You can find examples of how to use it for the features described in the "User research" section in the kedro-boot examples repo.
:bangbang: Important note : The `kedro-boot` plugin also provides other features, especially to launch an app from a kedro entrypoint (called standalone mode in this comment). This is out of scope of this issue.
TL;DR
We need to make the session runnable multiple times, and optimize latency to serve lots of use cases (serving, dynamic pipelines, ...). In summary, we should make the pseudo code below possible :
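As a rough illustration of the target usage (not the original pseudo code, and not the kedro API), the sketch below uses a toy stand-in class. `ToySession`, its `run` signature and the `predict` function are all hypothetical names chosen for this example; the only point is that the session is created once and then run several times with different data and parameters.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical type alias: a "pipeline" here is just a function of (inputs, params).
ToyPipeline = Callable[[Dict[str, Any], Dict[str, Any]], Dict[str, Any]]


class ToySession:
    """Toy stand-in for a re-entrant session. This is NOT the kedro API."""

    def __init__(self, pipelines: Dict[str, ToyPipeline]) -> None:
        # Expensive one-off setup (project bootstrap, config loading...) would happen here.
        self._pipelines = pipelines
        self.run_count = 0

    def run(
        self,
        pipeline_name: str,
        inputs: Optional[Dict[str, Any]] = None,
        runtime_params: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        # Data and parameters are given per run, and the outputs are returned in memory.
        self.run_count += 1
        return self._pipelines[pipeline_name](inputs or {}, runtime_params or {})


def predict(inputs: Dict[str, Any], params: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical "inference pipeline": threshold a list of scores.
    threshold = params.get("threshold", 0.5)
    return {"predictions": [score >= threshold for score in inputs["scores"]]}


# Created once, then run as many times as needed.
session = ToySession({"inference": predict})
first = session.run("inference", inputs={"scores": [0.2, 0.9]}, runtime_params={"threshold": 0.5})
second = session.run("inference", inputs={"scores": [0.2, 0.9]}, runtime_params={"threshold": 0.1})
```

The design choice illustrated here is that everything that changes between calls goes through `run`, while everything that is costly and stable is done once at creation time.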
There are a bunch of technical optimisations (for speed, `kedro-viz` compatibility and ability to inject parameters truly at runtime) which are needed under the hood.
User research : embedded deployment patterns
Functional need 1 : Triggering kedro from third party applications which own the entrypoint
There is a very common "deployment pattern" which consists of running the `KedroSession` programmatically in another python program. In practice, this means running (more or less) the following snippet:
Overall, this is well described by the following issue: https://github.com/kedro-org/kedro/issues/2169
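For readers who have not seen this pattern, a minimal sketch of such a snippet is given below, assuming `kedro>=0.18` is installed and `project_path` points to an existing kedro project. The wrapper function name `run_kedro_from_app` is hypothetical; `bootstrap_project`, `KedroSession.create` and `session.run` are the public kedro entry points.

```python
from pathlib import Path
from typing import Any, Dict


def run_kedro_from_app(project_path: str, pipeline_name: str = "__default__") -> Dict[str, Any]:
    """Run a kedro pipeline from a third-party Python program (hypothetical helper)."""
    # Imported inside the function so that defining this helper does not require kedro.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    root = Path(project_path).resolve()
    bootstrap_project(root)  # make the project's settings and pipelines importable
    # A new session is created for every call: with 1 session = 1 run,
    # the same session object cannot be reused for a second run.
    with KedroSession.create(project_path=root) as session:
        return session.run(pipeline_name=pipeline_name)
```

Note that nothing in this call lets the host application pass in-memory data to the pipeline, which is the limitation discussed in the following sections.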
Multiple use cases are well identified:
`session.run()`) instead of trying to "map" kedro objects (e.g. nodes or namespaced pipelines) to airflow tasks automatically.
:white_check_mark: The "naive" code above is valid in `kedro>=0.18`. So... is that already ok? Clearly not, because of the next paragraph.
Functional need 2 : Passing data at runtime and getting the results
This long-demanded feature is described in detail in https://github.com/kedro-org/kedro/issues/2169 so I'll try not to duplicate it here. The point is that almost all use cases described in the previous paragraph need to inject "some data" at runtime, e.g. :
and retrieve the results in memory, e.g. do something like this:
Without the ability to pass data at runtime, none of the use cases presented above is feasible, so `SNIPPET 1` is hardly useful. We need to find a way to circumvent the current limitations to pass data at runtime.
Injecting data & parameters
This is what is covered in https://github.com/kedro-org/kedro/issues/2169.
The current workaround consists in rewriting the `KedroSession.run` method in your app with many private methods, which causes a lot of maintenance issues because it becomes really hard to upgrade your kedro version. We'll cover it in a later paragraph.
Injecting globals & runtime_params
A couple of issues suggest that users do not want to override the full data but only pass some `globals` or `runtime_params` to be resolved in the catalog : https://github.com/kedro-org/kedro/issues/1723. The typical use cases consist in using the catalog to :
There are 2 big problems in the current implementation :
An `extra_params` key exists (note that the name is inconsistent: it is called `params` in the CLI, `extra_params` in the session and `runtime_params` for the resolver; I strongly suggest we normalize this confusing name), but it is currently a `KedroSession.create` argument instead of a `KedroSession.run` argument. This means that we cannot override these extra params on each run, and it tightly couples the 2 methods `create` and `run`. It currently makes sense because Kedro makes the strong assumption that 1 session = 1 run and even raises an exception (https://github.com/kedro-org/kedro/blob/2e64459a021bd22d79bd322d9bb87ea22f30c5f2/kedro/framework/session/session.py#L324-L329) if a user attempts to run it multiple times, but we can see that it is a strong limitation, which is discussed below.
The `KedroSession.run` method accesses the catalog through the `context.catalog` attribute. However, the catalog is already resolved at this step and it is no longer possible to inject globals without rebuilding the entire catalog manually (including rewriting all the load/merge logic between environments ...). The issue #2973 describes the problem in detail and suggests a potential solution. There is a more general issue about decoupling loading and resolving configuration: https://github.com/kedro-org/kedro/issues/2481.
`kedro-boot` attempts to solve this issue by enabling a new type of parametrized query in the catalog with the `[[ ]]` syntax instead of the jinja `${}` syntax, but this feels like a very bad workaround that is not sustainable in the long run. We'll do better with a custom resolver in a future version.
Technical requirements 1 : Speed of execution
The "business" use cases above (especially API serving) require fast execution, hence any overhead induced by kedro negatively impacts their feasibility. A couple of seconds of overhead is acceptable on a 5 min batch, but much less so on an API serving predictions with tight latency requirements (say <100ms). There are at least 3 known performance issues of the `KedroSession` :
`session.load_context()` which is slow. The goal is to provide the save and load version by using `context._get_catalog(...)`, as described in https://github.com/kedro-org/kedro/blob/2e64459a021bd22d79bd322d9bb87ea22f30c5f2/kedro/framework/session/session.py#L334. The aforementioned issue makes clear why it is currently done this way, but this feels like a high cost for very low benefit. If we can avoid calling `load_context`, it will speed up the session run.
Mlflow calls such datasets "artifacts", see https://github.com/Galileo-Galilei/kedro-mlflow/blob/e0033c5072c929a4c26cfaeaf61fcedf93d36522/kedro_mlflow/mlflow/kedro_pipeline_model.py#L174-L219.
Technical sub-requirements 2 : Running multiple times a single session
If we summarize the last paragraphs, it boils down to the need to make the session runnable multiple times, without having to create it again for each run, which tightly couples the 2 methods. We showed that it would enable:
`KedroSession.run`
`extra_params` to `KedroSession.run` instead of `KedroSession.create`
`runtime_params` (or globals) to `KedroSession.run` and resolve the catalog on each run instead of on session instantiation
However, there are deep investigations to carry out before enabling this, because kedro makes the assumption that 1 session = 1 run on purpose. According to the issue, this is done to simplify the kedro-viz experiment tracking functionality. Changing this assumption may be impossible without breaking backward compatibility with existing experiment tracking in kedro-viz, as described in https://github.com/kedro-org/kedro/issues/1273. However, this potential breaking change should be weighed against https://github.com/kedro-org/kedro-viz/issues/1624, which suggests low adoption of the experiment tracking functionality. My feeling is that enabling multiple runs of a session is worth breaking experiment tracking, but I am overly biased here :)
A positive side effect of getting rid of this assumption is that it would solve the original motivation of https://github.com/kedro-org/kedro/issues/1273, e.g. letting people customize the session run_id, which is particularly useful when deploying a kedro pipeline to an orchestrator which assumes task independence (like airflow).
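To make "resolve the catalog on each run" more concrete, here is a minimal pure-Python sketch. It is not how kedro or kedro-boot implement it: `resolve_catalog`, `PerRunResolvingSession` and the example catalog entry are hypothetical, and the `${runtime_params:...}` placeholder is substituted with a simple regular expression rather than a real config loader. The idea is only that the unresolved catalog is loaded once, and placeholders are filled in at `run` time.

```python
import re
from typing import Any, Dict

# Matches placeholders such as ${runtime_params:country}
_PLACEHOLDER = re.compile(r"\$\{runtime_params:(\w+)\}")


def resolve_catalog(raw: Any, runtime_params: Dict[str, Any]) -> Any:
    """Return a copy of `raw` with runtime_params placeholders substituted."""
    if isinstance(raw, str):
        return _PLACEHOLDER.sub(lambda m: str(runtime_params[m.group(1)]), raw)
    if isinstance(raw, dict):
        return {key: resolve_catalog(value, runtime_params) for key, value in raw.items()}
    return raw


class PerRunResolvingSession:
    """Toy session: keeps the raw catalog, resolves it on every run (hypothetical)."""

    def __init__(self, raw_catalog: Dict[str, Any]) -> None:
        self._raw_catalog = raw_catalog  # loaded once, left unresolved

    def run(self, runtime_params: Dict[str, Any]) -> Dict[str, Any]:
        # A real session would now build datasets and execute the pipeline;
        # this sketch simply returns the catalog resolved for this run.
        return resolve_catalog(self._raw_catalog, runtime_params)


RAW_CATALOG = {
    "sales": {
        "type": "pandas.CSVDataset",
        "filepath": "data/${runtime_params:country}/sales.csv",
    }
}

session = PerRunResolvingSession(RAW_CATALOG)
catalog_fr = session.run({"country": "fr"})
catalog_us = session.run({"country": "us"})
```

Because resolution returns a new dictionary each time, the raw catalog is never mutated and two consecutive runs cannot leak parameters into each other.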
Technical requirements 2 : Ease of use
Database connection lazy loading
One current issue people are facing when using the session, particularly in interactive mode (but not only), is that database connections & API calls are instantiated on `KedroSession.create`, which means that you cannot run your pipeline if another pipeline is not correctly instantiated. I think tackling this issue would simplify debugging and improve ease of use: https://github.com/kedro-org/kedro/issues/2829.
In general, we should be able to run a pipeline from a session even if other pipelines are broken (import errors, invalid data connections...), but this is a hard problem and not the top priority.
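One possible way to picture lazy loading is sketched below with a toy dataset class. This is an illustration under stated assumptions, not kedro's dataset API: `LazyDataset`, `fake_connect` and the connection URLs are hypothetical. The connection is opened on first access instead of at construction time, so a broken entry only fails when it is actually used.

```python
from functools import cached_property
from typing import Callable, Dict, List

opened: List[str] = []  # records which connections were actually opened


def fake_connect(url: str) -> Dict[str, str]:
    # Stand-in for a real database driver; fails for the "broken" URL.
    if "broken" in url:
        raise ConnectionError(f"cannot connect to {url}")
    opened.append(url)
    return {"url": url}


class LazyDataset:
    """Toy dataset that opens its connection on first use (hypothetical)."""

    def __init__(self, url: str, connect: Callable[[str], Dict[str, str]] = fake_connect) -> None:
        # Only store the configuration: nothing is opened here.
        self._url = url
        self._connect = connect

    @cached_property
    def connection(self) -> Dict[str, str]:
        # Evaluated once, on first access, then cached on the instance.
        return self._connect(self._url)

    def load(self) -> str:
        return f"rows from {self.connection['url']}"


# Building the whole catalog succeeds even though one entry is broken.
catalog = {
    "healthy": LazyDataset("sqlite://healthy"),
    "broken": LazyDataset("sqlite://broken"),
}
rows = catalog["healthy"].load()
```

With eager connections, constructing the `"broken"` entry would already raise and prevent the pipeline that only needs `"healthy"` from running.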
Create a public API for consistent access to kedro objects between versions
The lack of a public API makes code supporting the above use cases become obsolete very fast as kedro versions change, because it relies on a lot of private methods. This severely degrades the developer experience for kedro users :
There is a long standing issue to help make consistent access to kedro objects for plugin developers https://github.com/kedro-org/kedro/issues/779, and creating a public API to support such use cases is a good step forward.