Investigate how session/context should be re-designed to work well for API use-cases

datajoely commented 1 year ago

Discussed in https://github.com/kedro-org/kedro/discussions/2134

Originally posted by **illia-shkroba** December 16, 2022

Hello. I'm trying to build a REST API with FastAPI that runs a Kedro pipeline under the hood, and I came up with this solution:

```python
import pathlib
from typing import Any, Iterable

from fastapi import Depends, FastAPI
from kedro.framework.context import KedroContext
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

app = FastAPI(
    title="FastAPI + Kedro",
    version="0.0.1",
    license_info={
        "name": "GNU GENERAL PUBLIC LICENSE",
        "url": "https://www.gnu.org/licenses/gpl-3.0.html",
    },
)


def get_session() -> Iterable[KedroSession]:
    bootstrap_project(pathlib.Path().cwd())
    with KedroSession.create() as session:
        yield session


def get_context(session: KedroSession = Depends(get_session)) -> Iterable[KedroContext]:
    yield session.load_context()


@app.get("/")
def index(
    session: KedroSession = Depends(get_session),
    context: KedroContext = Depends(get_context),
) -> dict[str, Any]:
    session.run("math")
    catalog = context.catalog
    return catalog.load("output")
```

`session.run("math")` runs a simple pipeline that calculates the variance of the input `[1, 2, 3]`.

The solution seems to work as expected, but it takes nearly 2.1 seconds to finish a request:

```sh
time curl http://127.0.0.1:8000
# curl http://127.0.0.1:8000  0.00s user 0.00s system 0% cpu 2.097 total
```

I've noticed that `session.load_context()` takes about 1 second to finish. I've also found that `load_context()` is used by `session.run()`:

```python
session_id = self.store["session_id"]
save_version = session_id
extra_params = self.store.get("extra_params") or {}
context = self.load_context()
```

It seems that `load_context()` is called twice during the request:

1. Inside of `get_context()`.
2. Inside of `session.run()`.

I've tried to cache the result of `session.load_context()` like this:

```python
def get_context(session: KedroSession = Depends(get_session)) -> Iterable[KedroContext]:
    context = session.load_context()
    session.load_context = lambda: context
    yield context
```

By doing that, I've decreased the request processing time to 1.06 seconds:

```sh
time curl http://127.0.0.1:8000
# curl http://127.0.0.1:8000  0.00s user 0.00s system 0% cpu 1.062 total
```

Do you have any suggestions on how I can further optimize `session.run()`? Should I try a different approach with a plain `DataCatalog`/`SequentialRunner`? Or maybe Kedro's implementation of `load_context()` should be modified to use some caching?
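The caching workaround described in the post can be illustrated without Kedro installed. The following is a minimal sketch, not Kedro's implementation: `StubSession` and `memoize_load_context` are hypothetical names introduced here, and the stub only mimics the one behaviour that matters for the timing, namely that `run()` calls `load_context()` internally.

```python
from typing import Any


class StubSession:
    """Hypothetical stand-in for KedroSession; counts context loads."""

    def __init__(self) -> None:
        self.load_calls = 0

    def load_context(self) -> dict[str, Any]:
        # Stands in for the step reported to take about 1 second.
        self.load_calls += 1
        return {"catalog": {"output": "result"}}

    def run(self, pipeline_name: str) -> dict[str, Any]:
        # Mirrors the excerpt from the post: run() loads the context itself.
        context = self.load_context()
        return context["catalog"]


def memoize_load_context(session: StubSession) -> None:
    """Replace session.load_context with a version that loads only once."""
    original = session.load_context  # bound method, captured before patching
    cache: list[dict[str, Any]] = []

    def cached() -> dict[str, Any]:
        if not cache:
            cache.append(original())
        return cache[0]

    # The instance attribute shadows the class method for this session only.
    session.load_context = cached


session = StubSession()
memoize_load_context(session)
context = session.load_context()  # first (and only) real load
session.run("math")               # reuses the cached context
```

With the patch in place, the stub's load counter stays at one even though both the dependency and `run()` ask for the context, which matches the drop from two loads to one reported in the post.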

merelcht commented 1 year ago

noklam commented 1 year ago

I think this is related too. We need to document and understand what the use case is and what improvements we can make.

https://github.com/kedro-org/kedro/issues/1846

There are some questions we want to ask.

How common is it for a Kedro pipeline to be exposed as a web endpoint?

Summary (To be updated)

- A session can be used only once.
- Session creation is slow. Creating a session for every API call is unsuitable when the API runs lots of small pipelines, because the overhead is significant.
- A runner is often used to get around the "one session, one run" assumption, interacting directly with lower-level objects like `DataCatalog`.
- APIs often need data injection (some parameters): https://github.com/kedro-org/kedro/discussions/795. How can we make a Kedro pipeline work better with a RESTful API? Is there an easy way for a user to pass extra data (common in a RESTful call with JSON) and trigger a Kedro pipeline?
  - One example is a custom runner that interacts with the catalog directly (I am guessing it uses `.add_feed_dict` to inject data): https://github.com/kedro-org/kedro/issues/2169#issuecomment-1447001865
- What's the downside of using a runner?
  - The hook system is built for the session rather than the runner.
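The pattern in this summary (build the expensive objects once, then inject request data and run per call) can be sketched in plain Python. This is an illustration under stated assumptions, not Kedro's API: `Node`, `PIPELINE`, `BASE_DATA`, and `run_request` are hypothetical names, the dictionary stands in for a data catalog, and population variance is assumed for the `math` pipeline from the original post.

```python
from typing import Any, Callable, Mapping

# Hypothetical simplification of a pipeline node: (function, input name, output name).
Node = tuple[Callable[[Any], Any], str, str]


def variance(values: list[float]) -> float:
    """Population variance (an assumption; the post does not say which kind)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Built once at application startup, which is the expensive part in a real project.
PIPELINE: list[Node] = [(variance, "input", "output")]
BASE_DATA: Mapping[str, Any] = {}


def run_request(payload: Mapping[str, Any]) -> dict[str, Any]:
    """Run the pipeline for one request with the payload injected."""
    # Per-request copy, so shared startup state is never mutated.
    data = {**BASE_DATA, **payload}
    for func, input_name, output_name in PIPELINE:
        data[output_name] = func(data[input_name])
    return data


result = run_request({"input": [1, 2, 3]})
```

The sketch keeps per-request state separate from startup state, which is the property the "one session, one run" assumption currently makes hard to get. It does not address hooks, which the summary notes are tied to the session rather than the runner.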

astrojuanlu commented 7 months ago

Moving this to the Session milestone

astrojuanlu commented 5 months ago

In light of the interest that kedro-boot is getting (lots of mentions in Slack), that the authors @takikadiri and @Galileo-Galilei have already put a lot of thought into its design, and that we mostly agreed in Tech Design https://github.com/kedro-org/kedro/issues/2169#issuecomment-1945946338 that this is an idea worth pursuing, are we ready for at least a first exploration of this issue from a technical standpoint?

@merelcht @rashidakanchwala I often hear that "the KedroSession was created for Experiment Tracking", do you happen to have any pointers? And besides, should https://github.com/kedro-org/kedro-viz/issues/1624 be a blocker?

datajoely commented 5 months ago

Can I please volunteer myself for a user interview on how my teams have approached this!

kedro-org / kedro

Investigate how session/context should be re-designed to work well for API use-cases #2182