kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.89k stars, 897 forks

Investigate how session/context should be re-designed to work well for API use-cases #2182

Open datajoely opened 1 year ago

datajoely commented 1 year ago

Discussed in https://github.com/kedro-org/kedro/discussions/2134

Originally posted by **illia-shkroba** December 16, 2022

Hello. I'm trying to build a REST API with FastAPI that runs a Kedro pipeline under the hood, and I came up with this solution:

```python
import pathlib
from typing import Any, Iterable

from fastapi import Depends, FastAPI
from kedro.framework.context import KedroContext
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

app = FastAPI(
    title="FastAPI + Kedro",
    version="0.0.1",
    license_info={
        "name": "GNU GENERAL PUBLIC LICENSE",
        "url": "https://www.gnu.org/licenses/gpl-3.0.html",
    },
)


def get_session() -> Iterable[KedroSession]:
    bootstrap_project(pathlib.Path().cwd())
    with KedroSession.create() as session:
        yield session


def get_context(session: KedroSession = Depends(get_session)) -> Iterable[KedroContext]:
    yield session.load_context()


@app.get("/")
def index(
    session: KedroSession = Depends(get_session),
    context: KedroContext = Depends(get_context),
) -> dict[str, Any]:
    session.run("math")
    catalog = context.catalog
    return catalog.load("output")
```

`session.run("math")` runs a simple pipeline that calculates the variance of the input `[1, 2, 3]`.

The solution seems to work as expected, but a request takes nearly 2.1 seconds to finish:

```sh
time curl http://127.0.0.1:8000
# curl http://127.0.0.1:8000  0.00s user 0.00s system 0% cpu 2.097 total
```

I've noticed that `session.load_context()` takes about 1 second to finish. I've also found that `load_context()` is used by `session.run()`:

```python
session_id = self.store["session_id"]
save_version = session_id
extra_params = self.store.get("extra_params") or {}
context = self.load_context()
```

It seems that `load_context()` is called twice during the request:

1. Inside of `get_context()`.
2. Inside of `session.run()`.

I've tried to cache the result of `session.load_context()` like this:

```python
def get_context(session: KedroSession = Depends(get_session)) -> Iterable[KedroContext]:
    context = session.load_context()
    session.load_context = lambda: context
    yield context
```

By doing that, I've decreased the request processing time to 1.06 seconds:

```sh
time curl http://127.0.0.1:8000
# curl http://127.0.0.1:8000  0.00s user 0.00s system 0% cpu 1.062 total
```

Do you have any suggestions on how I can further optimize `session.run()`? Should I try a different approach with a plain `DataCatalog`/`SequentialRunner`? Or maybe Kedro's implementation of `load_context()` should be modified to use some caching?
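
The caching workaround described in the post can be illustrated without Kedro or FastAPI installed. The following is a minimal sketch, not Kedro's implementation: `StubSession` is a hypothetical stand-in whose `run()` calls `load_context()` internally, as the `session.run()` excerpt in the post does, and `memoize_load_context` is an illustrative helper name. The returned context value is a placeholder.

```python
from typing import Any, Callable


class StubSession:
    """Hypothetical stand-in for a session object; NOT Kedro's KedroSession."""

    def __init__(self) -> None:
        self.load_calls = 0  # counts how often the expensive load really runs

    def load_context(self) -> dict[str, Any]:
        # Stands in for the expensive step (reading config, building a catalog).
        self.load_calls += 1
        return {"catalog": {"output": "placeholder result"}}

    def run(self, pipeline_name: str) -> None:
        # Mirrors the behaviour noted in the post: run() loads the context itself.
        self.load_context()


def memoize_load_context(session: StubSession) -> StubSession:
    """Replace session.load_context with a version that loads at most once."""
    original: Callable[[], dict[str, Any]] = session.load_context
    cache: list[dict[str, Any]] = []

    def cached() -> dict[str, Any]:
        if not cache:
            cache.append(original())
        return cache[0]

    # The instance attribute shadows the method, so run() also sees the cache.
    session.load_context = cached  # type: ignore[method-assign]
    return session


session = memoize_load_context(StubSession())
context = session.load_context()  # first call performs the real load
session.run("math")               # second call is served from the cache
print(session.load_calls)
```

Unlike the lambda in the post, this variant loads lazily on first use. It shares the same limitation: the cache lives only as long as the session object, so a session created per request still pays for one full load on each request.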

merelcht commented 1 year ago

Related discussion: https://github.com/kedro-org/kedro/issues/2169#issuecomment-1445426832

noklam commented 1 year ago

I think this is related too. We need to document and understand what the use case is and what improvements we can make.

There are some questions we want to ask.

Summary (To be updated)

astrojuanlu commented 7 months ago

Moving this to the Session milestone

astrojuanlu commented 5 months ago

In light of the interest that kedro-boot is getting (lots of mentions in Slack), the fact that its authors @takikadiri and @Galileo-Galilei have already put a lot of thought into its design, and that we mostly agreed in Tech Design (https://github.com/kedro-org/kedro/issues/2169#issuecomment-1945946338) that this is an idea worth pursuing, are we ready for at least a first exploration of this issue from a technical standpoint?

@merelcht @rashidakanchwala I often hear that "the KedroSession was created for Experiment Tracking", do you happen to have any pointers? And besides, should https://github.com/kedro-org/kedro-viz/issues/1624 be a blocker?

datajoely commented 5 months ago

Can I please volunteer for a user interview on how my teams have approached this?