kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Reusing pipeline elements in a served model scenario #464

Closed turn1a closed 3 years ago

turn1a commented 3 years ago

Dear Kedro crew,

We are working on an extended example using the Titanic dataset that will showcase a reasonably sophisticated, end-to-end Machine Learning project following Software Engineering, Machine Learning and Kedro best practices. We are trying to figure out the best way to serve a model for online inference that not only makes predictions but also performs some preprocessing steps on the data sent to the model.

Here’s some background information regarding our architecture.

We have developed three modular pipelines: data_engineering, feature_engineering and modelling.

The idea is to share all required training nodes of the training pipeline with the prediction pipeline. We are passing a predict=True argument to create_pipeline to indicate which training nodes should be included in or excluded from the prediction pipeline (for example, we leave out fit_imputer but include predict_imputer). We recently realised that a more appropriate way to do this would be to use tags, but we haven't managed to refactor our code yet.
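
The tag-based refactor we have in mind would look roughly like the sketch below (illustrative only; the dataset names are placeholders, not our actual code):

```python
# Rough sketch of selecting prediction nodes via tags instead of a predict=True flag.
from kedro.pipeline import Pipeline, node


def fit_imputer(train_df):          # placeholder training-only step
    ...


def predict_imputer(imputer, df):   # placeholder step shared with prediction
    ...


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(fit_imputer, inputs="train_df", outputs="imputer",
                 tags=["training"]),
            node(predict_imputer, inputs=["imputer", "features"],
                 outputs="imputed_features", tags=["training", "prediction"]),
        ]
    )


# The prediction pipeline is then just the tagged subset of the same pipeline.
prediction_pipeline = create_pipeline().only_nodes_with_tags("prediction")
```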

At the end of the modelling pipeline, we output a model. Our prediction pipeline reuses some training nodes and works very well for batch inference, but we would also like to serve this model for online inference in a Docker container. For this purpose, we are using MLflow's pyfunc model wrapping LightGBM, but the model alone is not enough: the data going into the model first needs to go through the steps in data_engineering and feature_engineering.

This is where we are stuck, and we don't know what the best practice would be. The possibilities we have considered are:

  1. We include our Kedro Python package as a dependency of the MLflow model wrapper, import all required nodes into it as pure functions (not nodes) and recreate the steps of our pipelines. This approach has a considerable drawback: we repeat what has already been specified in our Kedro pipelines. On top of that, we don't have access to the main configuration, which is not included in the Python package, so we would need to extract the datasets and parameters from the context.
  2. Once again, this approach involves including the Kedro Python package, but this time we use the Kedro pipelines and a runner (a rough sketch follows this list). In this way, we don't have to worry about recreating the pipeline, but we still lack the configuration.
  3. In this scenario, we are not using MLflow Models at all; instead, we develop our own API that uses the whole Kedro project (including the data catalog, the configuration and other elements). This could be achieved with a Kedro plugin that provides something akin to a kedro serve command. The reasons for this would be:
    • We have more control over the serving application. Currently, there is no way to write custom endpoints or responses from an MLflow pyfunc model. This would be useful in cases like running different pipelines, using different models, handling validation errors (we are in the middle of developing a kedro-pandera plugin for automatic data validation that returns valuable debugging information) and more.
    • We would be able to provide middlewares/hooks that could log input data and predictions, monitor drift and feed those back into Kedro pipelines.
    • We could handle authentication.

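For illustration, option 2 could look roughly like the sketch below. It is only a sketch: the module path my_project.pipelines.prediction, the dataset names features/predictions and the assumption that the required configuration ships with the package are all placeholders.

```python
# Hypothetical sketch of option 2: an MLflow pyfunc wrapper that runs the
# packaged Kedro prediction pipeline with a plain runner.
import mlflow.pyfunc
from kedro.io import DataCatalog, MemoryDataSet
from kedro.runner import SequentialRunner


class KedroPredictionModel(mlflow.pyfunc.PythonModel):
    """Wraps the packaged Kedro prediction pipeline behind the pyfunc API."""

    def load_context(self, context):
        # Import the prediction pipeline from the packaged Kedro project
        # (module path is hypothetical).
        from my_project.pipelines.prediction import create_pipeline

        self.pipeline = create_pipeline()

    def predict(self, context, model_input):
        # Feed the incoming dataframe into an in-memory catalog and run the
        # prediction pipeline. Any other free inputs (fitted imputer, model,
        # parameters) would have to be added to the catalog here as well,
        # e.g. loaded from MLflow artifacts.
        catalog = DataCatalog({"features": MemoryDataSet(model_input)})
        outputs = SequentialRunner().run(self.pipeline, catalog)
        return outputs["predictions"]  # hypothetical output dataset name
```
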
Considering the third option, we imagined that it could be automated (with a plugin), so that the whole web server could be generated and populated with the project's pipelines, hooks and other components. It could then be packed into a container using kedro-docker and easily deployed. Right now, the third option is only a rough idea; we haven't dived into the details because we wanted to ask for your advice first.
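
To make that idea a bit more concrete, a kedro serve-style endpoint could look roughly like the sketch below. None of this exists in Kedro today; FastAPI, the pipeline name "prediction" and the dataset names raw_features/predictions are only assumptions.

```python
# Hypothetical sketch of a "kedro serve"-style endpoint (option 3); this is
# not an existing Kedro command or plugin.
from pathlib import Path
from typing import List

import pandas as pd
from fastapi import FastAPI
from kedro.framework.project import pipelines
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.io import MemoryDataSet
from kedro.runner import SequentialRunner

PROJECT_PATH = Path.cwd()  # assumes the server is started from the project root
bootstrap_project(PROJECT_PATH)

app = FastAPI()


@app.post("/predict")
def predict(records: List[dict]):
    with KedroSession.create(project_path=PROJECT_PATH) as session:
        # Reuse the project's own catalog and configuration.
        catalog = session.load_context().catalog

        # Replace the raw-input dataset with the request payload
        # (dataset name "raw_features" is an assumption).
        catalog.add("raw_features", MemoryDataSet(pd.DataFrame(records)), replace=True)

        # Run only the registered prediction pipeline; "predictions" is
        # assumed to be a free, in-memory output of that pipeline.
        outputs = SequentialRunner().run(pipelines["prediction"], catalog)
    return outputs["predictions"].to_dict(orient="records")
```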

I hope we just missed something and that there is a kedro way/solution to such a scenario.

Galileo-Galilei commented 3 years ago

Hello @kaemo,

@takikadiri and I are going to meet @yetudada @DmitriiDeriabinQB and @laisbsc this week as part of their feedback program. This very question will come up in the discussion as one of our "hottest" current problems.

Our current solution consists of a mix of approaches 2 and 3:

We are thinking about either:

but both come with huge maintenance costs on our side, and we have decided to stick to our current "encapsulating API process" until it hits its limits and becomes intractable for a given project.

I guess kedro-server and the universal deployer will address some of these concerns, but AFAIK there is no kedro way to serve a pipeline right now. Maybe a member of the kedro team will have more concrete elements on this?

limdauto commented 3 years ago

Hi @kaemo and @Galileo-Galilei, here would be my approach:

Let me know what you think of this approach. cc @yetudada @DmitriiDeriabinQB

DmitriiDeriabinQB commented 3 years ago

I think model scoring naturally falls into that "model deployment" epic, which we did identify but have not yet formalised into anything concrete. Kedro in its current state focuses primarily on batch processing, so online inference (especially inference that depends on some feature engineering steps) is out of its scope.

The approach described in option 3 of the original post makes general sense to me; however, understandably, it requires considerable effort to integrate all the moving parts, since Kedro doesn't have anything like a publicly supported "Kedro Server".

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.