kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Universal Kedro deployment (Part 1) - Separate external and applicative configuration to make Kedro cloud native #770

Open Galileo-Galilei opened 3 years ago

Galileo-Galilei commented 3 years ago

Preamble

Dear Kedro team,

I've been using Kedro since July 2019 (kedro==0.14.3, which is quite different from what Kedro is now), and over the past two years my team has deployed a few dozen machine learning pipelines in production with Kedro. I want to give you some feedback on my Kedro experience along this journey, and on the advantages and drawbacks of Kedro from my team's point of view with the current versions (0.16.x and 0.17.x):

Advantages:

Drawbacks:

This issue is likely the first of a series, and here I will focus specifically on Kedro's configuration management system. To give credit where it is due, the suggestions below come for the vast majority from discussions, trials and errors with @takikadiri while trying to deploy our Kedro projects.

Disclaimer: I may use the words "should" or "must" in the following design document, and write very assertive sentences which reflect my personal opinion. These terms must be understood with regard to the underlying software engineering principles I describe explicitly when needed. My sincere apologies if this offends you; it is by no means an order to take a specific action, and I know you have your own clear vision of where Kedro should head.

Context

Deploying a kedro application

A brief description of the workflow

A common workflow (at least for me, as a dev) is to expose some functionality to an external person (an ops) who will be in charge of creating the orchestration pipeline. A sketch of the workflow is the following:

Deployment constraints to deal with

Note that changing the workflow or asking the ops to modify the Kedro project is not among the possible solutions, since I work in a huge organisation with strictly standardized processes that cannot be modified for my team alone.

Challenges created by Kedro's configuration management implementation

Identifying the missing functionality: overriding configuration at runtime

With regard to the previously described workflow, it should be clear that the ops must be able to inject some configuration at runtime, e.g. some credentials (passwords for database or mlflow connections), some paths to the data, possibly some parameters... This should be done without modifying the yaml config files: the project folder is not even visible to the ops, and we want to avoid the operational risk of them modifying the configuration of a project they know nothing about.

Overview of potential solutions and their associated issues as of kedro==0.17.3

With the current version of kedro, we have two possibilities when packaging our project to make it "executable":

  1. either we do not package the configuration with the code, and we expect our users to recreate the entire configuration and the folder structure by themselves. In my experience, this is really something users struggle with because:
    • the configuration files are often complex (dozens or even hundreds of rows in a catalog.yml seems common)
    • you need to have some knowledge about Kedro and about the business logic of the underlying pipelines to recreate these files, and this is hardly possible without having the code
    • even worse, this is not even something acceptable in the workflow described above.
  2. or we do package the entire configuration with the code (e.g. by moving the conf folder to src/, or by packaging the entire folder, e.g. with a run.sh file at the root to make it "executable-like"). This is roughly what is suggested by @WaylonWalker in #704, and while it is in my opinion better than the previous bullet point, it is not acceptable as is for the following reasons:
    • The major issue with this solution is that you need to redeploy the entire app each time you want to change the configuration (even if the modification is not related to the "business logic" of your pipeline, e.g. the path to your output file). This completely breaks the "build once, deploy everywhere" principle.
    • This also tightly couples the business logic of your pipelines to the environment where the code is deployed, which goes against software engineering best practices IMHO.
    • You will likely need to package some sensitive configuration to make the package work (e.g. credentials like passwords for production database connections), which is definitely a no-go.

In conclusion, both solutions have critical flaws and cannot be considered the correct way to handle configuration management when deploying a Kedro project as a standalone application.

Thoughts and design suggestions for refactoring configuration management

Underlying software engineering principles: decoupling the applicative configuration from the external configuration

All the problems come from the fact that Kedro currently considers all configuration files as identical, while they have different roles:

Refactoring the configuration management

Part 1: Refactor the template to make a clear separation between external and application configuration

I suggest refactoring the project template as follows:

.
|-- .ipython/profile_default/startup
|-- external-conf
|   |-- local  # contains only credentials and globals
|       |-- credentials.yml
|       |-- globals.yml  # This is the ``global.yml`` file of the TemplatedConfigLoader
|   |-- another-env  # optional, the user can add as many envs as needed
|       |-- credentials.yml
|       |-- globals.yml
|-- data
|-- docs
|-- logs
|-- notebooks
|-- src
|   |-- test
|   |-- <python_package>
|   |   |-- pipelines
|   |   |-- hooks.py # modify it to make the TemplatedConfigLoader the default
|   |-- applicative-conf # this is the current "base" environment, renamed and moved to the src/ folder
|   |   |-- catalog.yml
|   |   |-- parameters.yml
|   |   |-- logging.yml
|   |   |-- globals_default.yml # default values for the globals, in case you want to package them alongside the app so that users are not forced to specify them
...

With such a setup, the applicative configuration should be packaged with the project, which will make the pipelines much more portable. Two key components should be updated to match all the constraints: the TemplatedConfigLoader and the run CLI command.

Part 2: Update the ConfigLoader

With this system, the dev would choose and define explicitly what is exposed to the end users, thanks to the TemplatedConfigLoader mechanism, e.g.:

# src/applicative-conf/parameters.yml
number_of_trees: ${NUMBER_OF_TREES} # exposed
alpha_penalty: 0.1 # not exposed

# src/applicative-conf/catalog.yml
my_input_dataset:
  type: pandas.CSVDataSet
  filepath: ${INPUT_PATH}
  credentials: ${INPUT_CREDENTIALS}

# external-conf/local/credentials.yml
INPUT_CREDENTIALS: <MY_VERY_SECURED_PASSWORD>

# src/applicative-conf/globals_default.yml
NUMBER_OF_TREES: 100
INPUT_PATH: data/01_raw/my_dataset.csv
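
To make this concrete, here is a minimal sketch (written against the kedro~=0.17 hook API; the file path and class name are illustrative) of the hooks.py change hinted at in the template above, which makes the TemplatedConfigLoader the default so that the ${...} placeholders are resolved from the globals files:

# src/<python_package>/hooks.py
from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths):
        # Resolve ${...} placeholders in catalog.yml / parameters.yml from any
        # *globals*.yml file found in the configuration paths (both the packaged
        # globals_default.yml and the external-conf globals.yml).
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals*.yml")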

Part 3 (optional): Update the run command

If possible, the run command should explicitly enable dynamically overriding only the variables exposed in the globals. Once packaged, the end user would be able to run the project with either:

The end user cannot modify what is not exposed by the developer through the CLI or env variables (e.g. save args for the CSVDataSet), except if it is exposed in the globals_default.yml file and made dynamic by the developer. Obviously, the user can still recreate a conf/<env-folder>/catalog.yml file to override the configuration, but they should not be forced (nor even encouraged) to do this.
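
In the absence of a dedicated CLI option, here is a rough sketch of how such runtime injection could already work by feeding the globals from environment variables (the variable names are the ones from the example above; this is an illustration of the idea, not an existing Kedro feature):

import os

from kedro.config import TemplatedConfigLoader


def build_config_loader(conf_paths):
    # Values found in the process environment take precedence over the
    # file-based globals (the packaged globals_default.yml and the ops-provided
    # globals.yml), because globals_dict overrides values loaded through
    # globals_pattern.
    runtime_overrides = {
        key: os.environ[key]
        for key in ("NUMBER_OF_TREES", "INPUT_PATH", "INPUT_CREDENTIALS")
        if key in os.environ
    }
    return TemplatedConfigLoader(
        conf_paths,
        globals_pattern="*globals*.yml",
        globals_dict=runtime_overrides,
    )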

Alternative considered

I could create a plugin to implement such changes by creating a custom ProjectContext class, but the suggested template changes, albeit easy to implement, would make it hard to keep up with the numerous evolutions of your template. It would make much more sense to implement at least these template changes in the core library.

@yetudada, sorry to ping directly, but you told me you were working on configuration refactoring. Do such changes make sense in the global picture you have in mind?

mzjp2 commented 3 years ago

(ignore me, just butting in here) to say that this is an amazingly well written issue - one of the best and most thorough I've seen in a long time.

WaylonWalker commented 3 years ago

Still need to digest this completely. One thing I give props to the kedro team for, regarding templates, is the move from 0.16.x to 0.17.x. It was very, very hard to work outside of the standard template in 0.16.x. It would flat-out error and not let you do things a "different" way in some cases.

Composability

0.17.x is MUCH more modular. You can compose your own template quite easily by composing the components of kedro you wish to use. To the point where you can easily create a pipeline, catalog, runner, and cli with very little code in a single script. In fact, I've done it. After working with DAGs for the past few years it feels very slow to work without one now. In some cases where there is a significant project already complete, it may not make sense to completely port to kedro, but rather bring in a bit of kedro as you maintain it.

I treat Everything as a Package

I generally think of everything as a package, something that I can pip install and run from the command line, or, in the case of production, put into a Docker image. I think this workflow/deployment is what has led me to put everything into the package. I think it would be completely logical to find a balance of letting the user override parameters while providing good defaults for all of them inside your package. Again, this is probably my small view into how I work.

idanov commented 3 years ago

@Galileo-Galilei First, really thank you for this well-written issue and great analysis of some of the main challenges we currently face in Kedro. The things you have pointed out are a real problem we are trying to address, and we certainly are aware of those challenges. Your thoughts on that are really helpful since we mainly have access to the perspective of McKinsey and QuantumBlack users and hearing the viewpoint of someone not affiliated with our organisations is super valuable.

I would like to add a few comments and maybe some clarifications on our thinking (or at times mostly my thoughts as Kedro's Tech Lead, since some of those might not have crystallised completely yet to be adopted as the official view of the team).

Deployment / orchestration

A lot of Kedro is inspired by the relevant bits of The Twelve-Factor App methodology in order to aid deployment. Initially Kedro was often mistaken for an orchestrator, but the goal of Kedro has always been to be a framework helping the creation of data science apps which can then be deployed to different orchestrators. However, this view might not have been perfectly reflected in the architecture, due to lack of experience on our side and the user side alike. The most recent changes in the architecture, though, have moved in that direction, as @WaylonWalker pointed out.

In the future we’ll double down on the package deployment mode, e.g. you should be able to run your project as a Kedro package and the only necessary bit would be providing the configuration (currently under conf/). The latest architectural changes from 0.17 and the upcoming 0.18 should allow us to significantly decrease the number of breaking changes for project upgrades. A lot of our work is ensuring that we are backwards compatible, which makes it harder for us to experiment, thus reducing our speed of delivering on the mentioned challenges.

Now for the deployment model, we see a future where our users will structure their pipelines using namespaces (aka modular pipelines). Thus they will form hierarchies of nodes, where the grouping would be semantically significant for them. The top-level pipelines will consist of multiple modular pipelines, joined together into the overall DAG. This way, modular pipelines can be analogous to folders and nodes to files, e.g.

[Image: hierarchy of namespaced modular pipelines, with pipelines shown as folders and nodes as files]

After having your pipeline structured like that, we can provide a uniform deployment plugin where users decide the level at which their nodes will be run in the orchestrator, e.g. imagine something like kedro deploy airflow --level 2, which would make sure that the output configuration runs each node separately but collapses the nodes at level 3 into singular tasks in the orchestrator.
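
To make the namespacing idea concrete, here is a minimal sketch of how such a hierarchy could be declared with the modular pipeline helper (the node functions and dataset names are purely illustrative):

from kedro.pipeline import Pipeline, node
from kedro.pipeline.modular_pipeline import pipeline as modular_pipeline


def clean(raw_data):
    return raw_data  # placeholder processing step


def train(features):
    return "model"  # placeholder training step


preprocessing = Pipeline([node(clean, "raw_data", "features", name="clean")])
training = Pipeline([node(train, "features", "model", name="train")])

# Level 1: the full pipeline; level 2: the "preprocessing" / "training"
# namespaces; level 3: the individual nodes inside each namespace.
full_pipeline = (
    modular_pipeline(preprocessing, namespace="preprocessing",
                     inputs={"raw_data"}, outputs={"features"})
    + modular_pipeline(training, namespace="training",
                       inputs={"features"}, outputs={"model"})
)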

There are some additional subtleties we need to take care of, e.g. running different namespaces on different types of machines (GPU instances, Spark clusters, etc.). But I guess the general idea is clear - the pipeline developer will have much better control over how things get deployed without actually needing to learn another concept or make big nodes. They will just need to make sure that their pipeline is structured in a way that is semantically meaningful for them and for the orchestration, which is already an implicit requirement anyway, and people tend to do that as per your example, but not in a standard way.

Configuration

Logging

This one is supposed not to be needed, since Kedro ships exactly the same defaults. So teams can directly get rid of it, unless they would like to change the logging setup for different platforms, e.g. if you would like to redirect all your logs towards an ElasticSearch cluster, Sumologic or any other log-collecting service out there. This configuration is environment specific (locally you might want colourful logging, but on your orchestrator that would be undesirable), and that's why it's not a good idea to package it with your code.

Credentials

This one is obviously environment specific, but what we should consider doing is adding environment variable support. Unfortunately this has been on the backlog for a while, but it doesn't seem to be such an important issue that it cannot be solved by DevOps, so we never got to implementing environment variables for credentials.

Catalog

This is a way bigger topic, and it is much less clear how to solve it in a clean way, but it is something we have had on our radar for quite some time. We want to come up with a neat solution for this one by the end of 2021, but obviously there are many factors that will come into play and I cannot guarantee we can get it done by then.

History of the problem

In my opinion, this challenge came from the fact that we treat each dataset as a unique type of data, which in turn comes from the fact that we did not foresee that Kedro would enable the creation of huge pipelines on the order of hundreds of nodes with hundreds of datasets. However, now most of our users internally have very big pipelines and a lot of intermediary datasets, which need to be defined in the catalog and not just passed in memory. That created huge configuration files, which a lot of people wanted to simplify. That's why the TemplatedConfigLoader was born out of user demand, and not without some hesitation from our side.

Why the current model is failing

The problem with the TemplatedConfigLoader is that it solves the symptom, but not the real problem. The symptom is the burdensome creation of many catalog entries. The problem is the need for those entries to exist at all. Maybe to clarify here, I will refer to web frameworks like Django or Rails - in all web frameworks, you define only one database connection and then the ORM implicitly maps the objects to that database. In Kedro, each object (i.e. dataset) needs to be configured on its own. Kedro's model is good if you have a lot of heterogeneous data sources (like the case of pipelines fetching data from multiple independent sources). But it quickly dissolves into chaos as you add multiple layers of intermediary datasets, which, if not always, then for the most part, point to the same location and can be entirely derived from the name of the dataset. So the challenge here is that we need to support both per-dataset catalog entries and one configuration entry for hundreds of datasets. Whatever solution we come up with needs to work for both cases and be declarative at the same time.

Why the catalog is configuration and not code

As we are trying to emulate the one-build-multiple-deployments model, it becomes very clear that all catalog entries are entirely environment specific (e.g. with one build you might deploy once to S3 and then the second time to ABS or GCS). So this is definitely configuration that needs to live outside your codebase. However, the current mode of defining every single dataset separately makes this process completely unmaintainable, so people came up with the templated config solution with the globals.yml to factor out only the useful configuration. Some of our internal users went even further: they have all the catalog entries as part of their codebase and treat only the globals.yml as real config.

Parameters

The parameters configuration is an odd one because everyone uses it for different things. E.g. we see many users using it as a way to document the default values of all of their parameters, even when they don't need to change a given parameter. That made the parameter files huge, and now they are very hard to understand without some domain knowledge. Some teams use these files as a way for non-technical users to do experiments on their own. Some teams would love to package their parameters in their code, since they treat it as a single place for all their global variables that they can use across their pipeline.

The main challenge I see for the parameters files is that the way we merge those from base/ to the other environments is by means of a destructive merge. The result of that is that if you have a highly-nested parameter structure and you want to change only one parameter from the default values, you need to define the full tree on the way to the parameter.
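
As a toy illustration of this destructive behaviour (plain Python dicts, not Kedro internals):

base = {"model": {"hyperparams": {"alpha": 0.1, "n_trees": 100}, "target": "y"}}

# Overriding only `alpha` from another environment...
override = {"model": {"hyperparams": {"alpha": 0.5}}}

# ...with a destructive (top-level) merge silently drops n_trees and target:
destructive = {**base, **override}
assert destructive == {"model": {"hyperparams": {"alpha": 0.5}}}


# A non-destructive (recursive) merge would keep the untouched defaults:
def deep_merge(a, b):
    out = dict(a)
    for key, value in b.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out


assert deep_merge(base, override) == {
    "model": {"hyperparams": {"alpha": 0.5, "n_trees": 100}, "target": "y"}
}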

One can argue that there should be a way to have a place in your src/ folder where you can define default parameters, so that users need to provide parameters configuration only when something deviates from the defaults. When we revisit our configuration management, we'll look into solutions for this, as well as into non-destructive parameter overriding (which also has drawbacks).

Summary

I might not have answered any questions here or even given very specific directions on how Kedro will develop in the future, but the reason for that is that we don't have a very clear direction set yet on solving those problems. I hope that I have provided some insight into our understanding of the same problems, and potentially some clarification of why we haven't solved them yet. One thing is sure though: we have this on our roadmap already and its turn is coming soon, e.g. there are only 2 other things in front of it 🙂 Thanks for sharing your view on how we could tackle this, and while we might not implement it as you have suggested, we'll definitely consider drawing some inspiration from it when we design the new solution. One particular detail that I like is getting rid of the base/ environment and making it part of the source code defaults.

datajoely commented 3 years ago

Hi @Galileo-Galilei - I just wanted to say this is a high priority for us and point you towards our community update later this week, sign up here. The event starts at 6:30 PM here in London - see how that works for your timezone here.

noklam commented 3 years ago

Wow, I just discovered this thread after I started a similar one in GitHub Discussions. This issue is a much more in-depth one and I agree with most of it.

I have been wanting to upgrade Kedro, but it is not easy and it seems that 0.18.x will break some things, so I am still waiting for it. @WaylonWalker Could you give an example of how 0.17.x makes it easier?

Galileo-Galilei commented 3 years ago

Hi,

thank you very much to all who went on to discuss the issue at stake here, and especially to @idanov for sharing your vision of Kedro's future. This is extremely valuable to @takikadiri and me for increasing Kedro usage inside our organisation.

First of all, apologies to @datajoely: I was aware of this retrospective, but I was (un?)fortunately on vacation this week with almost no internet connection and I couldn't join it. I had a look at the slides, which are very interesting!

Here are some thoughts / answers / new questions which arise from the above conversation, in no specific order:

On 0.17.x increased modularity and flexibility

Disclaimer: I have not used the 0.17.x versions intensively, apart from a few tests. I compare the features to the 0.16.x ones hereafter.

From my personal experience, here is my list of pros and cons of the 0.17.x features:

My team does not plan to migrate its existing projects because it generates a lot of migration costs (we have dozens of legacy projects + an internal CI/CD to update) and the advantages are not yet sufficient to justify such costs.

@WaylonWalker, you claim that "0.17.x is MUCH more modular". Do you have any real-world example of something which was not straightforward with 0.16.X versions and which is now much easier?

On treating everything as a package

I perfectly agree on this point (and we do the same), but it raises two different points:

On deployment/orchestration

I have seen your progress on the topic, and I acknowledge that only needing a conf/ folder instead of the entire template would be a first step in making deployment easier. I know that I've been complaining about backwards compatibility a few lines above, but I follow the development very closely and I see how much effort you're putting in. Once again, thank you very much for the high-quality library you're open sourcing! I understand that small changes help to ensure overall quality, and I personally feel the delivery frequency is already quite fast.

Regarding the deployment model, you are stealing my thunder: in "Universal Kedro deployment (Part 2)", I plan to address the transition between different pipeline levels in a very similar way :) Kedro definitely needs a way to "factor and expand" the pipelines to provide different view levels. This would be beneficial for a transition to another DAG tool, but also for the frontend (kedro-viz visualisation), which becomes overcrowded very quickly. That said, I would not rely on the template's structure, for several reasons:

I guess a declarative API (e.g. letting Pipelines be composed of Pipelines and nodes instead of nodes only) would make it easier to use, but I have not thought enough about it. Obviously, all the implementation details you raise show that it needs to be worked out carefully and that a lot of difficulties will arise.
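
For reference, Kedro's Pipeline constructor already accepts a mix of nodes and pipelines, but it flattens the result; the suggestion above is rather about preserving the hierarchy so that different view levels can be exposed. A tiny sketch with illustrative names:

from kedro.pipeline import Pipeline, node


def preprocess(raw):
    return raw  # placeholder


def train(features):
    return "model"  # placeholder


preprocessing = Pipeline([node(preprocess, "raw", "features")])

# Valid today, but the nested pipeline is flattened into a flat list of nodes:
full = Pipeline([preprocessing, node(train, "features", "model")])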

On configuration (back to the original topic :))

Logging

Logging is obviously environment specific; I apologize if you thought I implied the opposite. I just meant we need a default behaviour, but if I understand what you are saying, this is already the case.

Credentials

I do not understand what you mean by "[it] doesn't seem to be such an important issue that cannot be solved by DevOps". My point is precisely that many CI/CD tools expect to communicate with the underlying application through environment variables (to my knowledge: I must confess that I am far from being a devops expert), and it is really weird to me that this is not "native" in Kedro. I must switch to the TemplatedConfigLoader in deployment mode even if I use a credentials.yml file while developing, and it feels uncomfortable to have to change something for deployment (even if it is very easy to change).

Whatever the problem is, it should at a minimum be better documented than it is now, given that some beginners ask this question on various threads, with a few ugly solutions (e.g. https://discourse.kedro.community/t/load-credentials-in-docker-image-using-env-vars/480, #49). The best reference I can find is in issue #403.
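
For illustration, one workaround imaginable today (a sketch, not an official Kedro feature; the KEDRO_CRED_ prefix is an arbitrary convention invented for this example) would be to subclass the plain ConfigLoader so that environment variables are merged into the loaded credentials, without switching to the TemplatedConfigLoader:

import os

from kedro.config import ConfigLoader


class EnvCredentialsConfigLoader(ConfigLoader):
    """Plain ConfigLoader that lets environment variables override credentials."""

    def get(self, *patterns):
        conf = super().get(*patterns)
        if any("credentials" in pattern for pattern in patterns):
            # e.g. KEDRO_CRED_DB_PASSWORD=xxx becomes {"db_password": "xxx"}
            for key, value in os.environ.items():
                if key.startswith("KEDRO_CRED_"):
                    conf[key[len("KEDRO_CRED_"):].lower()] = value
        return conf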

Catalog

First, I agree that it is a big topic, and unlike most of the others, I don't have a clear vision (yet?) of how it should be refactored. Some unsorted thoughts:

but in my opinion the root of all evil comes from this commit c466c8aa36488c8fa8d1fe6bb3d5bcde81100e4a, when the catalog.yml became "code" and no longer configuration, with the ability to dynamically create entries. I strongly advocated against it in my team, even if I understood why some users needed it.

In all web frameworks, you define only one database connection and then the ORM implicitly maps the objects to that database

and I cannot agree more. However, given the "debugging" use of the catalog, I totally agree that you should support both ways (per-dataset configuration and one configuration for several datasets) of defining catalog entries.

Parameters

We encountered almost all the use cases described here (overriding only a nested key, providing a way for a non-technical user to experiment, packaging the parameters) in different projects. The size of the parameters.yml has become a huge problem for us: like the catalog, it often contains "dead config" people do not dare to remove, and it quickly becomes overcrowded: it is hard to maintain and to read.

Being able to override a nested parameter structure with a syntax like hyperparams.params_a would indeed be user friendly.

As you suggest (and as I described in my original post), my team uses this file to define default values, and the only really "moving" parameters are injected via the globals.yml through the TemplatedConfigLoader.

On your summary

I might not have answered any questions here or even given very specific directions on how Kedro will develop in the future, but the reason for that is that we don’t have very clear direction set yet on solving those problems. I hope that I have provided some insight into our understanding of the same problems and potentially clarifications why we haven’t solved them yet.

Sharing your vision on this is definitely valuable. I guess it will take a number of iterations to tackle the problem completely and reach an entirely satisfying configuration management system, but some of the ideas discussed in this thread (moving conf/base to src/, enabling non-destructive merge...) will likely improve the user experience greatly and quite easily.

One thing is sure though, we have this on our roadmap already and its turn is coming soon, e.g. there’s only 2 other things in front of it 🙂

I am aware of experiment tracking, I wonder what the other one is ;)

Thanks for sharing your view on how we could tackle that and while we might not implement it as you have suggested, we'll definitely consider drawing some inspiration from it when we design the new solution.

I only care about the implemented features, not the implementation details. The goal of this thread is more to see whether the problem was shared by other teams, and to discuss the pros and cons of the different suggestions.

One particular detail that I like is getting rid of the base/ environment and making it as part of the source code defaults.

There seems to be quite a consensus in this thread that if we want to reduce the feature request to its core component, this would be the one thing to implement.

astrojuanlu commented 2 months ago

It's almost the 3 year anniversary of this issue 🎂🎈

I'm watching @ankatiyar's Tech Design session on kedro-airflow (https://github.com/kedro-org/kedro-plugins/issues/25#issuecomment-2107299300), which points to https://github.com/kedro-org/kedro-plugins/issues/672 as one of the issues, and it got me thinking about this.

I'd like to know what folks think about it in the current state of things. I don't want to drop a wall of text so here's my best attempt at summarising my thoughts:

ds:
    type: spark.SparkDataset  # Your code will break if you change this to pandas.CSVDataset
    filepath: ...  # Your code is completely independent from where the data lives, the dataset takes care of it

and in fact @Galileo-Galilei hinted at this when he wrote this proposal:

# src/applicative-conf/catalog.yml

my_input_dataset:
  type: pandas.CSVDataSet
  filepath: ${INPUT_PATH}

on the one hand, the catalog.yml and the parameters.yml are project specific (they contain the business logic) and we do not expect our users to modify them, except maybe some very small and specific parts that the dev must choose and control.

It's fuzzy because during development users should be able to freely explore different configurations for these (see also #1606), but then in production these parameters become "fossilized" and tied to the business logic.


With the experience we've gained in the past 3 years, the improvements in Kedro (namespace pipelines became a reality, OmegaConfigLoader replaced the old TemplatedConfigLoader) and the direction we have (credentials as resolver, less coupling with logging), what's your fresh view on what you're missing from Kedro in this area? What prevents users from shipping YAML files with their code in the way they see fit?
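
(For readers catching up, a minimal sketch of what "credentials as resolver" can look like with the OmegaConfigLoader, assuming a recent Kedro version; the entry and variable names are illustrative.)

# settings.py
from kedro.config import OmegaConfigLoader

CONFIG_LOADER_CLASS = OmegaConfigLoader

# conf/local/credentials.yml (illustrative) - the value is read from the
# environment at load time via OmegaConf's oc.env resolver, so nothing
# sensitive needs to be packaged or committed:
#
#   db_credentials:
#     con: ${oc.env:DB_CONNECTION_STRING}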

Tagging @lrodriguezlujan and @inigohidalgo because we've spoken about these recently as well.

Galileo-Galilei commented 1 month ago

Hi, this is a very valid question that needs to be answered. We've accomplished a lot, and this needs to be reassessed. I have created a demo repository to implement what is suggested above and to evaluate how easy it is to configure with recent versions, and what still needs to be improved. I'll report my conclusions here when I am ready.