kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Use mlflow for better versioning and collaboration #113

Closed Galileo-Galilei closed 3 years ago

Galileo-Galilei commented 4 years ago

TL;DR: The plugin is in active development here and is available on PyPI. It already works reliably with kedro>=0.16.0, but it differs slightly from (and is much more complete than) what is described in the issue below. Feel free to try it out and give feedback. The plugin enforces Kedro design principles when integrating mlflow (strict separation of I/O vs compute, external configuration, data abstraction, CLI wrappers...) to avoid breaking the Kedro experience when using mlflow, and to facilitate versioning and model serving.


A huge thanks for the framework, which is really useful. My team decided to use it for most of its projects, especially to ensure collaboration; data abstraction is a really important feature. However, we have a major disagreement about how data versioning is implemented in kedro, so we decided to move on and develop our own versioning layer on top of your framework.

I'd be glad to discuss some of the architecture / design choices with the kedro developers, and that is the goal of this issue.

Context

Versioning in machine learning is very specific: you want to version a run, i.e. the execution of code on data with parameters. Versioning data alone is likely to be useless for future reproducibility.

Databricks recently released mlflow, which is intended to address this very goal. I think it would be beneficial for kedro to build on top of what mlflow has already created in order to:

Description

The current internal versioning method in kedro does not intend to version a full "run" (code + data + parameters), which makes it less useful for machine learning. Switching to mlflow for this would be a quick win for the framework.

Possible Implementation

My team has implemented several features:

  1. Implement a configuration file for mlflow (an mlflow.yml file in the conf/base folder), added to the template, which enables parameterizing all mlflow features (autologging parameters, tracking URI, the experiment where the run should be stored...) through a conf file. This is really useful since we use a "local" mlflow server where each data scientist can experiment, and a shared one with shareable models and runs, and it is nice to parameterize this through a config file.
  2. Create an MlflowDataset class (similar to the AbstractVersionedDataset class) which enables deciding whether a dataset should be logged as an mlflow artifact (i.e. the versioned parameter in catalog.yml is replaced by a use_mlflow: true that you can pass to any dataset; this automatically logs the dataset as an mlflow artifact). As a best practice, we consider that we should version only datasets that are fitted on data (e.g. encoders, binarizers, machine learning models...).
  3. Each time run_node is called, the parameters that are used in the node are logged as mlflow parameters (through mlflow.log_params). This is customizable in the mlflow.yml conf file.
  4. Implement a CLI command kedro pull --run-id MLFLOW_RUN_ID that retrieves data from an mlflow run and copies it into your data folder. This is really convenient for sharing a run with coworkers (especially since we can also retrieve the commit SHA from mlflow to get the exact same code). This pull command also pulls parameters and writes them to an mlflow_parameters.yml. It warns you about conflicts (parameters that exist both in your local conf and in the mlflow run you've just pulled) and lets you select by hand which one you want to keep. (To make kedro pull work, we also decided to log some configuration files as artifacts, including the catalog and the parameters, when using kedro run, but this is purely technical.)
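To make point 1 concrete, a hypothetical mlflow.yml might look something like the sketch below. Every key and value here is invented for illustration; the actual layout depends on the implementation:

```yaml
# conf/base/mlflow.yml -- hypothetical layout, keys invented for illustration
server:
  # tracking URI: a shared server for the team, or a local ./mlruns folder
  mlflow_tracking_uri: http://shared-mlflow-server:5000
tracking:
  experiment:
    name: my_kedro_project
  params:
    # see point 3: whether node parameters are logged via mlflow.log_params
    log_node_params: true
```

The appeal is that switching from the "local" experimentation server to the shared one becomes a one-line config change rather than a code change.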

General thoughts about the feature

I would like to hear the kedro developers' thoughts:

I can understand that the developers want kedro to be "self-contained" and not rely on a third-party application. However, I think it is definitely not a good idea to reinvent the wheel. Besides, such a change would not be harmful for kedro users:

  1. If they don't want to version their datasets, it does not change anything.
  2. If they don't want to create an mlflow server, you can just add an "mlruns" folder in the kedro project that will gather the data versioned in mlflow (mlflow can store data locally, even without going through an mlflow server). AFAIK, this is really similar to what is currently done with kedro versioning.

I think this is a good way to get the "best of both worlds" (mlflow offers configuration through an MLproject file which overlaps with kedro's and is less flexible, AFAIK, so I'd rather stick to kedro for this).

yetudada commented 4 years ago

Hi @Galileo-Galilei! We're so glad to hear that you've found Kedro useful and I think it's fantastic that you're building on top of it.

Let me see if I can address some of the thoughts that you've raised:

  1. We agree that data versioning, when it's not linked to code, isn't useful. User feedback prompted us to extend data versioning and we're releasing the Journal, which links data and code, in our next release.
  2. We really like how you've approached building Mlflow functionality into Kedro. We actually think this Mlflow functionality could be extended into a kedro-mlflow plugin. For instance, you could enable the functionality in 3. as a node decorator.
  3. In a sense you're right that we would like Kedro to be self-contained, but we have tried to make it extendable to cover cases like this. Please do let us know if you'll need help leading something like kedro-mlflow. I'll tag @tolomea here, who created the plugin framework and kedro-airflow. He also works a lot with Mlflow because of our internal tool, PerformanceAI, which is built on top of Mlflow.
  4. And lastly, where are you from? We think your work is great!

Galileo-Galilei commented 4 years ago

Hello @yetudada, many thanks for the reply. I was quite busy at work recently but I will definitely try to make a kedro-mlflow plugin by the end of the year.

Some comments about the different points you answered:

  1. I've seen the new Journal feature in the development branch, but we definitely want to stick to mlflow for versioning because we also use it to serve the model and to monitor the app.

  2. a. Actually, creating a plugin was our first idea, but since we made many modifications to the package to address other specific concerns (especially integration with our internal CI/CD), it was quicker to modify the package directly. I will try to develop a kedro-mlflow plugin by the end of the year if I have enough time to do so. b. Functionality 3. cannot be implemented as a node decorator (this is what we tried in our first sprint). Indeed, there are three things to map for a variable: its name in the catalog, the name of the function argument it is passed through, and its value. In the following code snippet, we need access to the inputs dictionary,

https://github.com/quantumblacklabs/kedro/blob/e332e2e63f89621da01507b1c1de4c9d644f3ee3/kedro/runner/runner.py#L169-L184

but when using a decorator I can only access the kwargs, which no longer contain the names from the DataCatalog (which are the ones I want to log): https://github.com/quantumblacklabs/kedro/blob/847aa0f4d419a167608b4fb675ea347c7a617bcf/kedro/pipeline/node.py#L460-L461

I do not see how I can log the inputs without access to run_node (but I am open to any less hacky solution).
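The mismatch can be illustrated with plain Python. Everything below is hypothetical (the node function, the catalog names, and the decorator are invented for illustration); it only shows why the catalog names are unrecoverable from inside a decorator:

```python
# A node function: its *argument* names need not match the catalog entry names.
def train_model(features, target):
    return {"coef": 1.0}

# What the runner sees at the run_node level: catalog names -> loaded data.
inputs = {"master_table": [[1, 2]], "labels": [0]}  # these are the names we want to log

seen_arg_names = []

def logging_decorator(func):
    # A node decorator only ever sees the function's kwargs:
    # {"features": ..., "target": ...} -- the catalog names are already gone,
    # so "master_table" / "labels" cannot be logged from here.
    def wrapper(**kwargs):
        seen_arg_names.extend(sorted(kwargs))
        return func(**kwargs)
    return wrapper

# The runner remaps catalog names to argument names *before* calling the node:
bound = {"features": inputs["master_table"], "target": inputs["labels"]}
result = logging_decorator(train_model)(**bound)
```

After the call, `seen_arg_names` contains only `features` and `target`, which is exactly the limitation described above: the mapping back to catalog names exists only at the runner level.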

  3. Thanks for the support! I'll ask @tolomea when I have a first version of this plugin, to discuss architecture concerns.

  4. I work in a huge bank, but the opinions I express here cannot be considered those of my employer :-) Using kedro is internal to my team (and a few others AFAIK) and is far from being an official standard.

  5. Btw, I've seen that most of the features you released in 0.15.2 are very consistent with this discussion (the kedro run --load-version option is very similar to the kedro mlflow pull --run-id RUN_ID command described above, and the ability to create modular pipelines is very useful for creating a custom "mlflow flavor" for prediction, which is very hacky in our current implementation).

iver56 commented 4 years ago

Interesting! I really agree with you guys that MLflow is a natural extension of Kedro. At MFlux.ai, we have made a tutorial that shows a simple example of how to use a combination of Kedro and MLflow in one project: https://www.mflux.ai/tutorials/ml-pipeline/

yetudada commented 4 years ago

Hi @Galileo-Galilei! I hope that you're well. Let me work my way through your comments.

  1. We have the Journal out! Think of it as a diary for your pipeline runs. You can check out how it works here. Additional features we're thinking about supporting in the Journal include automatic reproduction of a previous pipeline run, but we'll await feedback before pursuing this.

2a. This makes sense. Let us know if you need help with kedro-mlflow. 2b. I see what the issue was; I'm going to put this on the backlog to discuss with the team.

  3. Well, you're well on your way to helping us improve it. Thanks so much for raising this issue!

  4. Check out kedro run --load-version and please do let us know your thoughts on it. You can also now pick up the modular pipeline structure; check out the documentation for it here.

  5. We're getting ready to publish an example workflow using Kedro + MLflow on our Medium page. Would you be okay if I referenced how you thought of the two working together in the piece?

yetudada commented 4 years ago

@iver56 This is great! What was your experience like using both tools?

iver56 commented 4 years ago

@yetudada I'm glad you liked it!

Adding mlflow to a kedro project felt like just adding some mlflow function calls here and there. I feel like mlflow fits in most places without a need to rewrite whatever code is already lying around. Kedro is more opinionated - it's more like a starting template for data science projects. If I want to start using Kedro in an existing code base, it has significant implications - I feel like I have to rewrite/refactor code and do many things in the Kedro way, which can sometimes feel like a hindrance. But on the other hand, I guess doing things in the Kedro way gives the project a common structure that looks familiar to other people who know Kedro. That is an obvious benefit in medium-sized to large companies that use Kedro and have data engineers and data scientists that come and go.

Galileo-Galilei commented 4 years ago

Hello @yetudada, some news about our progress:

  1. My team is about to put its first kedro-based projects into production. We have trained almost all the people, and team leads / management agreed to standardize on kedro for as many projects as we can. We will definitely try the Journal and compare its abilities with our own solution (and keep whatever suits our use best).

  2. & 3. I maintain that I will likely make a 0.1 release of the plugin by the end of the year. So far, some functionalities are easy to implement with slight modifications of your package but more "hacky" through a plugin, and we are thinking about the best ways to implement them. For instance, enabling logging of any existing kedro AbstractDataset is easy if we make the abstract dataset inherit from a metaclass which automagically "wraps" the save methods of all child Dataset classes. Creating an MlflowDataset requires more coding and is a bit redundant. It would be nicer if we could pass this to a dataset as a parameter in the DataCatalog (we can imagine many ways to do that; one easy way would be enabling decoration of the save and load methods of Datasets through a dedicated parameter in the catalog.yml conf file. I might open a feature request, but I will first try to define precisely what is needed and figure out whether it is really a good idea).

  4. I will try it in the next few weeks.

  5. Sure, I'm glad you find it worth sharing!
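The "wrap every child Dataset's save method" idea mentioned in the list above could be sketched in plain Python. Everything here is hypothetical (the dataset classes, the artifact store stand-in, and the wiring are invented for illustration; `__init_subclass__` plays the role of the metaclass, and the real implementation would call mlflow.log_artifact instead of appending to a list):

```python
logged_artifacts = []  # stand-in for mlflow's artifact store


class AutoLoggingDataset:
    # __init_subclass__ plays the role of the metaclass described above:
    # it wraps the `save` method of every child dataset class on definition.
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        original_save = cls.save

        def save(self, data):
            result = original_save(self, data)
            # In real life this would be mlflow.log_artifact(...)
            logged_artifacts.append((cls.__name__, data))
            return result

        cls.save = save


class PickleDataset(AutoLoggingDataset):
    # A toy child dataset; gets its save() wrapped automatically.
    def __init__(self):
        self.store = None

    def save(self, data):
        self.store = data  # pretend we persist to disk here


ds = PickleDataset()
ds.save({"model": "fitted"})
```

As noted in the thread, the drawback of this approach is that the wrapping is invisible at the call site, which is precisely the "hides the behaviour from the user" concern raised later in the discussion.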

@iver56 You're definitely right: the first reason that made us lean towards kedro is that it makes collaboration much easier (you can actually show your pipeline with kedro viz to anyone, and it represents actual code, not theoretical specifications; plus the template is similar for all projects). In a sense you are right that mlflow is much easier to add to existing code, but this is also what we do not like for production purposes: it adds unrelated elements to the functions and makes the code messier. We want to avoid this as much as possible for maintenance. I guess it is always the same flexibility/maintainability dilemma.

yetudada commented 4 years ago


@iver56 That's a really great perspective. You're so correct with one of the reasons why Kedro exists; the point of creating maintainable code bases when teams change in large organisations. What changes did you have to make to your workflow to work in the Kedro way?

yetudada commented 4 years ago

@Galileo-Galilei, you're such a rockstar 🚀 Well done on deploying your Kedro pipelines and getting everyone up to speed on Kedro! This makes us so happy to read!

  1. Have you managed to checkout the Journal? Any thoughts on it? We're totally open to feedback on it.

  2. You could explore creating a contrib transformer for this and submitting a pull request? Transformers intercept the load and save operations on DataSets.

  3. We published the article on the QuantumBlack Medium and gave the shoutout to you in the conclusion: https://medium.com/@QuantumBlack/deploying-and-versioning-data-pipelines-at-scale-942b1d81b5f5

yetudada commented 4 years ago

@Galileo-Galilei I should let you know that @limdauto is working on a way to extend Kedro using hooks as part of #219 and has indicated that it's extremely easy to create the MLflow plugin with this system. Are you still using your customisations?

Galileo-Galilei commented 4 years ago

Hello @yetudada, sorry for not coming back here for a while, I was quite busy at work.

Some news and feedback :

  1. The Journal is quite an interesting feature (it makes runs more reproducible than before, with detailed information), but I find it (this is a personal feeling, no offense intended) almost useless without a user interface to browse the different runs and find the one I want to keep / reuse. Mlflow offers this user interface, and that is why my team decided to stick with it. Besides, mlflow enables logging metrics / artifacts with the run, which makes runs much more "searchable" (you can easily filter / retrieve a run with specific features, which does not seem easy with the Journal).

  2. A contrib transformer may be the most "kedro-compatible" way to do it, but it forces the user to modify their ProjectContext to decide which elements must be logged in mlflow. We do not want this because it creates a lot of configuration back and forth between the ProjectContext in the run.py file and the catalog.yml, which is not very user-friendly and very likely error-prone. The solution we came up with is to wrap the Datasets' save methods on the fly (which enables configuring only in the catalog.yml), but it is quite a dangerous solution because it hides the behaviour from the user. We haven't decided on the best solution yet.

  3. Many thanks, I've read the article and I found it very interesting. Thanks for the credit!

  4. We still use our customisations extensively. I've read the discussion in #219 and here are some thoughts:

    Pros for keeping most of the logic in ProjectContext:

    a. It enables handling very specific situations at the project level, which are not intended to be generic. b. I personally find the class very easy to extend.

    Cons for keeping most of the logic in ProjectContext:

    a. Currently, I extend the context by inheriting from KedroContext and then making the project context inherit from my custom class. One major drawback of this approach is that it is difficult to compose two different logics even if they do not interfere with each other. Example: imagine that I have created some specific logic for mlflow:

    ```python
    # mlflow_context.py
    class MlflowContext(KedroContext):
        # my mlflow logic here
        ...
    ```

    ```python
    # run.py of my project
    from mlflow_context import MlflowContext

    class ProjectContext(MlflowContext):
        # my project logic here
        ...
    ```

    Everything is fine. Imagine now that I also have some Spark logic:

    ```python
    # spark_context.py
    class SparkContext(KedroContext):
        # my spark logic here
        ...
    ```

    I cannot (easily) inherit from both `mlflow_context` and `spark_context`. The solution I often use is to define a priority and make either MlflowContext or SparkContext inherit from the other, but it is not very satisfying.
    b. Currently, some *template-related* methods (e.g. the ones that need to know the name of the package in the template, like https://github.com/quantumblacklabs/kedro/blob/57cf26a4ae9f11e942cd630dbb4dda71e1edf034/kedro/template/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/run.py#L47-L48 which needs to import `create_pipelines` based on the template name) must be written in `ProjectContext` and not in a parent class, which is not very user-friendly and makes portability more difficult.

**Conclusion: I have never used `pluggy` before and I may be wrong, but I had a quick glance at the documentation and it seems able to overcome these shortcomings, which is IMHO another step forward in the right direction.** I'd be glad to see what's coming next in kedro. Once again, I think that plugin management in kedro is already fantastic and enables customising it both easily and deeply. Thanks for the amazing job!

yetudada commented 4 years ago

Hey @Galileo-Galilei! I have so many cool things to tell you!

  1. We're going to be looking at the Journal after we complete these roadmap items. Discoverability of runs is definitely one of its challenges. We've also seen instances of users trying to log the Journal itself to MLflow, which is interesting because we need to evaluate its role in run reproducibility.

  2. & 4. Here's the cool news! In the develop branch of Kedro you will find a new feature called Hooks. @limdauto describes it in the following way: "We provide a composition interface, enabling users to combine multiple sets of hooks, e.g. MLflow, Spark, etc. You will no longer be limited by the existing inheritance interface by only being able to inherit one set of additional behaviours at a time."

We actually do have an MLflow example ready for you to try:

Let us know if you want a crash-course demo, and feel like spending time with the team.
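The composition idea @limdauto describes can be illustrated with a plain-Python sketch. This is hypothetical: real Kedro Hooks are based on pluggy and the `@hook_impl` decorator, and the class and method names below are invented purely to contrast composition with the single-inheritance problem discussed earlier:

```python
# Hypothetical sketch: composing independent behaviours instead of inheriting them.

class MlflowHooks:
    def before_node_run(self, node_name, inputs, log):
        log.append(f"mlflow: logging params for {node_name}")


class SparkHooks:
    def before_node_run(self, node_name, inputs, log):
        log.append(f"spark: checking session before {node_name}")


def run_node(node_name, inputs, hooks, log):
    # Every registered hook set is invoked in turn -- MlflowHooks and
    # SparkHooks never need to know about (or inherit from) each other.
    for hook in hooks:
        hook.before_node_run(node_name, inputs, log)
    log.append(f"running {node_name}")


log = []
run_node("train_model", {"x": 1}, hooks=[MlflowHooks(), SparkHooks()], log=log)
```

This is exactly what the inheritance-based `MlflowContext` / `SparkContext` approach could not do cleanly: here, combining the two behaviours is just a matter of listing both hook objects.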

yetudada commented 4 years ago

And one more thing @Galileo-Galilei, for 4b. Could you provide an example of what you're trying to do? We're working on the modification part of Framework Redesign indicated in #219 and it would be great to understand if this problem fits there.

Galileo-Galilei commented 4 years ago

Hello @yetudada,

First, I have some very good news: I released a first version of kedro-mlflow. It will make our discussions more efficient as I can show you the code. For now, the package is poorly documented / tested and lacks some functionalities, but I will update it in the following months. Note that it is based on kedro's develop branch and uses hooks; it is not compatible with any of the current official releases. I will try to make an architecture schema ASAP to explain more easily what it does.

Regarding your questions:

  1. That sounds good. I barely used it (I prefer my own kedro-mlflow version for my use cases) but I'm interested to see what you come up with.
  2. The Hook system is very nice; I integrated it effortlessly as soon as you released it.

Regarding 4.b, this was more a general thought on how the design should (IMHO) separate the template from the framework. Basically, I think that some information should move from the `ProjectContext` to the `.kedro.yml` file, because you may want to access it without loading the context. I'll write a detailed answer one day (likely in a new issue), but I have no time right now and it needs to be thought through thoroughly (I don't yet have all the implications of the changes I would suggest in mind).

lorenabalan commented 3 years ago

Given the kedro-mlflow plugin is now on PyPI, are we good to close this issue? 🙂

Galileo-Galilei commented 3 years ago

Yes, sure. Opening a PR to add the plugin to the list of community-developed plugins is somewhere on my todo list; I'll try to do it in the near future!

PS: This is not directly linked to this issue, but the last comment about moving all the information (project name, kedro version, where to register the configloader / the pipelines...) from the context to either the kedro.yml or another file (hooks.py) is totally in line with your current release kedro==0.16.5; this is exactly what I expected 👌

yetudada commented 3 years ago

We've seen the continued development on the kedro-mlflow plugin, so we'll close this ticket for now. Well done @Galileo-Galilei 🎉

If you need a slimmed-down alternative, check out how to integrate MLflow using Hooks in the Kedro documentation.