kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.53k stars, 877 forks

[KED-1350] Parameter Tracking #195

Closed uwaisiqbal closed 3 years ago

uwaisiqbal commented 4 years ago

Hey,

Description

I've been using Kedro on a project and had a requirement that I couldn't figure out how to solve. Maybe I'm missing something really straightforward. I want to version my parameters along with my pipelines and data catalog entries, so that I can tie a run and its outputs to a specific parameter configuration. Essentially, I want to use Kedro to track my parameters and pipelines so I can run experiments with different parameters and track the results.

I think this would be really useful for people who use Kedro for ML and experimental workflows. Input configuration parameters could be tracked, and so could output values such as metrics.

Possible Implementation

I couldn't find a way to do this, but I think it would be a really good addition to the library. A potential solution I can see is to version the parameters and include them in the Journal, so that a specific parameter configuration is tied to a pipeline run and the versioned datasets.

An alternative is to use another library for parameter tracking, like MLflow, but I'm not sure how nicely it would fit into Kedro's existing ecosystem.
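To make the Journal-style idea above concrete, here is a minimal sketch in plain Python of tying a run ID to the exact parameters used. It does not use any Kedro API: `params_fingerprint`, `record_run`, the record layout, and the `journal` directory are all hypothetical names chosen for illustration, and the run ID is assumed to be safe to use as a file name.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict


def params_fingerprint(params: Dict[str, Any]) -> str:
    """Return a short, stable hash of a parameter dict (key order is ignored)."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def record_run(params: Dict[str, Any], run_id: str, log_dir: Path) -> Path:
    """Write one JSON record that ties `run_id` to the parameters it ran with."""
    log_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "run_id": run_id,
        "params_hash": params_fingerprint(params),
        "params": params,
    }
    path = log_dir / f"{run_id}.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path
```

Two runs with the same configuration then share a `params_hash`, which makes it easy to group versioned outputs by the parameters that produced them.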

Just wanted to say that Kedro is amazing, and thank you for the hard work!

921kiyo commented 4 years ago

Hi @uwaisiqbal, thank you for the valuable feedback!

Regarding the potential extension of the Journal to track parameter configuration, we will discuss this internally to see how we can support parameter tracking.

Regarding Kedro and MLflow, there is a Medium article on how to use them together which might be helpful: https://medium.com/@QuantumBlack/deploying-and-versioning-data-pipelines-at-scale-942b1d81b5f5

jfogelberg commented 4 years ago

Is it possible to access the journal while running the pipeline? I would like to grab the journal, either inside a node or by feeding it in as an input, so that I can log it to MLflow, a SQL database, etc.

This way I can also make sure that the Kedro run_id is the same as the MLflow run_name and the ID used for tagging the result in SQL.

921kiyo commented 4 years ago

Hi @jfogelberg, the integration with MLflow is quite easy now with our latest feature, called hooks (currently on the develop branch; you can find more information at https://github.com/quantumblacklabs/kedro/blob/develop/docs/source/04_user_guide/15_hooks.md). You can also find example code at https://github.com/quantumblacklabs/kedro-examples/blob/master/kedro-hooks-tutorial/src/kedro_hooks_tutorial/run.py

Please let us know if you have any feedback on this :)
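For readers who want a rough picture of the hooks approach, here is a minimal sketch (not the code from the linked tutorial). It assumes the `before_pipeline_run` and `after_pipeline_run` hook specs, that `run_params` carries a `run_id` key (the key name may differ between Kedro versions), and that the full parameter dict can be loaded from the catalog under the name `parameters`. `ParameterTrackingHooks`, `flatten_params`, and the injectable `tracker` argument are hypothetical names for illustration; by default the tracker is the `mlflow` module itself.

```python
from typing import Any, Dict

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch can be imported without Kedro installed.
    def hook_impl(func):
        return func


def flatten_params(params: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


class ParameterTrackingHooks:
    """Hypothetical hooks that log the run's parameters under the Kedro run ID."""

    def __init__(self, tracker: Any = None) -> None:
        if tracker is None:
            import mlflow  # imported lazily; only needed when no tracker is injected

            tracker = mlflow
        self._tracker = tracker

    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any], catalog: Any) -> None:
        # Reuse the Kedro run ID as the MLflow run name so the two can be matched.
        self._tracker.start_run(run_name=run_params.get("run_id"))
        # Assumption: the catalog exposes all parameters as the "parameters" dataset.
        self._tracker.log_params(flatten_params(catalog.load("parameters")))

    @hook_impl
    def after_pipeline_run(self) -> None:
        self._tracker.end_run()
```

How the class is registered depends on the Kedro version (for example, a `hooks` attribute on the project context in early releases, or `HOOKS` in `settings.py` later), so check the hooks documentation for the version you are on. A fuller version would probably also end the run when the pipeline fails.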

neomatrix369 commented 4 years ago

With reference to this issue, I'm also happy to introduce experiment/parameter tracking to Kedro via the kedro-plugin route, if there is interest in having such functionality in Kedro.

Please do let me know so I can start looking into it. I already have it working in a separate project and don't think it should be hard to get it into Kedro.

Galileo-Galilei commented 4 years ago

Hello, a bit of self-advertisement here: I released a first version of kedro-mlflow, a Kedro plugin which enables parameter versioning (and much more!). Feel free to try it out and share your feedback.

yetudada commented 3 years ago

Hi everyone, I hope that you're well! We're going to be releasing a few changes to Kedro that remove the need for this: 1) The Journal is disappearing and being reborn as the Kedro Session, a way to track runtime metadata 2) We're also going to be open-sourcing a Kedro-plugin that tracks everything related to your data science workflow, including parameters - I'll close this ticket when it's out.

In the meantime, please use MLflow as the solution for this.

921kiyo commented 3 years ago

Closing this issue, but feel free to comment on it if anyone has any questions.