kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Add mlruns to gitignore to avoid pushing mlflow local runs to github #3765

Closed Galileo-Galilei closed 3 months ago

Galileo-Galilei commented 3 months ago

Description

I've noticed an increasing number of beginners facing issues when using kedro and kedro-mlflow with the following workflow (see https://github.com/Galileo-Galilei/kedro-mlflow/discussions/456):

This all boils down to a misunderstanding of how mlflow and git work. That said, they are complex tools, and beginners are often introduced to the whole kedro / mlflow / git toolchain at once, so it is understandable that they do not see why they should not push data (their mlflow runs) to a remote forge. Besides, kedro-mlflow (and mlflow itself) do a lot of things under the hood, so some people do not even notice that a folder is created.

I suggest we ignore mlruns/* by default in kedro's template to avoid pushing data remotely, because that's the right thing to do (even if people use mlflow without kedro-mlflow) and because it is hard to fix in kedro-mlflow (see https://github.com/Galileo-Galilei/kedro-mlflow/issues/523). I think a beginner will think twice before removing something from .gitignore if they don't really understand what is at stake.

Note: The change should also be done in starters.

Development notes

Add mlruns/* to .gitignore to avoid pushing mlruns and all its subfolders in case it is created by kedro-mlflow. We may document why this folder should not be pushed, but that is likely to be done in kedro-mlflow or mlflow itself.
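
For illustration, the change described above amounts to an entry like the following in the project template's .gitignore. This is a minimal sketch: the comment wording and where the entry sits within the file are assumptions, and only the `mlruns/*` pattern comes from this description.

```gitignore
# Local MLflow tracking data (created by mlflow or kedro-mlflow);
# these are run artifacts, not source code, so keep them out of git.
# NOTE: comment text and position in the file are illustrative assumptions.
mlruns/*
```

An ignore rule only affects untracked files, so a project that has already committed an mlruns folder would still need to stop tracking it separately.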

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Galileo-Galilei commented 3 months ago

There are some weird doc compilation errors that are likely unrelated to my PR. Should I investigate, or is this a known issue I can ignore?