Description
I've noticed an increasing number of beginners facing issues when using kedro and kedro-mlflow with the following workflow (see https://github.com/Galileo-Galilei/kedro-mlflow/discussions/456):
- They develop locally, starting from a starter or the default template.
- mlflow creates a mlruns folder the first time they run a kedro pipeline.
- They push the local mlflow runs to a remote forge (GitHub / GitLab).
- If someone else pulls the code on another machine, mlflow gets the artifact paths wrong and they have trouble running their pipelines.
This all boils down to a misunderstanding of how mlflow and git work. That said, they are complex tools, and beginners are often introduced to the whole kedro / mlflow / git tooling simultaneously, so it is understandable that they do not see why data (their mlflow runs) should not be pushed to a remote forge. Besides, kedro-mlflow (and mlflow itself) do a lot of things under the hood, so some people do not even notice that a folder is created.
I suggest we ignore mlruns/* by default in kedro's template to avoid pushing data remotely, because that's the right thing to do (even if people use mlflow without kedro-mlflow) and it is hard to fix from kedro-mlflow (see https://github.com/Galileo-Galilei/kedro-mlflow/issues/523). I think a beginner will think twice before removing something from .gitignore if they don't really understand what is at stake.
Note: The change should also be made in the starters.
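For an existing project that was generated before this change, the same entry can be added by hand. The sketch below shows one way to do it idempotently in Python; `ensure_ignored` and the comment line it writes are hypothetical names chosen for illustration and are not part of kedro, kedro-mlflow or mlflow.

```python
from pathlib import Path

PATTERN = "mlruns/*"  # the pattern proposed in this PR


def ensure_ignored(gitignore: Path, pattern: str = PATTERN) -> bool:
    """Append `pattern` to `gitignore` unless it is already listed.

    Returns True if the file was modified, False otherwise.
    Hypothetical helper, not provided by kedro or kedro-mlflow.
    """
    text = gitignore.read_text() if gitignore.exists() else ""
    entries = [line.strip() for line in text.splitlines()]
    if pattern in entries:
        return False
    # Make sure the new entry starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    gitignore.write_text(text + "# mlflow local tracking data\n" + pattern + "\n")
    return True
```

Calling `ensure_ignored(Path(".gitignore"))` from the project root is enough for new folders. If mlruns has already been committed, adding the pattern does not untrack it; the folder also has to be removed from the index (for example with `git rm -r --cached mlruns`).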
Development notes
Add mlruns/* to .gitignore to avoid pushing mlruns and all its subfolders in case the folder is created by kedro-mlflow. We may document why this folder should not be pushed, but that is more likely to be done in kedro-mlflow or in mlflow itself.
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md file