kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Add Option to Include Configuration (`conf`) in the Kedro Package for Source Code? #4316

Open MinuraPunchihewa opened 2 weeks ago

MinuraPunchihewa commented 2 weeks ago

Description

At the moment, packaging a Kedro project with `kedro package` creates two artifacts: a `.whl` file containing the source code and a `.tar.gz` file containing the configuration from the `conf` directory.

I understand that this separation is (in general) a best practice. However, from a user's point of view, these are two artifacts to maintain: in my CI/CD pipelines, as well as in the mechanism for running the code, I need to account for both, which makes the process a little more complicated.

Moreover, when executing the code in a distributed environment like Databricks, performing file system operations such as unzipping the `.tar.gz` file is not exactly straightforward. For instance, the file paths used with an interactive cluster and with a job cluster differ slightly.

It would be great if it were possible to include the `conf` folder as part of the `.whl` file (at the user's request). This would make the maintenance, installation, and usage of the artifacts easier.

Context

As I mentioned, this could go a long way toward improving the usability of packaged Kedro projects, especially when running in distributed computing environments.

Possible Implementation

I suggest adding a flag to the `kedro package` command that lets users choose whether to include the `conf` folder in the `.whl` file. The default behaviour would be to leave it out.

Possible Alternatives

A possible alternative might be to improve the documentation to explain (with examples) how Kedro projects can be packaged and run on different systems. However, this might not scale well given the large number of options available for running pipelines.

datajoely commented 2 weeks ago

So we've resisted this for years since it violates the 12-factor app; the addition of `--conf-source` is our official solution to this. But users still hit this bit of friction, especially before they have proper CI/CD to handle it neatly.

I'd be fully in favour of adding an explicit flag to package conf... but emit a warning explaining why we believe it to be bad practice and not fit for production.

MinuraPunchihewa commented 2 weeks ago

> So we've resisted this for years since it violates the 12-factor app; the addition of `--conf-source` is our official solution to this. But users still hit this bit of friction, especially before they have proper CI/CD to handle it neatly.
>
> I'd be fully in favour of adding an explicit flag to package conf... but emit a warning explaining why we believe it to be bad practice and not fit for production.

Hmm.. Yes, agreed. I think this might actually be the best solution to the problem? It would resolve the issues I am facing in a distributed environment as well.

I am happy to open a PR for both, but I just wanted to confirm whether we want to add an option to include the configuration in the package.

ankatiyar commented 1 week ago

> Moreover, when executing the code in a distributed environment like Databricks, performing file system operations such as unzipping the tar.gz file is not exactly straightforward. For instance, the file paths that would be used when dealing with an interactive cluster and a job cluster change slightly.

@MinuraPunchihewa, --conf-source works with the .tar.gz file (since Kedro 0.18.7 I believe) so there's no need to unzip the compressed file. Does that feature alleviate some of these concerns?
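For anyone hitting the unzipping friction: as a generic illustration (standard library only, not Kedro's internals), a member of a `.tar.gz` can be read in place, which is why no manual extraction step is required:

```python
import tarfile


def read_member(archive_path: str, member: str) -> str:
    """Read one file out of a .tar.gz in memory, without unpacking
    the archive onto the filesystem."""
    with tarfile.open(archive_path, "r:gz") as tar:
        fileobj = tar.extractfile(member)
        if fileobj is None:
            raise FileNotFoundError(member)
        return fileobj.read().decode("utf-8")
```

For example, `read_member("conf-demo.tar.gz", "conf/base/parameters.yml")` would return the file's text directly from the archive.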

Adding remote cloud options to `--conf-source` (https://github.com/kedro-org/kedro/issues/3982) has been discussed by the team and is ready to be implemented, so if you want to open a PR for this, please feel free! However, the option to include configuration in the `.whl` file might warrant some discussion with the team. I will put this issue in our backlog and we'll add more info to this ticket on what we decide!

Thanks 💯

MinuraPunchihewa commented 1 week ago


Thanks, @ankatiyar. I was not aware that the .tar.gz file could be directly passed to --conf-source. I will try doing this in my Databricks environment and check if it works.

Yes, sure. I will open a PR for passing in remote configurations.

Got it. Let me know what you decide.

noklam commented 1 week ago
  1. Adding remote support for `--conf-source` makes perfect sense to me.
  2. Packaging `conf` into the wheel: the problem with packaging `conf` is that the configuration can then no longer be changed without rebuilding the code. The alternative, which is supported already and requires no change, is to move `conf` into `src` and change `CONF_SOURCE` inside `settings.py`. I don't see a massive benefit in adding a flag to package `conf` into the wheel.
datajoely commented 1 week ago

@MinuraPunchihewa a possible reference implementation can be found here where we do something similar for micropackaging (even if that feature is currently deprecated!)

https://github.com/kedro-org/kedro/blob/075d59b1776c585698c677ec3619bc30b15ea8bc/kedro/framework/cli/micropkg.py#L421

astrojuanlu commented 1 week ago

A few thoughts aside from #3982:

(1) About bundling configuration with the source code, I know that some users do this already. Managing the two separately is cumbersome.

For example, deploying on Airflow (and I bet in most other platforms) requires an extra step because of this https://docs.kedro.org/en/stable/deployment/airflow.html

(2) Some users just don't see the point of kedro package at all, we noticed that during the deployment research #4325

(3) We've long known that not all our configuration is created equal, see this discussion by @Galileo-Galilei #770 In particular, dataset types are intimately coupled with business logic (because they define the in-memory representation of the data) and arguably they are not "in the same league" as, say, model parameters.

(4) I think our interpretation of the 12factor app is quite idiosyncratic. From the text, it reads

This is a violation of twelve-factor, which requires strict separation of config from code.

But the subtitle of section III is

Store config in the environment

And further down it says

The twelve-factor app stores config in environment variables (often shortened to env vars or env).

The reluctance of Kedro towards environment variables is well known; we allow them almost only for credentials by default #2623 (and even that is not enough for many use cases, see #1621 and the long discussion at #3811).
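For context, a minimal sketch of the 12-factor "store config in the environment" style the quoted text describes (plain stdlib, not Kedro API; `DATABASE_URL` is just an illustrative variable name):

```python
import os


def get_db_url() -> str:
    # 12-factor style: deploy-specific settings come from the environment,
    # with a development default as fallback; the code ships no config file.
    return os.environ.get("DATABASE_URL", "sqlite:///local.db")
```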


This is just a quick summary but long story short I think our approach to packaging and bundling needs a refresh.

datajoely commented 1 week ago

> This is just a quick summary but long story short I think our approach to packaging and bundling needs a refresh.

I'd add the word pragmatic somewhere in there, but I agree wholeheartedly

Galileo-Galilei commented 1 week ago

I will add more thoughts on what I think we should do regarding this topic in #770 as promised to @astrojuanlu.

That said (and as @noklam said), note that this is already very easy to implement in kedro>=0.19 (and kedro>=0.18.? if you enable OmegaConfigLoader) with the following procedure:

  1. Create an empty src/conf_app folder
  2. Copy the content of conf/base in src/conf_app
  3. Delete conf/base folder
  4. Update `settings.py` as follows:
# settings.py
from pathlib import Path

CONFIG_LOADER_ARGS = {
    "base_env": (Path(__file__).parents[1] / "conf_app").as_posix(),
    "default_run_env": "local",
}
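A quick sanity check of the path arithmetic in the snippet above, assuming the standard `src/<package>/settings.py` layout (the package name here is illustrative):

```python
from pathlib import Path

# settings.py lives at src/<package>/settings.py, so parents[0] is
# src/<package> and parents[1] is src/; base_env therefore resolves
# to src/conf_app independently of the current working directory.
settings_file = Path("src/my_package/settings.py")
base_env = (settings_file.parents[1] / "conf_app").as_posix()
print(base_env)  # src/conf_app
```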

Good news is, this works even if you partially override CONF_SOURCE via the CLI (and this is likely a happy accident 🤯).

You can get much more detail and explanations in this demo repository.

PS: I haven't tried with kedro package, but it should be packaged with the src folder (otherwise just add `include_package_data=True` in setup.py, see https://github.com/kedro-org/kedro/issues/1607)