MinuraPunchihewa opened 2 weeks ago
So we've resisted this for years since it violates the 12-factor app; the addition of `--conf-source` is our official solution to this. But users still hit this bit of friction, especially before they have proper CI/CD to do this neatly.

I'd be fully in favour of adding an explicit flag to package conf... but throw a warning on why we believe it to be bad practice and not fit for production.
Hmm.. Yes, agreed. I think this might actually be the best solution to the problem? It would resolve the issues I am facing in a distributed environment as well.

I am happy to open a PR for both, but I just wanted to confirm whether we want to add an option to include the configuration in the package?
> Moreover, when executing the code in a distributed environment like Databricks, performing file system operations such as unzipping the `.tar.gz` file is not exactly straightforward. For instance, the file paths that would be used when dealing with an interactive cluster and a job cluster change slightly.
@MinuraPunchihewa, `--conf-source` works with the `.tar.gz` file (since Kedro 0.18.7, I believe), so there's no need to unzip the compressed file. Does that feature alleviate some of these concerns?
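To illustrate why no unpacking is needed, here is a toy, self-contained sketch using plain Python `tarfile` (not Kedro's actual implementation; the archive and its contents are made up for the example) of reading a config file straight out of a `.tar.gz` archive without extracting anything to disk:

```python
import io
import tarfile

# Build a tiny conf archive in memory, standing in for the
# <project>-conf.tar.gz that `kedro package` produces.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"model_options:\n  test_size: 0.2\n"
    info = tarfile.TarInfo(name="base/parameters.yml")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Read a file back directly from the compressed archive.
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    text = tar.extractfile("base/parameters.yml").read().decode()
print(text)
```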
Adding remote cloud options to `--conf-source` (https://github.com/kedro-org/kedro/issues/3982) has been discussed by the team and is ready to be implemented, so if you wanted to open a PR for this, please feel free! However, the option to include configuration in the `.whl` file might warrant some discussion with the team. I will put this issue in our backlog and we'll add more info to this ticket on what we decide!
Thanks 💯
Thanks, @ankatiyar. I was not aware that the `.tar.gz` file could be passed directly to `--conf-source`. I will try doing this in my Databricks environment and check if it works.
Yes, sure. I will open a PR for passing in remote configurations.
Got it. Let me know what you decide.
`conf_source` makes perfect sense to me. I am less sure about packaging `conf` into the wheel. The problem with packaging `conf` is that it's not possible to change the configuration in code. The alternative is to move `conf` into `src` and change `CONF_SOURCE` inside `settings.py`; this is supported already and requires no change. I don't see a massive benefit of adding a flag to package `conf` into the wheel.

@MinuraPunchihewa a possible reference implementation can be found here where we do something similar for micropackaging (even if that feature is currently deprecated!)
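For reference, the alternative of moving `conf` under `src` amounts to a one-line change; a minimal sketch, assuming the configuration was moved to a `conf` folder next to `settings.py` (the folder name and location are assumptions for illustration):

```python
# settings.py -- minimal sketch; the exact folder layout is an assumption
from pathlib import Path

# Point Kedro at configuration that lives inside src/ (and therefore
# travels with the wheel) instead of the project-level conf/ folder.
CONF_SOURCE = str(Path(__file__).parent / "conf")
```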
A few thoughts aside from #3982:
(1) About bundling configuration with the source code, I know that some users do this already. Managing the two separately is cumbersome. For example, deploying on Airflow (and I bet on most other platforms) requires an extra step because of this: https://docs.kedro.org/en/stable/deployment/airflow.html
(2) Some users just don't see the point of `kedro package` at all; we noticed that during the deployment research #4325
(3) We've long known that not all our configuration is created equal, see this discussion by @Galileo-Galilei in #770. In particular, dataset types are intimately coupled with business logic (because they define the in-memory representation of the data) and arguably they are not "in the same league" as, say, model parameters.
(4) I think our interpretation of the 12factor app is quite idiosyncratic. From the text, it reads

> This is a violation of twelve-factor, which requires strict separation of config from code.

But the subtitle of section III is

> Store config in the environment

And further down it says

> The twelve-factor app stores config in environment variables (often shortened to env vars or env).
The reluctance of Kedro to rely on environment variables is well known; we almost only allow them for credentials by default (#2623), and that's not even enough for many use cases (see #1621, long discussion at #3811).
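For context, the credentials case works by interpolating environment variables at config-load time. A toy re-implementation of that pattern (this is not Kedro's actual code; the `${oc.env:...}` syntax mirrors the OmegaConf resolver, and the variable name is made up):

```python
import os
import re

def resolve_env(value: str) -> str:
    """Replace ${oc.env:VAR} placeholders with values from the environment."""
    return re.sub(
        r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)\}",
        lambda m: os.environ[m.group(1)],
        value,
    )

# Stand-in for a secret injected by the deployment platform.
os.environ["DB_PASSWORD"] = "s3cret"

con = resolve_env("postgresql://user:${oc.env:DB_PASSWORD}@host:5432/db")
print(con)  # postgresql://user:s3cret@host:5432/db
```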
This is just a quick summary but long story short I think our approach to packaging and bundling needs a refresh.
> This is just a quick summary but long story short I think our approach to packaging and bundling needs a refresh.
I'd add the word pragmatic somewhere in there, but I agree wholeheartedly
I will add more thoughts on what I think we should do regarding this topic in #770 as promised to @astrojuanlu.
That said (and as @noklam said), notice that it is already very easy to implement in `kedro>=0.19` (and `kedro>=0.18.?` if you enable `OmegaConfigLoader`) with the following procedure:
1. Create a `src/conf_app` folder
2. Copy the content of `conf/base` in `src/conf_app`
3. Delete the `conf/base` folder
4. Modify `settings.py` as follows:

```python
# settings.py
from pathlib import Path

CONFIG_LOADER_ARGS = {
    "base_env": (Path(__file__).parents[1] / "conf_app").as_posix(),
    "default_run_env": "local",
}
```
Good news is, this does work even if you partially override `CONF_SOURCE` through the CLI (and this is likely a happy accident 🤯).
You can get much more detail and explanations in this demo repository.
PS: I haven't tried with `kedro package` but it should be packaged with the `src` folder (else just add `include_data=True` in the `setup.py`, see https://github.com/kedro-org/kedro/issues/1607)
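For anyone wiring this up by hand, a hedged sketch of what bundling the data files could look like with plain setuptools (`include_package_data` and `package_data` are the standard setuptools spellings; the project and folder names here are illustrative only, not this project's actual build config):

```python
# setup.py -- illustrative sketch only
from setuptools import find_packages, setup

setup(
    name="my_kedro_project",
    version="0.1.0",
    package_dir={"": "src"},
    packages=find_packages("src"),
    # Ship the YAML files living under src/conf_app inside the wheel.
    package_data={"conf_app": ["**/*.yml", "**/*.yaml"]},
    include_package_data=True,
)
```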
Description
At the moment, when packaging a Kedro project using `kedro package`, two artifacts are created: a `.whl` file containing the source code and a `.tar.gz` file consisting of the configuration in the `conf` directory.

I understand that it is (in general) a best practice; however, from a user's point of view, these are two artifacts to be maintained. In my CI/CD pipelines, as well as in the mechanism for running the code, I need to account for them, which makes the process a little more complicated.
Moreover, when executing the code in a distributed environment like Databricks, performing file system operations such as unzipping the `.tar.gz` file is not exactly straightforward. For instance, the file paths that would be used when dealing with an interactive cluster and a job cluster change slightly.

It would be great if it were possible to include the `conf` folder as part of the `.whl` file (at the user's request). This would make the maintenance, installation and usage of the artifacts easier.

Context
As I mentioned, this could go a long way in improving the usage of packaged Kedro projects, especially when running in distributed computing environments.
Possible Implementation
I suggest adding a flag to the `kedro package` command to allow users to choose whether to include the `conf` folder in the `.whl` file. The default behaviour can be to avoid doing this.

Possible Alternatives
A possible alternative might be to improve the documentation to explain how Kedro projects can be packaged and run on different systems (with examples); however, this might not be very extensible given the large number of different options that are available for running pipelines.
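To make the suggestion under Possible Implementation concrete, a rough sketch of the opt-in flag using stdlib `argparse` (Kedro's real CLI is built differently, and the flag name and warning wording are assumptions):

```python
import argparse
import warnings

parser = argparse.ArgumentParser(prog="kedro package")
parser.add_argument(
    "--include-conf",
    action="store_true",
    help="Also bundle the conf/ folder into the .whl (opt-in; discouraged for production).",
)

# Simulate: kedro package --include-conf
args = parser.parse_args(["--include-conf"])
if args.include_conf:
    warnings.warn(
        "Bundling conf/ into the wheel violates strict separation of "
        "config from code; avoid this in production deployments."
    )
print(args.include_conf)  # True
```

The flag defaults to off, matching the issue's suggestion that the current behaviour stays the default and the warning only fires when users explicitly opt in.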