kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.49k stars 874 forks source link

Make credentials a resolver #3811

Open Galileo-Galilei opened 2 months ago

Galileo-Galilei commented 2 months ago

Description

Make "credentials" a OmegaConf resolver, so injecting credentials in the catalog would look like :

# current injection
my_data:
    type: pandas.CSVDataset
    filepath: /path/to/data.csv
    credentials: my_credentials # a string which refer to a a section in credentials.yaml
# new proposal
my_data:
    type: pandas.CSVDataset
    filepath: /path/to/data.csv
    credentials: ${credentials: my_credentials}

Context

This may look like a pure technical refactoring, but actually I am expecting this to be an actual improvement for users, e.g. :

Overall, this would likely be a general improvement with some overlaps with what is described here https://github.com/kedro-org/kedro/issues/1646 and has never been properly adressed, despite a bunch of discussions around this topic.

Possible Implementation

Move the logic from the catalog and the _get_credentials and _resolve_credentials:

https://github.com/kedro-org/kedro/blob/7237dc79e6b77475f6f1296f4fdd9d6aad3e1d11/kedro/io/data_catalog.py#L35-L58

https://github.com/kedro-org/kedro/blob/7237dc79e6b77475f6f1296f4fdd9d6aad3e1d11/kedro/io/data_catalog.py#L61-L83

to a resolver.

Possible Alternatives

Do nothing and keep the situation as is.

idanov commented 2 months ago

I like the idea in principle, but what would be a way for us to prevent hardcoding of credentials into the catalog directly?

Galileo-Galilei commented 2 months ago

Not sure of what you mean : for me the default credentials provided by kedro would do the exact same things as it does now (map a string in the catalog to an entry in credentials.yml) so I don't think there is a problem here.

A user who likes playing with kedro's limits could eventually write his own resolver and eventually takes raw inputs directly in the catalog, but I don't think it's kedro's responsibility to try to circumvent all weird things users can do (and funnily, they already can write a custom resolver and use it in the catalog - my proposal would simply enable to make their dangerous logic the default, but they are clearly doing it on purpose). Do you have an example of a situation this proposal would enable which is not already possible and that we want to avoid by all means?

astrojuanlu commented 2 months ago

actually I am expecting this to be an actual improvement for users

Hard agree. Our YAML-first approach to credentials is weird, given that there are many other methods that are arguably more prevalent, like environment variables, secrets vaults, JSON token files...

Plus we have evidence that users struggle to understand the current approach https://github.com/kedro-org/kedro/issues/1646#issuecomment-1703023713

datajoely commented 2 months ago

I think this is really elegant, I think there is probably a way where this is non-breaking and just an extension too

idanov commented 2 months ago

@Galileo-Galilei The way OmegaConf resolvers work is that they eventually render the value to the dictionary you have provided. That means that in the end result, you get the credentials as part of the key there. The consumer (in this case Kedro) would see no difference between this:

# new proposal
my_data:
    type: pandas.CSVDataset
    filepath: /path/to/data.csv
    credentials: ${credentials: my_credentials}

and this:

# new proposal
my_data:
    type: pandas.CSVDataset
    filepath: /path/to/data.csv
    credentials: s3cre7!

Currently, the latter will not be taken by Kedro and it won't work, because we do the resolving after loading the config. This way we are preventing users from accidentally checking in secrets. If we go the resolver path (which I like personally), we need to make sure that my second example will error out and your example will only work.

Galileo-Galilei commented 2 months ago

This is a totally valid concern, I did not think about this edge case.

It seems it should be possible to check if the credentials subkey is interpolated with [OmegaConf.is_interpolation method] (https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#omegaconf-is-interpolation) and raise an error if it isn't, but it will likely require #2973 to be completed first (and given the number of references to this issue, it feels like it will become a blocker at one point)