kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[Configuration Management] Allow config in config or config in code. #3807

Closed sheldontsen-qb closed 2 months ago

sheldontsen-qb commented 2 months ago

Description

Was just thinking that it would be great to let users pick between config in code vs config in yaml. What I mean by this is:

input_obj = KedroDataset(...)
output_obj = KedroDataset(...)

@dataclass
class MyParam:
   a: Any = ...
   b: Any = ...

my_param = MyParam()

Pipeline(
   [
      Node(func, ["input", "params:my_param"], "output"),  # currently strings, how it works right now
      Node(func, [input_obj, my_param], output_obj),       # proposed: actual objects
   ]
)

It's not many lines of code to change (I monkeypatched a few files just to check), and it would add another dimension of flexibility to how Kedro is used, giving users the option to decide how they would like to manage their configuration.

Context

Right now I cannot use dataclasses to manage my configuration, which is a reasonable pattern to want to use. I've seen preferences for fully leveraging the IDE, as well as the mindset that there should be a clear separation between parameters and code. I believe Kedro should offer this kind of flexibility to end users while still prescribing a preferred default.
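To illustrate the pattern being asked for, here is a minimal sketch (all names hypothetical) of dataclass-based configuration, where fields are typed so an IDE can autocomplete and type-check them, unlike values parsed from YAML:

```python
from dataclasses import dataclass

# Hypothetical parameters object: typed fields give IDE completion
# and static checking that string-keyed YAML parameters cannot.
@dataclass(frozen=True)
class TrainParams:
    learning_rate: float = 0.01
    n_estimators: int = 100

params = TrainParams(learning_rate=0.05)
```

Typos in field names become static errors rather than silent `KeyError`s at runtime.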

Possible Implementation

I did monkeypatch a few files and got a small prototype working a while back. The change goes where Kedro currently does a catalog.load for each input string. For starters, we could allow the following:

    import dataclasses

    for name in node.inputs:
        hook_manager.hook.before_dataset_loaded(dataset_name=name, node=node)

        if isinstance(name, str):
            inputs[name] = catalog.load(name)    # original catalog.load(...) logic, then pass to the node
        elif isinstance(name, AbstractDataset):
            inputs[str(name)] = name.load()      # call the dataset's load method
        elif dataclasses.is_dataclass(name):
            inputs[str(name)] = name             # pass directly to the node
        elif isinstance(name, object):           # catches everything; tbh no difference from the dataclass case unless we want custom handling
            inputs[str(name)] = name
        else:
            raise TypeError("only str, AbstractDataset, or dataclass allowed for now?")  # not sure if we want limiting or not
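The dispatch above can be sketched as a standalone, testable helper. This is a rough sketch, not Kedro's actual runner code: `resolve_input` and `catalog_load` are hypothetical names, and any object with a `.load()` method stands in for `AbstractDataset` here:

```python
import dataclasses
from typing import Any, Callable

def resolve_input(name: Any, catalog_load: Callable[[str], Any]) -> tuple[str, Any]:
    """Return a (key, value) pair for the node's inputs dict."""
    if isinstance(name, str):
        return name, catalog_load(name)    # existing string-based lookup
    if hasattr(name, "load"):
        return str(name), name.load()      # dataset-like object: call its load()
    if dataclasses.is_dataclass(name):
        return str(name), name             # dataclass instance: pass through as-is
    return str(name), name                 # anything else also passes straight through
```

Note that `dataclasses.is_dataclass` is the correct runtime check; `isinstance(x, dataclass)` raises a `TypeError` because `dataclass` is a decorator, not a type.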

Changes can be made here: https://github.com/kedro-org/kedro/blob/44817b8d1d26ef3cf2b6b3cd00b40dfada458816/kedro/runner/runner.py#L494

Note the last elif means that anything passes through, so I can define any object in any function used by a node and it cleanly gets passed in. Limiting the allowed types to dataclasses also seems incomplete.

To make this flexible, the mapping from object type to load function could live in settings.py, so users can always handle it themselves instead of a hard-coded if/elif chain in the Kedro codebase.
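A settings.py-style registry could look something like the following sketch. Everything here is hypothetical (`INPUT_RESOLVERS` and `resolve` are not Kedro APIs); the point is that an ordered list of (predicate, loader) pairs lets users extend or override input resolution without editing Kedro itself:

```python
# Hypothetical user-extensible registry: checked in order, first match wins.
INPUT_RESOLVERS = [
    (lambda obj: isinstance(obj, str), lambda obj, catalog: catalog.load(obj)),
    (lambda obj: hasattr(obj, "load"), lambda obj, catalog: obj.load()),
    (lambda obj: True, lambda obj, catalog: obj),  # fallback: pass through
]

def resolve(obj, catalog):
    for predicate, loader in INPUT_RESOLVERS:
        if predicate(obj):
            return loader(obj, catalog)
```

Users could then prepend their own (predicate, loader) pair, e.g. for Pydantic models, without any upstream change.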

Possible Alternatives

N/A

datajoely commented 2 months ago

Prototype from @limdauto a few years ago along the same lines: https://web.archive.org/web/20210921071139/https://kedrozerotohero.com/experiments/define-data-catalog-using-python

datajoely commented 2 months ago

@benhorsburgh also had a similar idea about marshalling / unmarshalling parameters with Pydantic

sheldontsen-qb commented 2 months ago

Yes, this could also be used to unpack parameters into a Kedro node, giving users a way to avoid writing:

def my_func(df, p):
    p1 = p.get("p1")
    ...
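With a typed parameters object, the dict lookups disappear. A minimal sketch (the `Params` class and its fields are hypothetical):

```python
from dataclasses import dataclass

# Hypothetical typed replacement for the dict-based `p` above:
# attribute access replaces p.get(...), and an IDE catches typos statically.
@dataclass
class Params:
    p1: int = 1
    p2: float = 0.5

def my_func(df, p: Params):
    return df, p.p1  # no p.get("p1") needed
```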
astrojuanlu commented 2 months ago

Thanks for opening this @sheldontsen-qb! This has come up a few times, so it's good to start collecting some use cases. I'm moving this to Discussions to also link it to a similar one we had not so long ago: https://github.com/kedro-org/kedro/discussions/3788