kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.47k stars 875 forks

Unpack dictionary parameters #3905

Open bpmeek opened 1 month ago

bpmeek commented 1 month ago

Description

In response to this ticket

Development notes

Dictionaries can now be unpacked in the inputs argument of a node; see the example below.

_unpack_params was added to kedro/kedro/pipeline/node.py. This function iterates over the node inputs and updates node.inputs if needed.

Note: each key in the unpacked dictionary is assumed to have the same name as the keyword argument of the node function that it fills.
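One way such an expansion could work is sketched below. This is not the code from this PR: unpack_params_sketch is a hypothetical helper, and mapping each remaining function argument to a params:<group>.<argument> entry is an assumption made for illustration. It relies on the same-name assumption stated in the note above.

```python
import inspect
from typing import Callable, Dict, List

UNPACK_PREFIX = "**params:"


def unpack_params_sketch(func: Callable, inputs: List[str]) -> Dict[str, str]:
    """Hypothetical helper: expand '**params:<group>' into one input per argument."""
    arg_names = list(inspect.signature(func).parameters)
    # Ordinary inputs bind to the leading arguments, in order.
    plain = [name for name in inputs if not name.startswith(UNPACK_PREFIX)]
    resolved = dict(zip(arg_names, plain))
    # Every argument not yet bound is looked up by name inside the group.
    remaining = arg_names[len(plain):]
    for entry in inputs:
        if entry.startswith(UNPACK_PREFIX):
            group = entry[len(UNPACK_PREFIX):]
            for arg in remaining:
                resolved.setdefault(arg, f"params:{group}.{arg}")
    return resolved


def split_data(data, features, test_size, random_state):
    """Stand-in with the same signature as the node function in the example."""


expanded = unpack_params_sketch(
    split_data, ["model_input_table", "**params:model_options"]
)
```

With the inputs from the example, expanded maps data to model_input_table and each of features, test_size and random_state to an entry under params:model_options.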

Changes:

kedro/kedro/pipeline/modular_pipeline.py was updated so that mapping dataset names no longer raises errors:

_is_single_parameter returns True if the name starts with **params:

_normalize_param_name does not prepend params: if the name already begins with **params:

_validate_datasets_exist removes datasets that begin with ** from non_existent. I have confirmed that kedro/runner/runner.py will catch these missing datasets.
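The three behaviours described above can be illustrated with simplified stand-ins. These functions are hypothetical and are not the Kedro source; the names, signatures and the use of plain sets are assumptions chosen to keep the sketch self-contained.

```python
from typing import Set

PARAM_PREFIXES = ("params:", "**params:")


def is_single_parameter_sketch(name: str) -> bool:
    # Treat both the ordinary and the unpacking form as a single parameter.
    return name.startswith(PARAM_PREFIXES)


def normalize_param_name_sketch(name: str) -> str:
    # Leave names alone if they already carry either prefix.
    if name.startswith(PARAM_PREFIXES):
        return name
    return f"params:{name}"


def non_existent_sketch(requested: Set[str], existing: Set[str]) -> Set[str]:
    # Unpacked entries (starting with '**') are skipped here and left
    # for a later check to catch, as the description says the runner does.
    missing = requested - existing
    return {name for name in missing if not name.startswith("**")}


missing = non_existent_sketch(
    {"model_input_table", "**params:model_options", "typo_dataset"},
    {"model_input_table"},
)
```

In this sketch only typo_dataset is reported as missing; the **params:model_options entry passes through without being validated at this stage.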

Undesirable behavior

Currently, modular pipelines still namespace the unpacked parameters. I'll need assistance with either updating the rules list or adjusting the rename function in a way that makes sense.

EDIT: This is no longer applicable after commit

Examples:

nodes.py

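from typing import List, Tuple

import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(data: pd.DataFrame, features: List[str], test_size: float, random_state: int) -> Tuple:
    # features, test_size and random_state are filled from the unpacked model_options.
    X = data[features]
    y = data["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return X_train, X_test, y_train, y_test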

pipeline.py

node(
    func=split_data,
    inputs=["model_input_table", "**params:model_options"],
    outputs=["X_train", "X_test", "y_train", "y_test"],
    name="split_data_node",
),

parameters (YAML)

model_options:
  test_size: 0.2
  random_state: 3
  features:
    - engines
    - passenger_capacity
    - crew
    - d_check_complete
    - moon_clearance_complete
    - iata_approved
    - company_rating
    - review_scores_rating
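In plain Python terms, the node above is intended to behave like a call that unpacks the parameter dictionary as keyword arguments. The sketch below uses a stub function and a shortened dictionary, both hypothetical, to show that behaviour and what happens when a key does not match an argument name.

```python
def split_data_stub(data, features, test_size, random_state):
    """Stub with the same argument names as split_data; returns what it received."""
    return {
        "n_rows": len(data),
        "features": features,
        "test_size": test_size,
        "random_state": random_state,
    }


# Shortened stand-in for the model_options parameters.
model_options = {"test_size": 0.2, "random_state": 3, "features": ["engines", "crew"]}
rows = [{"engines": 1, "crew": 2, "price": 10.0}]

# Roughly what "**params:model_options" is meant to achieve for the node.
result = split_data_stub(rows, **model_options)

# A key that does not match an argument name makes the call fail.
try:
    split_data_stub(rows, **{"test_size": 0.2, "seed": 3, "features": []})
except TypeError as exc:
    mismatch_error = str(exc)
```

The second call raises a TypeError, which is why the dictionary keys and the function's argument names need to agree.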

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

astrojuanlu commented 1 week ago

Thanks for this PR @bpmeek! We'll review it shortly 🙏🏼