kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.47k stars 875 forks

Improve error message in `ModularPipelineError` raised within `_validate_datasets_exist()` #3953

Open yury-fedotov opened 2 weeks ago

yury-fedotov commented 2 weeks ago

Description

A nice feature of modular pipelines is that if you make a mistake in mapping inputs/outputs to what nodes need, it raises a ModularPipelineError with a helpful message containing a set of catalog items you missed:

Failed to map datasets and/or parameters onto the nodes provided: <what you didn't map>...
Did you mean <those> instead?

However, I think this error message could be even more useful if it described the mismatch in more detail, for example by distinguishing redundant inputs from missing inputs.

As far as I understand, those two scenarios currently lead to the same error message. However, it seems possible to make this distinction from the arguments passed to the `_validate_datasets_exist()` function, which is responsible for raising this exception.

Context

I think it might improve developer experience while building modular pipelines.

Possible Implementation

Instead of inferring a single list of mismatches:

non_existent = (inputs | outputs | parameters) - existing

Infer it as two separate sets (that sum to non_existent), called something like redundant_inputs and missing_inputs, and configure the error message to reflect the difference.
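To make the idea concrete, here is a minimal sketch of what a two-part message could look like. It is not Kedro code: `build_mismatch_message` is a hypothetical helper, the wording of each line is only a suggestion, and the reading of "redundant" as "mapped but not used" and "missing" as "needed but not mapped" is my assumption. How the two sets would be derived inside `_validate_datasets_exist()` is deliberately left open.

```python
from __future__ import annotations


def build_mismatch_message(redundant_inputs: set[str], missing_inputs: set[str]) -> str:
    """Report the two kinds of mismatch on separate lines.

    Hypothetical helper: both set names and the message wording are
    illustrative assumptions, not part of Kedro.
    """
    lines = ["Failed to map datasets and/or parameters onto the nodes provided."]
    if redundant_inputs:
        names = ", ".join(sorted(redundant_inputs))
        lines.append(f"Mapped but not used by any node: {names}")
    if missing_inputs:
        names = ", ".join(sorted(missing_inputs))
        lines.append(f"Needed by nodes but not mapped: {names}")
    return "\n".join(lines)


# Toy data showing what the current single-set computation yields.
inputs, outputs, parameters = {"raw_data", "modle_input"}, {"predictions"}, set()
existing = {"raw_data", "model_input", "predictions"}
non_existent = (inputs | outputs | parameters) - existing  # only the typo remains

print(build_mismatch_message(redundant_inputs=non_existent, missing_inputs=set()))
```

Sorting the names keeps the message stable between runs, which makes it easier to assert on in tests.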