kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.53k stars 877 forks source link

Default node names are problematic #3575

Open astrojuanlu opened 5 months ago

astrojuanlu commented 5 months ago

Description

When a node has no name, the default is to use a string representation for it. For example:

'split_data([model_input_table; params: model_options]) -> [X_train;X _test;y_train;y_test]'

This poses a series of problems:

Context

I contend that 'split_data([model_input_table; params: model_options]) -> [X_train;X _test;y_train;y_test]' is a bad name because:

Plus all the problems discussed above.

Possible Implementation

Use func.__name__ as default node names.

deepyaman commented 5 months ago

Use func.__name__ as default node names.

Don't node names need to be unique? Even if not ideal, the current approach guarantees uniqueness, while the latter very frequently won't.

astrojuanlu commented 5 months ago

Correct: the current approach guarantees uniqueness under the current assumptions of "no two nodes can output the same dataset".

More strawman proposals:

More ideas?

noklam commented 5 months ago

It may be obvious, but can I ask why node name need to be unique? I feel like they exists for different purposes, I would like to map them out and see if there are opportunities to combine them. There would be at least 2 different purpose, one is for internal working, the other for human-readable.

Currently there are different names.

Noted some of this already exist in Node API, though it may not be what you expected.

def _get_readable_func_name(func: Callable) -> str:
    """Get a user-friendly readable name of the function provided.

    Returns:
        str: readable name of the provided callable func.
    """

    if hasattr(func, "__name__"):
        return func.__name__

Re: @deepyaman point, I think is fair for any medium size pipeline using namespace, but I think there are also many pipelines without repeated node function.

p.s. I am asking this just because I am lazy, I will do the digging at some point, but if someone already know that could save some time.

astrojuanlu commented 5 months ago

Agreed with the sentiment. I imagine a "node ID" should be unique, but not sure why we want names to be unique.

noklam commented 5 months ago

This is my understanding:

This definition is a bit leaky, because I will think combination of (func, inputs + outputs) is already an unique key. I test this and Kedro thinks they are valid:

I tend to say it's NO, they are the same.

Back to the original idea, _unique_key should be used as the "Node ID". You may be surprised that _validate_duplicate_nodes is using node.name to compare instead of node._unique_key. I do not understand why this is the case, but seems that a lot of the __eq__ , __lt__ was created for toposort, Kedro doesn't use this to compare node.

If we offload this responsibility to _unique_key, maybe we can use node.name to be something more human-readable, or even allow it to be non-unique. namespace.func is not a bad choice for default node name. The node name doesn't matter to Kedro, but only to the user.

What if user provide a name but there are multiple nodes return with the same name? This affect things such as %load_node <node_name>, kedro run --to-nodes <node_name>

2 options:

Con:

Pro:

Note that having duplicate names doesn't mean that you need to change the name immediately, you only have to do so if you have to specify it as an argument (smaller chance)

astrojuanlu commented 3 months ago

You may be surprised that _validate_duplicate_nodes is using node.name to compare instead of node._unique_key

Maybe this is a good place to start?

seems that a lot of the __eq__ , __lt__ was created for toposort, Kedro doesn't use this to compare node.

Now that we switched to graphlib #3728, would it be interesting to try to do some "tree-shaking" and see if we can remove (after deprecation) some of these methods?

noklam commented 3 months ago

The comment was mostly based on the investigation that I have done 2 months ago, they are documented a bit more in details here: https://noklam.github.io/blog/posts/default_node_name/2024-02-08-default-node-name.html

Now that we switched to graphlib https://github.com/kedro-org/kedro/pull/3728, would it be interesting to try to do some "tree-shaking" and see if we can remove (after deprecation) some of these methods?

We can definitely try this, but this wouldn't address the problem described in this issue, it may helps with removing the API surface.

The main proposal of https://github.com/kedro-org/kedro/issues/3575#issuecomment-1936693023 is that can we loosen up to make node.name non-unique, and offload that uniqueness checking to node._unique_key instead.