kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[KED-1108] Improve auto-generated node names #240

Closed lorenabalan closed 4 years ago

lorenabalan commented 4 years ago

Description & context

Users can specify names for their nodes to identify them more easily. When a name is not explicitly specified, Kedro auto-generates a default name, exposed through the name property on Node. The current auto-generated name looks something like func_name(inputs) -> outputs (see the implementation of the __str__ method on the Node class).

This is a bit too descriptive and quite hard to type in the CLI to run a particular node (kedro run --node <node_name>). Our visualisation plugin, Kedro-Viz, is also no longer displaying this long name in the UI.

We can change it to something like '-'.join(sorted(outputs)). Outputs should be unique, which makes these names unique too. 

Actually just one output is enough, as they should all be unique.

If the node produces no outputs, we probably need to fall back to what we had before or similar.
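As a rough sketch of the output-based idea above, assuming a standalone helper rather than a change to Node itself: default_node_name and legacy_name are hypothetical names, and the fallback string only approximates the func_name(inputs) -> outputs format.

```python
from typing import Callable, Sequence


def legacy_name(func: Callable, inputs: Sequence[str], outputs: Sequence[str]) -> str:
    """Approximate the verbose default: func_name(inputs) -> outputs.

    The exact bracket and separator formatting here is an assumption.
    """
    in_str = "[" + ",".join(inputs) + "]"
    out_str = "[" + ",".join(outputs) + "]"
    return f"{func.__name__}({in_str}) -> {out_str}"


def default_node_name(func: Callable, inputs: Sequence[str], outputs: Sequence[str]) -> str:
    """Hypothetical helper: name a node after its sorted outputs.

    Falls back to the verbose form when the node produces no outputs.
    """
    if outputs:
        return "-".join(sorted(outputs))
    return legacy_name(func, inputs, outputs)


def split_data(raw):
    """Placeholder function used only to illustrate the naming."""
    return raw, raw


name_with_outputs = default_node_name(split_data, ["raw"], ["train", "test"])
name_without_outputs = default_node_name(split_data, ["raw"], [])
```

Sorting the outputs keeps the name stable regardless of the order in which they were declared.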

Possible Implementation

One idea is to change the default behaviour so that instead it returns the underlying function name and a small hash, which is based on the inputs and outputs. This would now be perceived as a unique identifier.
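A minimal sketch of what this could look like, assuming a free function outside of Node: hashed_node_name, its digest_len parameter, the choice of SHA-256, and the underscore separator are all illustrative assumptions, not anything the issue specifies.

```python
import hashlib
from typing import Callable, Sequence


def hashed_node_name(
    func: Callable,
    inputs: Sequence[str],
    outputs: Sequence[str],
    digest_len: int = 8,
) -> str:
    """Hypothetical helper: function name plus a short hash of inputs and outputs."""
    # Join inputs and outputs with distinct separators so that
    # (["a", "b"], []) and (["a"], ["b"]) produce different payloads.
    payload = ",".join(inputs) + "->" + ",".join(outputs)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:digest_len]
    return f"{func.__name__}_{digest}"


def train_model(features, params):
    """Placeholder function used only to illustrate the naming."""
    return None


name_a = hashed_node_name(train_model, ["features", "params"], ["model"])
name_b = hashed_node_name(train_model, ["features", "params"], ["model_v2"])
```

The same function wired to different datasets gets a different suffix, so the name stays short while remaining distinct per node. A truncated hash can collide in principle, so a real implementation would likely still need a duplicate check.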

Extra: Check that hover state on nodes in Kedro-Viz shows you the correct thing to be put into the command line kedro run --node.

SJDunkelman commented 4 years ago

Working on this

WaylonWalker commented 4 years ago

I would love to get rid of some of the verbosity from my repos, and not hand name each and every node, and only need to do so in the edge cases in which odd things occur.

nishnash54 commented 4 years ago

I was looking into this issue. @lorenabalan The current implementation in Kedro-Viz shows node.name as the node's display text and shows node.func_name on hover.

One idea is to change the default behaviour so that instead it returns the underlying function name and a small hash, which is based on the inputs and outputs. This would now be perceived as a unique identifier.

Do you propose setting this unique identifier as the node.name?

Extra: Check that hover state on nodes in Kedro-Viz shows you the correct thing to be put into the command line kedro run --node.

To implement this behavior, we will either have to change the kedro-viz hover option or change the node.func_name.

The current naming convention contains a lot of useful information and could be stored as node.info if required. A lot of tests would need to be rewritten to accommodate these changes.

lorenabalan commented 4 years ago

Hey @nishnash54, apologies for the late reply. Kedro-Viz currently makes use of Node.short_name and Node._func_name. We've been struggling with this idea for a while now. Our API feels a bit all over the place in terms of what is a name and what is an identifier, and how we use them in Node and Pipeline (unique_key, validate_node_duplicates, etc.), so we need to find the time to sit down and think about it as a whole. In the past we discussed that the name/ID (not sure yet whether they should be the same thing or different properties) should ideally satisfy three principles: unique, human-readable, and reasonably straightforward to reconstruct or deduce in your head. Satisfying all three at once is very hard. I don't think Node.func_name should change. There is also work happening on the Viz side to display more information about the nodes and datasets, so we'll use that to feed into our decisions too.

lorenabalan commented 4 years ago

We've parked this for now to focus on other deliverables on our roadmap.