kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Make `node` accept inputs/outputs in a consistent manner #2408

Open noklam opened 1 year ago

noklam commented 1 year ago

Description

Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."

As a Kedro user, you often need to translate Python code into a Kedro node. There are some surprising behaviours here, and opportunities to make `node` resemble a plain Python function more closely. This piece of code hasn't been touched for 4+ years.

            # This is OK
            node(
                func=make_predictions,
                inputs="X_train",  # passing a string
                outputs="y_pred",
                name="make_predictions",
            ),

            # This is OK
            node(
                func=make_predictions,
                inputs=["X_train"],  # passing a list of string as input
                outputs="y_pred",
                name="make_predictions",
            ),

            # This is not OK
            node(
                func=make_predictions,
                inputs=["X_train", "X_test", "y_train"],
                outputs=["y_pred"],  # passing a list of string as output
                name="make_predictions",
            ),

The last one gives you the error below. This is inconsistent and not a pleasant developer experience.

ValueError: Failed to save outputs of node make_predictions: 
make_predictions([X_train,X_test,y_train]) -> [y_pred].
The node definition contains a list of outputs ['y_pred'], whereas the node 
function returned a 'Series'.

In addition, there are cases that Kedro doesn't currently support, such as passing a list of datasets as a single function argument:

(edited, updated with @AntonyMilneQB example)

def func(a: pd.DataFrame, bcd: List[pd.DataFrame]): ...
node(func, inputs=["a", ["b", "c", "d"]])

This also applies to dictionaries. Currently this requires a workaround, such as changing the underlying function, which is not really a good solution.
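A minimal pure-Python sketch of the workaround described above (the function names are illustrative, not from the Kedro codebase): since a node cannot pass a nested list of datasets to one parameter, an adapter function exposes one flat argument per dataset and packs them into a list before calling the real function.

```python
# Hypothetical workaround sketch: adapt a function that wants a list of
# datasets to the flat, one-argument-per-dataset signature a node requires.

def func(a, bcd):
    # the "real" function: expects one value and a list of values
    return a + sum(bcd)

def func_adapter(a, b, c, d):
    # adapter: flat arguments in, list built internally
    return func(a, [b, c, d])

# usage in a pipeline would then be the usual flat form, e.g.:
# node(func_adapter, inputs=["a", "b", "c", "d"], outputs="out")
print(func_adapter(1, 2, 3, 4))  # → 10
```

The downside, as noted above, is that every list-taking function needs its own adapter, which couples otherwise reusable functions to one pipeline's shape.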

Context

Why is this change important to you? How would you use it? How can it benefit other users?

The more a Kedro node resembles a normal Python function, the more flexibility it provides. It's also easier to learn, since the process is nearly just copy-pasting the function call and quoting the argument names.

The first half should be possible. The second part may be more difficult, since the argument could be any Python object (e.g. a list-like object that isn't a native Python list). But even supporting only native Python types (list and dict) would be good enough.

Possible Implementation

(Optional) Suggest an idea for implementing the addition or change.

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

astrojuanlu commented 1 year ago

What about this decorator syntax that ChatGPT once suggested to a user?

[image: screenshot of the ChatGPT-suggested decorator syntax]

noklam commented 1 year ago

@astrojuanlu I once thought about this syntax too, but hesitated in the end. What would pipeline.py look like with this syntax? When functions are decorated as nodes, they are also less reusable across a project: you want to share the function but not the node, since inputs and outputs are project-specific.

WDYT?

antonymilne commented 1 year ago

There are three separate things here I have opinions on, so let me take them one by one...

1. @noklam original point that you can't pass a list of strings as output

This seems like an oversight/minor bug to me. I don't immediately see any reason why we shouldn't allow this, so we probably should fix it.

2. @noklam idea that you should be able to pass a list of inputs

I think this will have some problems, but before commenting I want to make sure I fully understand the idea. Is this what you're suggesting, where a, b, c and d are each a dataset? Or something else?

def func(a: pd.DataFrame, bcd: List[pd.DataFrame]): ...
node(func, inputs=["a", ["b", "c", "d"]])

3. Decorator syntax posted by @astrojuanlu

First of all, thank you for posting this because I hadn't seen it before and it is pretty amazing that ChatGPT came up with it.

As noted already, there's a problem here if you want to reuse the same function in multiple nodes and when you start using modular pipelines. These are not insurmountable with the decorator approach but it does get cumbersome.

That said, I actually think there's potentially a lot of value in the decorator method for simple projects. Just like auto-registration of pipelines, anything we can do to make it easier for people to make a simple project is good in my book. Certainly just slapping a node decorator on a function is easier than needing to make a pipeline.py file and probably sufficient in lots of cases.

In fact, a while ago there was a QB tool that did something similar and took the simplification even further:

I actually think there's a lot to be said for at least some aspects of this approach, although it won't ever replace the current system.

noklam commented 1 year ago

@AntonyMilneQB Yes you are correct.

antonymilne commented 1 year ago

It's an interesting idea, but I think it will be tricky to achieve and the additional complexity is probably not worth it for something that I think is not very commonly required.

If we do want to do this then definitely worth looking at Dash's callback system, which does something very similar and enables arbitrary levels of nesting with both list and dictionary syntax.
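For flavour, here is a small sketch (not Kedro API, and much simpler than Dash's actual callback machinery) of how nested list inputs could be resolved: the nested spec is flattened to get the dataset names to load, then the loaded values are regrouped to match the nesting the function expects.

```python
# Sketch only: resolving a nested inputs spec like ["a", ["b", "c", "d"]].

def flatten(spec):
    # yield every dataset name in the (possibly nested) spec
    for item in spec:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

def regroup(spec, loaded):
    # rebuild the nested structure using `loaded`, a name -> value mapping
    return [
        regroup(item, loaded) if isinstance(item, list) else loaded[item]
        for item in spec
    ]

spec = ["a", ["b", "c", "d"]]
loaded = {"a": 1, "b": 2, "c": 3, "d": 4}
print(list(flatten(spec)))    # → ['a', 'b', 'c', 'd']
print(regroup(spec, loaded))  # → [1, [2, 3, 4]]
```

Arbitrary nesting depth falls out of the recursion for free; the hard parts (dict specs, validation, error messages) are what Dash's system handles and this sketch does not.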

noklam commented 1 year ago

@AntonyMilneQB Do you think this should go into Tech Design?

I want to clarify the 1st point. A list of outputs is possible, but it throws an error when there is only one output (at least for the pandas-iris starter I tested).

I would prefer it if it always just took a list of inputs/outputs; why should you suddenly need to wrap things in a list only once you have more than one dataset? (I'd argue this should be the default in the starters as well, though it feels a bit awkward in the case of None; no strong opinion here.)

antonymilne commented 1 year ago

@AntonyMilneQB Do you think this should go into Tech Design?

Up to you really depending on how important you think it is. Personally I'd just leave it for now and see if anyone else seems to have feelings on it or mentions something similar. (Talking about points 2 and 3 above that is. We should do point 1.)

I want to clarify the 1st point. A list of outputs is possible, but it throws an error when there is only 1 output (at least for the case of pandas-iris starter I test)

Yeah, understood. I definitely think this is wrong.

I would prefer it if it always just takes a list of input/output, what's the point if you need more than 1 dataset and suddenly you need to wrap it with a list? (I argue this should be the default in the starter as well, but feels a bit awkward in case of None, no strong opinion here)

Not sure about this. Needing to wrap a single element in a list feels like a (small) unnecessary hoop that a user shouldn't have to jump through. Handling a single input/output already works ok so I don't see the need to remove that behaviour - it would be a very breaking change for little benefit. Changing starters to always put a single element in a list is more palatable but still feels kind of pointless to me since I don't think it's really better practice in any way.

merelcht commented 1 year ago

This specific issue is just to make `node` accept inputs/outputs in a consistent manner, so that a list with just one output item won't break a pipeline run. Let's continue the discussion about the @node decorator syntax here: https://github.com/kedro-org/kedro/issues/2471

inigohidalgo commented 1 year ago

Hi,

Just wanted to add some further considerations, though I'm not sure if this exactly matches the point of this issue:

There are some function signatures which accept an arbitrary number of positional arguments followed by additional keyword arguments. A commonly used example in DS is sklearn's train_test_split, which takes (*arrays, test_size, shuffle, ...) and returns the same number of arrays (×2). I haven't been able to find a clean way to use this function with an arbitrary number of input datasets. For the usual use case of splitting X and y it isn't super problematic, as I can just write a wrapper which accepts X and y as named parameters and passes them into the actual function, but sometimes I have different datasets I want to split individually, so it isn't a clean solution. I had a discussion in Slack some months back where we went over possible workarounds, but I feel it should be possible on the node definition side to match Python function signatures, with a cleaner separation between positional and keyword arguments.
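A sketch of the wrapper pattern described above, using a pure-Python stand-in for train_test_split so the example is self-contained: node inputs are named, so a `*arrays` function needs a fixed-arity wrapper per use (all names here are illustrative).

```python
# Stand-in for sklearn.model_selection.train_test_split: takes *arrays and
# returns each array cut into a "train" part and a "test" part.
def split(*arrays, test_size=0.25):
    out = []
    for arr in arrays:
        cut = int(len(arr) * (1 - test_size))
        out.extend([arr[:cut], arr[cut:]])
    return tuple(out)

# Node-friendly wrapper: fixed named parameters instead of *arrays. A new
# wrapper is needed for every different number of arrays - the friction
# being described.
def split_xy(X, y, test_size=0.25):
    return split(X, y, test_size=test_size)

# node(split_xy, inputs=["X", "y"],
#      outputs=["X_train", "X_test", "y_train", "y_test"])
print(split_xy([1, 2, 3, 4], [5, 6, 7, 8]))
# → ([1, 2, 3], [4], [5, 6, 7], [8])
```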

noklam commented 1 year ago

@inigohidalgo Hey, thank you for the feedback. Do you have anything in mind for how we could improve it, and how would you expect it to work?

inigohidalgo commented 1 year ago

Hi, I'm not sure if this should be a priority, since Kedro has been working this way for a long time and it's not a major pain point; I brought it up because I saw this issue as I was browsing the issues list.

Since nodes are meant to be pure Python functions, there's a bit of friction in having to choose either a list OR a dictionary when passing inputs to a node, since in pure Python functions you can do both at the same time (with *list and **dict). A possible solution (not super thought-out, and it might have side effects I'm not aware of) could be to add two further parameters to Node, args and kwargs, matching the Python nomenclature, and then combine those into the current input structure somehow in __init__ (using _dict_inputs_to_list?).
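A hypothetical sketch of that args/kwargs idea (this is not current Kedro behaviour, and `combine_inputs` is an invented name): the two parameters are merged into one flat list of dataset names for loading, plus a calling convention that mirrors Python's *args/**kwargs.

```python
# Hypothetical: merge node-level `args` and `kwargs` specs into a flat list
# of dataset names plus a function that rebuilds the original call.

def combine_inputs(args, kwargs):
    # args: list of dataset names passed positionally
    # kwargs: mapping of parameter name -> dataset name
    names = list(args) + list(kwargs.values())

    def call(func, loaded):
        # `loaded` maps dataset name -> loaded value
        positional = [loaded[n] for n in args]
        keyword = {param: loaded[n] for param, n in kwargs.items()}
        return func(*positional, **keyword)

    return names, call

def func(positional, *, kwarg1=None, kwarg2=None):
    return (positional, kwarg1, kwarg2)

names, call = combine_inputs(["a"], {"kwarg1": "b", "kwarg2": "c"})
loaded = {"a": 1, "b": 2, "c": 3}
print(names)               # → ['a', 'b', 'c']
print(call(func, loaded))  # → (1, 2, 3)
```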

Could also try to use similar syntax to the one Antony quoted above, but I think that could get messy and lead to unexpected behaviors:

def func(positional, *, kwarg1=None, kwarg2=None):

node(func, inputs=["a", {"kwarg1": "b", "kwarg2": "c"}])

noklam commented 1 year ago

@inigohidalgo Thank you for the comment, sorry for the late reply as I have missed this reply. Any chance you can still find the Slack thread in https://www.linen.dev/s/kedro/t/12327752/? The link you provided is expired.

but I feel it should be possible on the node definition side to match python function signatures with a cleaner separation between positional and keyword arguments.

And I agree with that: the mental model of using Kedro is always mapping some Python function to a Kedro node, but there are subtle differences that cause confusion.

inigohidalgo commented 1 year ago

This is the thread :)

astrojuanlu commented 1 year ago

Looks like the most immediate action point would be supporting outputs=["ds"] as an equivalent to outputs="ds" for symmetry with inputs?
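That symmetry fix can be illustrated with a small sketch (not the actual Kedro implementation): when the outputs spec is a single-element list, treat the whole return value as that one output instead of trying to unpack it.

```python
# Sketch of output normalisation: outputs=["ds"] behaves like outputs="ds".

def normalise_outputs(outputs, returned):
    # map the node's outputs spec and the function's return value
    # to a {dataset_name: value} dict for saving
    if outputs is None:
        return {}
    if isinstance(outputs, str):
        return {outputs: returned}
    if len(outputs) == 1:
        # ["y_pred"] is treated like "y_pred": the whole return value
        # is the single output, so no unpacking (and no ValueError)
        return {outputs[0]: returned}
    # multiple outputs: the return value must be unpackable
    return dict(zip(outputs, returned))

print(normalise_outputs("y_pred", 42))    # → {'y_pred': 42}
print(normalise_outputs(["y_pred"], 42))  # → {'y_pred': 42}, no longer an error
```

One ambiguity to resolve in a real fix: whether `["y_pred"]` should mean "the return value is the output" (as here) or "the function returns a 1-tuple to unpack"; picking the former matches the symmetry with inputs suggested above.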