noklam opened this issue 1 year ago
What about this decorator syntax that ChatGPT once suggested to a user?
@astrojuanlu I once thought about this syntax too, but I hesitated in the end. What would `pipeline.py` look like with this syntax? When functions are decorated as nodes, they are also less reusable across the project: you want to share the function but not the node, since `inputs` and `outputs` are project-specific.
WDYT?
There are three separate things here I have opinions on, so let me do them one by one...
This seems like an oversight/minor bug to me. I don't immediately see any reason why we shouldn't allow this, so we probably should fix it.
I think this will have some problems, but before commenting I want to make sure I fully understand the idea. Is this what you're suggesting, where `a`, `b`, `c` and `d` are each a dataset? Or something else?
```python
def func(a: pd.DataFrame, bcd: List[pd.DataFrame]): ...

node(func, inputs=["a", ["b", "c", "d"]])
```
First of all, thank you for posting this because I hadn't seen it before and it is pretty amazing that ChatGPT came up with it.
As noted already, there's a problem here if you want to reuse the same function in multiple nodes, and again when you start using modular pipelines. These problems are not insurmountable with the decorator approach, but it does get cumbersome.
That said, I actually think there's potentially a lot of value in the decorator method for simple projects. Just like auto-registration of pipelines, anything we can do to make it easier for people to make a simple project is good in my book. Certainly just slapping a `node` decorator on a function is easier than needing to make a `pipeline.py` file, and probably sufficient in lots of cases.
In fact, a while ago there was a QB tool that did something similar and took the simplification even further:

* no need to specify the `inputs` mapping unless it's something other than the default
* no need to specify the `outputs` mapping unless it's something other than the default
* every function automatically becomes a `node`, so that you don't even need to write a pipeline.py file at all
* functions are named `node_xyz`, and the tool automatically makes the pipeline based on this

I actually think there's a lot to be said for at least some aspects of this approach, although it won't ever replace the current system.
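For concreteness, a minimal sketch of what a decorator-plus-registry approach could look like. This is entirely hypothetical: `as_node`, `_REGISTRY`, and the dataset names are invented for illustration, not Kedro API.

```python
from kedro.pipeline import node, pipeline

_REGISTRY = []  # hypothetical module-level collection of decorated nodes


def as_node(inputs=None, outputs=None):
    """Hypothetical decorator: build a Kedro node from a plain function and register it."""
    def decorator(func):
        _REGISTRY.append(node(func, inputs=inputs, outputs=outputs))
        return func  # the undecorated function stays importable and reusable elsewhere
    return decorator


@as_node(inputs=["example_iris_data", "parameters"], outputs="model_input")
def preprocess(data, parameters):
    ...


def create_pipeline(**kwargs):
    # pipeline.py collapses to collecting whatever was registered
    return pipeline(_REGISTRY)
```

Note this also makes noklam's reusability concern visible: the dataset names are baked in at decoration time, which is exactly why sharing the function but not the node gets awkward.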
@AntonyMilneQB Yes, you are correct.
It's an interesting idea, but I think it will be tricky to achieve and the additional complexity is probably not worth it for something that I think is not very commonly required.
* The implementation looks fiddly (it would mean changes to the `node._run_with_*` methods).
* Do we allow arbitrary levels of nesting, like `["a", ["b", ["c", "d"]]]`? Flat only would be easier and probably handle 90% of the cases where this would be useful, but having the general case feels better.
* Presumably we'd want the same for `output`, which adds even more complexity.
* There's already a way to do this with `*args`/`**kwargs` as in https://github.com/kedro-org/kedro/issues/1228#issuecomment-1033841117 (see the sketch after this list). The limitation here is that you could only have one set of `*args` per function rather than multiple `List` arguments, and if you've already got a function that accepts a list of dataframes then you need to change its signature to accept `*args` instead. I know that re-writing is exactly what you're trying to avoid here, but it's quite a simple re-write for something that's not done very commonly, I think.

If we do want to do this then it's definitely worth looking at Dash's callback system, which does something very similar and enables arbitrary levels of nesting with both list and dictionary syntax.
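For reference, the `*args` workaround looks roughly like this: a sketch with assumed dataset names, not the exact snippet from the linked comment. When `inputs` is a list, Kedro passes each loaded dataset as a separate positional argument, so a `*args` signature absorbs however many are listed.

```python
import pandas as pd

from kedro.pipeline import node


def concat_tables(*dfs: pd.DataFrame) -> pd.DataFrame:
    # all listed inputs arrive as positional arguments, however many there are
    return pd.concat(dfs)


# Three datasets today, four tomorrow: only the inputs list changes.
node(concat_tables, inputs=["part_1", "part_2", "part_3"], outputs="combined")
```

The cost, as the comment says, is that a function already written to take a `List[pd.DataFrame]` argument has to be rewritten to take `*dfs` instead.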
@AntonyMilneQB Do you think this should go into Tech Design?
I want to clarify the 1st point. A list of outputs is possible, but it throws an error when there is only 1 output (at least for the `pandas-iris` starter I tested).

I would prefer it if it always just took a list of inputs/outputs. Why is a single dataset passed bare, while more than 1 dataset suddenly needs to be wrapped in a list? (I'd argue this should be the default in the starter as well, but it feels a bit awkward in the case of `None`; no strong opinion here.)
> @AntonyMilneQB Do you think this should go into Tech Design?
Up to you really depending on how important you think it is. Personally I'd just leave it for now and see if anyone else seems to have feelings on it or mentions something similar. (Talking about points 2 and 3 above that is. We should do point 1.)
> I want to clarify the 1st point. A list of outputs is possible, but it throws an error when there is only 1 output (at least for the `pandas-iris` starter I tested).
Yeah, understood. I definitely think this is wrong.
> I would prefer it if it always just took a list of inputs/outputs. Why is a single dataset passed bare, while more than 1 dataset suddenly needs to be wrapped in a list? (I'd argue this should be the default in the starter as well, but it feels a bit awkward in the case of `None`; no strong opinion here.)
Not sure about this. Needing to wrap a single element in a list feels like a (small) unnecessary hoop that a user shouldn't have to jump through. Handling a single input/output already works ok so I don't see the need to remove that behaviour - it would be a very breaking change for little benefit. Changing starters to always put a single element in a list is more palatable but still feels kind of pointless to me since I don't think it's really better practice in any way.
This specific issue is just about making `node` accept inputs/outputs in a consistent manner, so that a list with just one output item won't break a pipeline run. Let's continue the discussion about the `@node` decorator syntax here: https://github.com/kedro-org/kedro/issues/2471
Hi,
Just wanted to add some further considerations, though I'm not sure if this exactly matches the point of this issue:
There are some function signatures which accept an arbitrary number of positional arguments and then additional keyword arguments. A commonly-used example in DS is sklearn's `train_test_split`, which takes `(*arrays, test_size, shuffle, ...)` and returns the same number of arrays (x2). I haven't been able to find a clean way to use this function with an arbitrary number of input datasets. For the usual use case of splitting X and y it isn't super problematic, as I can just write a wrapper which accepts X and y as named parameters and passes those into the actual function (see the sketch below), but sometimes I have different datasets I want to split individually, so it isn't a clean solution. I had a discussion in the Slack some months back where we went over possible workarounds, but I feel it should be possible on the node definition side to match Python function signatures with a cleaner separation between positional and keyword arguments.
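The kind of wrapper meant here, sketched under assumed dataset and parameter names:

```python
from sklearn.model_selection import train_test_split

from kedro.pipeline import node


def split_xy(X, y, test_size):
    # pin the number of arrays so the signature maps cleanly onto node inputs/outputs
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    return X_train, X_test, y_train, y_test


node(
    split_xy,
    inputs=["model_input_x", "model_input_y", "params:test_size"],
    outputs=["X_train", "X_test", "y_train", "y_test"],
)
```

Splitting a different number of arrays means writing another wrapper, which is why this doesn't generalize.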
@inigohidalgo Hey, thank you for the feedback. Do you have anything in mind for how we could improve it, and how would you expect it to work?
Hi, I'm not sure if this should be a priority, since Kedro has been working this way for a long time, and it's not a major pain-point, I brought it up because I saw the issue as I was browsing the issues list.
Since nodes are meant to be pure Python functions, there's a bit of friction in the disparity between having to choose either a list OR a dictionary when passing inputs to a node, as in pure Python functions you can do both at the same time (with `*list` and `**dict`). A possible solution (not super thought-out, and it might have side-effects I'm not aware of) could be to add two further parameters to `Node`, `args` and `kwargs`, matching the Python nomenclature, and then in the `__init__` somehow combine those into the current input structure (using `_dict_inputs_to_list`?), as sketched below.
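Spelled out, that proposal might look like this. This is hypothetical: `args` and `kwargs` are not parameters of Kedro's `Node` today, and the dataset names are invented.

```python
from kedro.pipeline import node


def train(X, y, n_estimators=100, max_depth=None): ...


# Hypothetical: positional and keyword inputs declared separately,
# mirroring Python's *args / **kwargs split. Not valid Kedro today.
node(
    train,
    args=["model_input_x", "model_input_y"],
    kwargs={"n_estimators": "params:n_estimators", "max_depth": "params:max_depth"},
    outputs="model",
)
```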
Could also try to use similar syntax to the one Antony quoted above, but I think that could get messy and lead to unexpected behaviors:
```python
def func(positional, *, kwarg1=None, kwarg2=None): ...

node(func, inputs=["a", {"kwarg1": "b", "kwarg2": "c"}])
```
@inigohidalgo Thank you for the comment, and sorry for the late reply; I missed it. Any chance you can still find the Slack thread in https://www.linen.dev/s/kedro/t/12327752/? The link you provided has expired.
> but I feel it should be possible on the node definition side to match Python function signatures with a cleaner separation between positional and keyword arguments.
And I agree with that: the mental model of using Kedro is always mapping some Python function to a Kedro node, but there are subtle differences that cause some confusion.
Looks like the most immediate action point would be supporting `outputs=["ds"]` as an equivalent to `outputs="ds"`, for symmetry with `inputs`?
Description
As a Kedro user, you often need to translate Python code into a Kedro node. There are some strange behaviors here, as well as potential to make `node` resemble a Python function more closely. This piece of code hasn't been touched for 4+ years.
The last one gives you an error. This is inconsistent and not a pleasant developer experience.
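The original snippet didn't survive here; an illustrative reconstruction of the inconsistency, with hypothetical function and dataset names:

```python
from kedro.pipeline import node


def make_report(df):
    return df.describe()  # a single return value


node(make_report, inputs="raw_data", outputs="report")     # fine
node(make_report, inputs=["raw_data"], outputs="report")   # also fine: single input as a list
node(make_report, inputs="raw_data", outputs=["report"])   # fails at run time: a one-element
                                                           # outputs list expects the function to
                                                           # return a list/tuple, not a bare value
```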
In addition, there are certain cases that Kedro doesn't support currently, like passing a list of data as a single function argument (edited, updated with @AntonyMilneQB's example).
This also applies to dictionaries. Currently, this requires a workaround like changing the underlying function, but it's not really a good solution.
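Concretely, neither of these currently works (hypothetical syntax, echoing @AntonyMilneQB's example above):

```python
from typing import Dict, List

import pandas as pd

from kedro.pipeline import node


def func(a: pd.DataFrame, bcd: List[pd.DataFrame]): ...


def func2(a: pd.DataFrame, others: Dict[str, pd.DataFrame]): ...


# Nested list of dataset names -> one list-valued argument: not supported.
node(func, inputs=["a", ["b", "c", "d"]])

# Nested dict of dataset names -> one dict-valued argument: not supported either.
node(func2, inputs=["a", {"b": "b", "c": "c"}])
```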
Context
The more the Kedro node resembles a normal Python function, the more flexibility it can provide. It's also easier to learn, as the process is nearly just copy-pasting the argument list and quoting the arguments as strings.
The 1st half should be possible. The 2nd part may be more difficult, since the argument could be any Python object (e.g. a list-like object that is not a native Python `List`). But even supporting just the native Python types (`List` & `Dict`) would be good enough.