nipype / pydra

Pydra Dataflow Engine
https://nipype.github.io/pydra/

Splitter unable to resolve combination of outer and element-wise products ([a x b] * c) #532

Open dafrose opened 2 years ago

dafrose commented 2 years ago

What version of Pydra are you using?

> pydra.__version__
'0.18'

Also, running in interactive mode via a jupyter notebook.

What were you trying to do?

I am trying to combine outer and element-wise products in a task or workflow splitter, where the result of an outer product is combined element-wise with a single variable. In pydra syntax, this is my_task.split((["a","b"],"c")). This can happen either when each of the variables is explicitly defined by the user, or when the result of a previous split is recombined for some group analysis and then split again the same way.

What did you expect will happen?

I expect the splitter my_task.split((["a","b"],"c")) to work for one or both of the following two variants of defining the variables:

Variant 1: c is a one-dimensional list of the same length as (a x b) (flattened)

a = [1,2]  # len(a) == 2
b = [3,4,5]  # len(b) == 3
c = [6,7,8,9,10,11]  # len(c) == 6 == 2*3

Variant 2: c is a list of the same shape as (a x b)

a = [1,2]  # len(a) = 2
b = [3,4,5]  # len(b) = 3
c = [[6,7,8],[9,10,11]]  # shape(c) == (2,3) == shape(itertools.product(a,b))

Variant 1 would actually be preferred, since it corresponds to the pure-python version of this product:

> list(zip(product(a, b), c))
[((1, 3), 6),
 ((1, 4), 7),
 ((1, 5), 8),
 ((2, 3), 9),
 ((2, 4), 10),
 ((2, 5), 11)]

What actually happened?

In both cases, pydra failed to perform the split. Variant 1 raises a ValueError (correctly) pointing out that the shapes of the two sides of the element-wise product do not agree. (The message could be improved to state the shapes being compared, but this probably works as intended.) Variant 2 throws the same error, although the shapes should actually agree.

Can you replicate the behavior? If yes, how?

Results from an interactive jupyter notebook:

# Preamble
import nest_asyncio
nest_asyncio.apply()
import pydra

@pydra.mark.task
@pydra.mark.annotate({"return": {"out_number": float,
                                 }})
def sum_of_three(a: float = 0, b: float = 0, c: float = 0):
    result = a + b + c
    print(f"{a} + {b} + {c} = {result}")
    return result

Example 1: Working with no split

my_sum = sum_of_three(a=1,
                      b=3,
                      c=5)
my_sum()
my_sum.result()

Result:

1 + 3 + 5 = 9
Result(output=Output(out_number=9), runtime=None, errored=False)

Example 2: Working with all outer products

my_sum = sum_of_three(a=[1,2],
                      b=[3,4],
                      c=[5,6,7,8])
my_sum.split(["a","b","c"])
my_sum()
my_sum.result()

Result:

1 + 3 + 6 = 10
2 + 4 + 5 = 11
2 + 3 + 5 = 10
... skipping 5 results
[Result(output=Output(out_number=9), runtime=None, errored=False),
 Result(output=Output(out_number=10), runtime=None, errored=False),
 Result(output=Output(out_number=10), runtime=None, errored=False),
... skipping 5 results]

Example 3: Failing when length matches but not shape

my_sum = sum_of_three(a=[1,2],
                     b=[3,4,5],
                     c= [6,7,8,9,10,11])
my_sum.split((["a","b"],"c"))
my_sum()
my_sum.result()

Result:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_3684992/2447886248.py in <module>
      3                      c= [6,7,8,9,10,11])
      4 my_sum.split((["a","b"],"c"))
----> 5 my_sum()
      6 my_sum.result()

# [... skipping 8 entries]

/.../site-packages/pydra/engine/state.py in splits(self, splitter_rpn)
   1024                     if shape_L != shape_R:
   1025                         raise ValueError(
-> 1026                             f"Operands {term_R} and {term_L} do not have same shape"
   1027                         )
   1028                     newshape = shape_R

ValueError: Operands sum_of_three.c and (<itertools.product object at 0x7fe29d9be960>, (2, 3)) do not have same shape

This version seems to work as intended. However, I would argue that the actual shape of the outer product [a,b] is not (2,3); rather, it should be considered a parameter vector of length 6 (with 2 parameters per entry, but that shouldn't matter), which thus has the same shape, or rather length, as the parameter vector c. As I pointed out above, this corresponds to how the built-in function for element-wise recombination, zip, works in this example. The following produces the desired combination of parameters:

> list(zip(product([1,2], [3,4,5]), [6,7,8,9,10,11]))
[((1, 3), 6),
 ((1, 4), 7),
 ((1, 5), 8),
 ((2, 3), 9),
 ((2, 4), 10),
 ((2, 5), 11)]

Example 4: Failing although shapes are correct

my_sum = sum_of_three(a=[1,2],
                      b=[3,4,5],
                      c=[[6,7,8],[9,10,11]])
my_sum.split((["a","b"],"c"))
my_sum()
my_sum.result()

Result (same error):

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_3690938/1846468932.py in <module>
      3                      c=[[6,7,8],[9,10,11]])
      4 my_sum.split((["a","b"],"c"))
----> 5 my_sum()
      6 my_sum.result()

# [... skipping 8 entries]

/.../lib/python3.9/site-packages/pydra/engine/helpers_state.py in splits(splitter_rpn, inputs, inner_inputs, cont_dim)
    486             if token == ".":
    487                 if shape["L"] != shape["R"]:
--> 488                     raise ValueError(
    489                         "Operands {} and {} do not have same shape.".format(
    490                             terms["R"], terms["L"]

ValueError: Operands sum_of_three.c and (<itertools.product object at 0x7f9fabaa3c80>, (2, 3)) do not have same shape.

Curiously, the shape test fails again, although the shapes should be identical this time. The pure-python interpretation of this code fails to produce the desired output, which makes sense because zip iterates over the first dimension only (!):

> list(zip(product([1,2], [3,4,5]), [[6,7,8],[9,10,11]]))
[((1, 3), [6, 7, 8]), ((1, 4), [9, 10, 11])]

Here the first two entries of the product [a,b] are zipped with the two inner lists of c.

Conclusion

In my opinion, it makes the most sense to implement the element-wise splitter product the same way zip works. This feels the most intuitive and corresponds to parameter variant 1 / code example 3.

Preliminary Implementation Thoughts

The function pydra.engine.helpers_state.input_shape is used to compute the shape that is supposed to be compared. It is an interesting starting point, because

  1. It explicitly states # TODO: have to be changed for inner splitter (sometimes different length)
  2. It seems unnecessarily complicated. In my opinion, it makes sense to simply replace the entire function with numpy.shape and then think about how the outer product and the element-wise (here: inner) product should be interpreted intuitively. In fact, I did exactly this as a dirty fix for the simple case where a, b and c each contain only one element, but that alone does not solve the problem for larger lists.

The outer product already successfully uses/mirrors itertools.product. I think the inner/element-wise product should simply be replaced by zip, or a re-implementation of that behaviour, and that's it. It might require some testing and careful thinking, but I reckon the entire split code in pydra.engine.helpers_state can be simplified a lot that way and become more stable at the same time.
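
To make this concrete, here is a minimal sketch of the zip-based semantics I have in mind for (["a","b"],"c") (plain Python, not pydra code; the helper name is hypothetical):

from itertools import product

def split_states(a, b, c):
    # outer product of a and b, zipped element-wise with c;
    # complain if the lengths do not line up (Variant 1 above)
    outer = list(product(a, b))
    if len(outer) != len(c):
        raise ValueError(
            f"element-wise operands differ in length: {len(outer)} vs {len(c)}"
        )
    return [{"a": av, "b": bv, "c": cv} for (av, bv), cv in zip(outer, c)]

split_states([1, 2], [3, 4, 5], [6, 7, 8, 9, 10, 11])
# -> [{'a': 1, 'b': 3, 'c': 6}, ..., {'a': 2, 'b': 5, 'c': 11}]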

Final thoughts

As always, thanks a lot for the work you have put into pydra. I think it shows a lot of promise, but some kinks still need to be ironed out before it can replace nipype in production. As I have gathered extensive experience with debugging function-based workflows in the current version of pydra, I would like to contribute some insights and hopefully code, if time allows. I would like to hear your feedback, though, before attempting to put my thoughts into code.

dafrose commented 2 years ago

Another variant would be to use numpy.concatenate, e.g.

> import numpy as np
> from itertools import product
> np.concatenate((np.array(list(product([1,2], [3,4,5]))), [[6],[7],[8],[9],[10],[11]]), axis=1)
array([[ 1,  3,  6],
       [ 1,  4,  7],
       [ 1,  5,  8],
       [ 2,  3,  9],
       [ 2,  4, 10],
       [ 2,  5, 11]])

It required reformatting c into shape (6,1), but that could also be done automatically with numpy.reshape or similar. This gets rid of the extra dimension introduced by itertools.product.
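
The reshape could presumably be folded in directly, e.g.:

import numpy as np
from itertools import product

c = [6, 7, 8, 9, 10, 11]
np.concatenate((np.array(list(product([1, 2], [3, 4, 5]))),
                np.reshape(c, (-1, 1))),  # (6,) -> (6, 1)
               axis=1)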

djarecka commented 2 years ago

Thank you for the question. I changed this several times and didn't include the final version in the tutorial... I will be improving the tutorials and will make sure to add it! So the second variant should work if you add cont_dim={"c": 2} to the task; see the example below.
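
Applied to the failing Example 4 above, that would look something like this (a sketch based on this suggestion; as the examples further down show, cont_dim can also be passed to .split()):

my_sum = sum_of_three(a=[1,2],
                      b=[3,4,5],
                      c=[[6,7,8],[9,10,11]])
my_sum.split((["a","b"],"c"), cont_dim={"c": 2})
my_sum()
my_sum.result()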

The idea behind this was that sometimes we might want to treat a 2D list as a list whose elements are lists (so 1D from the perspective of pydra inputs), but sometimes we want to treat it as a 2D object, as you do here.

Regarding numpy: we try to keep pydra as light as possible, so we avoid packages that are not part of the standard library whenever we can.

dafrose commented 2 years ago

Thanks for your reply, @djarecka

So the second variant should work if you add cont_dim={"c": 2} to the task, see an example

ah, I see. I did notice the cont_dim argument in the code, but I wasn't able to figure out at a glance what it actually does, or how and where it can be set.

The idea behind this was that sometimes we might want to treat 2D list as a list with elements that are lists (so 1D from the perspective of pydra inputs), but sometimes we want to treat as a 2d object, as you want here.

So with that in mind, it makes sense to use the cont_dim argument. However, wouldn't variant 1 be cleaner, i.e. always treating element-wise products the same way zip works? Do you have a good example of a case where a list should be treated as a 2D object that could not also be accomplished with the flattened version of the same list? Otherwise, it would in my opinion make more sense to mimic the behaviour of zip, because that seems more intuitive and clearer in terms of explaining what the code does. At the same time, I haven't played around with not recombining every split when there are multiple tiers of splits. In my code, I always get back flattened lists in the end that I split up again as needed.

Regarding numpy, we tried to write pydra as light as possible, so we avoid packages that are not part of the standard library if possible.

I think numpy is not necessary for this case, although I would argue that a dependency on numpy is not a bad thing ~, because while numpy will probably/hopefully never be part of the standard library, it is one of the most widely used external libraries. If your nipype or pydra pipeline contains any computation in python, chances are numpy will be involved anyway. Only if you exclusively glue purely non-python external tools with pydra might you get away with not downloading numpy. So if numpy simplifies or improves the readability of pydra code, I would make a strong argument for it. But I guess that is a discussion for another time and place...~

djarecka commented 2 years ago

I'm not saying that my examples couldn't be accomplished with the flattened version of the same list. I guess at some point I thought that an explicit 2D version was better (at least I don't remember anything preventing me from adding the flattened version), but you might be right that this would be cleaner...

We have to finish some other changes first, but later this month I would like to come back to the State class, so perhaps we can change it then. I'll keep this open.

dafrose commented 2 years ago

ok, sounds great! I guess this is basically a design decision. The discussion about using numpy aside, I think the 1D/first-dimension solution is a lot simpler to implement, because you only ever need to compare lengths, both in the element-wise (zip) and in the outer (itertools.product) product. 2D or "nD" has the theoretical advantage that you can pass the output of tiered splits directly to a new split without using .combine() on everything, but you can achieve the same with flattened lists as long as you have access to the original splitter arguments that created the list. I think the latter should always be the case, or at least I can't think of a counterexample. Also, higher dimensionality adds a lot more overhead compared to strict 1D. Error handling is probably also a lot more straightforward, from the development perspective as well as the user perspective.

satra commented 2 years ago

i think the challenge is with combining, and something needs to retain the shape of the original splitter operation. in the above example, technically one should be able to combine with either a or b or both, even after applying the scalar splitter c, whereas if c was 3x2, and flattening was allowed, one would be able to make this work via flattening, but combining would not make sense afterwards.

a = [1, 2]
b = [1, 2, 3]
c = [3, 4, 5]
d = [3, 4]
.split((["a", "b"], ["c", "d"]))

this split would work if flattened/zipped, but one would not be able to combine on the individual dimensions anymore. so perhaps there needs to be a conscious notion of when flatten is allowed and what it means in terms of remaining state after it is applied.
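
for concreteness, this is what the flattened/zipped pairing would do in plain python (illustration only, not pydra code):

from itertools import product

a = [1, 2]
b = [1, 2, 3]
c = [3, 4, 5]
d = [3, 4]

# both outer products have 6 elements, so a flat zip lines them up:
list(zip(product(a, b), product(c, d)))
# [((1, 1), (3, 3)), ((1, 2), (3, 4)), ((1, 3), (4, 3)),
#  ((2, 1), (4, 4)), ((2, 2), (5, 3)), ((2, 3), (5, 4))]
# but each pair no longer carries separate a/b (or c/d) indices,
# so combining over just one of those dimensions is no longer well-defined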

effigies commented 2 years ago

2D or "nD" has the theoretical advantage that you can pass the output of tiered splits directly to a new split without using .combine() on everything

Right, the idea was to be able to combine over any splitting dimension without affecting other (outer-joined) dimensions. The semantics of [a x b] * c are weird. If I combine across a, is the splitter now b * c' where c' is a set of lists of length len(a)? Do my tasks have access to that variable? Is it still called c?

I think this ambiguity applies even if c is a 2D array; we've now applied one name to two dimensions. I would suggest that in this case, we un-flatten to match the outer-joined dimensions, and the name c becomes inaccessible when any of the matched dimensions are combined.

satra commented 2 years ago

if i remember the implementation, one could combine over a, b, [a, b] or c as valid options after such a split. combining with any of the first two would leave the other dimension in state, while combining with any of the last two would leave no state.

i think the situation where such a split could arise is more of the pattern where c is really [a, b] through a different graph route and then gets paired later.

effigies commented 2 years ago

Apologies for the long post. We have two use cases being discussed:

  1. Via @satra's comment, it is often the case that c = f(a, b), and we want to ensure that c * [a x b] appropriately maps the indices of c onto the corresponding indices of a and b. In this context, what should we do if c is flat?
  2. In the original post, c has no relation to a and b, and it is inconvenient to explicitly reshape it to match the dimensions of a and b when there is a clear ravel order (row-major) that is common to numpy and itertools.product.

I'm not convinced that case 2 justifies doing away with the nested structure, as it naturally provides some level of shape checking. But whether we flatten and track the shape separately or un-flatten inputs, I do think we can have sensible semantics that allow for implicit association of flattened variables with multiple state dimensions.

To be annoying, I want to explore the idea with jagged state arrays, where the length of "b" depends on the value of "a". To make this concrete, let's suppose that we have two subjects, one with two runs and one with three.

import typing as ty
import pydra

@pydra.mark.task
def get_runs(sub: int) -> ty.List[int]:
    if sub == 1:
        return [1, 2]
    elif sub == 2:
        return [1, 2, 3]

subs = [1, 2]
runs = get_runs(sub=subs).split("sub")
# State
# dims = ["sub"]
# indices = [(0,), (1,)]
#
# Logical view:
# [{"sub": 1}, {"sub": 2}]

# get_files(sub: int, run: int) -> str
files = get_files(sub=runs.sub, run=runs.out).split("run")
# State
# dims = ["sub", "run"]
# indices = [[(0, 0), (0, 1)],
#            [(1, 0), (1, 1), (1, 2)]]
#
# Logical view:
# [[{"sub": 1, "run": 1}, {"sub": 1, "run": 2}],
#  [{"sub": 2, "run": 1}, {"sub": 2, "run": 2}, {"sub": 2, "run": 3}]]
Now if I want to add a new variable that might vary by subject and run:

nested_mean_RTs = [[75, 79], [82, 79, 83]]
flat_mean_RTs = [75, 79, 82, 79, 83]

# doing_something(fname: str, mean_RT: int) -> ty.Any
doit = doing_something(fname=files.out, mean_RT=nested_mean_RTs).split((["sub", "run"], "mean_RT"))
# State
# dims = (["sub", "run"], "mean_RT")
# indices = [[(0, 0), (0, 1)],
#            [(1, 0), (1, 1), (1, 2)]]
#
# Logical view:
# [[{"sub": 1, "run": 1, "mean_RT": 75}, {"sub": 1, "run": 2, "mean_RT": 79}],
#  [{"sub": 2, "run": 1, "mean_RT": 82}, {"sub": 2, "run": 2, "mean_RT": 79}, {"sub": 2, "run": 3, "mean_RT": 83}]]

I would say we could equally well use nested_mean_RTs or flat_mean_RTs, as long as it's understood that this means un-flattening in row-major order with respect to ["sub", "run"]. Because it's a jagged list, we would need our own method for un-flattening, but I think the iteration order is clear.
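
A row-major un-flattening for a jagged list could look something like this (an illustrative helper, not pydra API):

def unflatten_jagged(flat, lengths):
    # split a row-major flat list into a jagged nested list
    out, i = [], 0
    for n in lengths:
        out.append(flat[i:i + n])
        i += n
    if i != len(flat):
        raise ValueError("flat list does not match the jagged shape")
    return out

unflatten_jagged([75, 79, 82, 79, 83], lengths=[2, 3])
# -> [[75, 79], [82, 79, 83]] == nested_mean_RTs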

However, suppose I had flattened the mean RTs in column-major order, as colmajor_mean_RTs = [75, 82, 79, 79, 83]. Then I would specify that I will un-flatten with respect to ["run", "sub"].

doit = doing_something(fname=files.out, mean_RT=colmajor_mean_RTs).split((["run", "sub"], "mean_RT"))

This should get us to the same logical view, but we will either need a reordered mean_RT list or an alternative indexer into the mean_RT list. Reordering would be destructive and possibly result in some counter-intuitive situations.

The case suggests a simple index proxy (sketched here as a dataclass so it runs):

from dataclasses import dataclass

@dataclass
class Indexer:
    vals: list
    idcs: list

    def __getitem__(self, idx):
        # map a (sub, run) state index through the index table
        # into the flat (here column-major) value list
        i, j = idx
        return self.vals[self.idcs[i][j]]

mean_RTs = Indexer(vals=colmajor_mean_RTs.copy(), idcs=[[0, 2], [1, 3, 4]])
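
For instance, with the column-major values above (hypothetical usage):

colmajor_mean_RTs = [75, 82, 79, 79, 83]
mean_RTs = Indexer(vals=colmajor_mean_RTs.copy(), idcs=[[0, 2], [1, 3, 4]])
mean_RTs[(0, 1)]  # -> 79, mean RT for sub 1, run 2
mean_RTs[(1, 0)]  # -> 82, mean RT for sub 2, run 1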

This would allow us to go in reverse, as well. If I .combine("mean_RT") then I could get back the flattened list instead of it being equivalent to .combine("a", "b").

I think this gets us to a semantically coherent situation that would allow us to reshape implicitly: unflattening in and flattening out. As said above, it's not obvious to me whether keeping track of the shape of flattened state will lead to simpler or more complicated code, but I think the result needs to be logically equivalent to this case, so reshaping and possibly reordering would need to be implemented on whatever data structure underlies state.

The final question I had was: what happens if we now .combine("run")?

doit.combine("run")
# State:
# dims = ["sub"]
# indices = [(0,), (1,)]
# 
# Logical view:
# [{"sub": 1}, {"sub": 2}]
#
# OR
#
# State:
# dims = [("sub", "mean_RT")]
# indices = [(0,), (1,)]
# 
# Logical view:
# [{"sub": 1, "mean_RT": [75, 79]}, {"sub": 2, "mean_rt": [82, 79, 83]}]

I would vote for the first. mean_RT was an alternate name for a pair of dimensions; once the pair no longer exists, it would be less intuitive for the variable to change shape than for it to become inaccessible.

djarecka commented 2 years ago

Thanks for the comments. I will have to think a bit about all the examples. But after thinking about this for a moment, I do prefer to keep the idea of dimensions. Before allowing a flattened version we would have to rethink the examples provided by Chris.

Just two general comments:

I should come back and create some good examples to discuss before refactoring the State class, just to refresh memory. I will be sure to include the ones we discuss here.

dafrose commented 2 years ago

Thanks from me as well. I understand the logic and intention of the current implementation a lot better now.

In my opinion, the current usage of cont_dim as an argument to split should cover all relevant cases, even though the internal implementation still needs some tweaking. If I understand correctly, in order to make it easier to recreate the original ordering for use with .combine(), you would prefer to keep the folded version of a list, or even fold/un-flatten a flat list. Flattening/reshaping can be reversed uniquely for n-dimensional lists as long as they can be described by a proper shape (no jaggedness). Thus it wouldn't matter whether you used the folded or the flattened representation, and the flattened version would arguably be simpler to implement (if you prefer not to use numpy). However, if jagged lists are allowed, as in the example by @effigies, you need to know the original jagged structure of the list in order to reverse a flattening uniquely. So if you need to remember the original folding of the list anyway, it makes sense to simply keep it that way and possibly fold any flat lists accordingly.
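
For rectangular lists, the round trip is indeed lossless, e.g.:

from itertools import chain

nested = [[6, 7, 8], [9, 10, 11]]          # shape (2, 3)
flat = list(chain.from_iterable(nested))   # [6, 7, 8, 9, 10, 11]
refolded = [flat[i * 3:(i + 1) * 3] for i in range(2)]
assert refolded == nested                  # reversible, given the shape (2, 3)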

At the same time, with jagged lists it no longer makes sense to compare (and compute) shapes. Instead, you could either compare lengths for each sub-list, or simply try to match entries from both sides of a product operation until you hit an IndexError (or are left with a non-empty remainder on one side only) and then complain to the user. If you assume that the time it takes to compute the split is negligible compared to the wall time of an individual task, I would prefer "match until error". Otherwise it makes sense to check beforehand, but that is more complex to implement and costs extra computation time as well.
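
"Match until error" could be as simple as this sketch (on Python 3.10+, zip(..., strict=True) does the same):

from itertools import zip_longest

_MISSING = object()

def zip_strict(left, right):
    # pair entries element-wise; complain on a non-empty remainder
    for left_item, right_item in zip_longest(left, right, fillvalue=_MISSING):
        if left_item is _MISSING or right_item is _MISSING:
            raise ValueError("element-wise splitter operands differ in length")
        yield left_item, right_item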

dafrose commented 2 years ago

I have a syntax question regarding this example by @effigies

doit = doing_something(fname=files.out, mean_RT=colmajor_mean_RTs).split((["run", "sub"], "mean_RT"))

Does the current implementation allow splitter arguments that were used in a previous split and are not inputs to the current task (as in this example)? Or is that a simplification for the example?

effigies commented 2 years ago

It was a pretty major design decision to be able to create an inner join with existing state dimensions, so it should be implemented. That said, I might have gotten the syntax wrong; it might need to be ["get_files.run","get_runs.sub"] to be unambiguous.

dafrose commented 2 years ago

thanks @effigies. I haven't tried that myself yet. As I mentioned above, I simply recombine everything on the workflow level, so I don't need to bother thinking about how this works... :-)

And as a follow-up, as I tried to fix my own code: this line in pydra.engine.helpers_state.input_shape

386 if isinstance(value, list) and cont_dim > 0: ...

strictly enforces the use of lists in splitters and fails for numpy.ndarray. The user can work around this by calling .tolist(), but the error you get does not explain what went wrong. Instead, the array is treated as a one-dimensional list and you get a shape mismatch error. Is this something that you would like users to solve by themselves (always make sure that inputs are lists and not arrays), or would you prefer pydra to check that the input is indeed a list and not an array?
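
If the latter, a hypothetical early guard (not pydra API) could be as simple as:

def check_splitter_value(name, value):
    # fail early with an actionable message instead of a confusing shape mismatch
    if type(value).__module__ == "numpy":
        raise TypeError(
            f"splitter input {name!r} is a numpy array; "
            "convert it with .tolist() before splitting"
        )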

EDIT: I think part of the issue is that input_shape does not complain if cont_dim has not been fully reduced to 0. I think that always means something went differently than expected, and it should raise an error.

I didn't post this as a separate issue because it is related to the current discussion and the possible change to the treatment of shapes, but I can repost if you prefer.

dafrose commented 2 years ago

I refactored parts of a workflow to work with higher-dimensional input lists in a splitter and the cont_dim argument. I did get it to work with one folded list like so:

my_workflow.my_task.split((["in_1", "in_2", ("in_3a", "in_3b")], "in_4"), cont_dim=dict(in_4=3))

but it failed in a more complicated case with two folded lists of different dimensions:

my_workflow.my_task2.split(([([("in_1a", "in_1b"), "in_2"], "in_3"), "in_4", ("in_5a", "in_5b")], "in_6"), 
    cont_dim=dict(in_6=4, in_3=2))

In this case, the splitter does not complain about dimensions, but something else goes wrong that I don't understand yet.

In my workflow, I process independent files from subjects + runs per subject. In order to compare variations of the processing, I also split over some processing parameters. Somewhere I need to recombine all or some of the splits in order to assess a subject-level or group-level property, but I also want to split up again the same way afterwards. In that particular use case, I find it easier to wrap my head around .combine()-ing everything until I get a flat list rather than figuring out how to handle split results. After doing the analysis on the flattened lists, it would be very convenient to be able to simply pass the flat lists alongside the multi-tier split and let pydra do the re-ordering. In particular, I do not change the order of the lists at all myself, so it should be straightforward to recreate the original ordering.

I understand that the cont_dim argument to .split() does work somewhat as intended, and once you fully grasp how states are passed around in pydra, it may actually be very powerful. However, would it be possible to also add a separate argument like unflatten_input=["my_input1", "my_input42"] to automatically reshape the input as needed, as long as its length fits the product of the dimensions it needs to be cast against?

dafrose commented 2 years ago

I have tried to remove .combine() calls from a workflow and ran into the reason why I put them there in the first place (and which led to the discussion I opened here): say you have tasks A, B and C.

If I understand correctly, there is no way to selectively recombine splits for some follow-up tasks and not for others, is there? In that case, it would be very useful to be able to zip flat lists alongside outer products.

EDIT: A needs to be recombined because any follow-up task of a split is automatically split the same way, unless the split was recombined before.

satra commented 2 years ago

@dafrose - have you tried inserting a passthrough task whose only function is to combine? i vaguely remember discussing this with @djarecka in the past, but left it as an edge case that could potentially be handled by a simple function.

dafrose commented 2 years ago

Personally, I would prefer to avoid doing that with a passthrough task, due to the additional boilerplate code. That said, what would be the expected syntax? I have experimented with that in the past, but without success, so I assumed it would not be supported.

satra commented 2 years ago

essentially something like this:

@pydra.mark.task
def passthrough(arg1):
    return arg1

passthroughtask = passthrough(arg1=prev_task.lzout.out).combine(splitter_info)

now the output of this task should have the combined state.

effectively you are asking for options to both combine and not combine and that's challenging because one branch is dependent on state and the other removes state.

dafrose commented 2 years ago

Thanks @satra for the suggestion. After some fiddling I got your example running. I needed to set up a workflow in order to make it work, because chaining tasks without a workflow did not seem to work. Is that expected?

EDIT: to be clear, using lazy fields via mytask.lzout.out did not work without a workflow. It works if you use mytask.result().output.out as the input for the next task.

The example is a bit larger for improved clarity:

import nest_asyncio
nest_asyncio.apply()  # --> jupyter notebook
import pydra

# Separate functions for split and combine to get the prints working.
# Seemed to run only once, if I used the same function for both
@pydra.mark.task
def splitfunc(in_arg):
    print(f"Inside `splitfunc`: in_arg == {in_arg}\n")
    return in_arg

@pydra.mark.task
def combinefunc(in_arg):
    print(f"Inside `combinefunc`: in_arg == {in_arg}\n")
    return in_arg

@pydra.mark.task
def sum_(in_arg):
    import numpy as np
    output = np.sum(np.array(in_arg))
    print(f"Inside `sum_`: in_arg == {in_arg}, out == {output}\n")
    return output

wf1 = pydra.Workflow(name="wf1", input_spec=["a", "b"], a=[1,2,3])
wf1.add(splitfunc(name="passthrough_splitter",
                  in_arg=wf1.lzin.a,
                  ))
wf1.passthrough_splitter.split("in_arg")
wf1.add(combinefunc(name="passthrough_combiner",
                    in_arg=wf1.passthrough_splitter.lzout.out,
                    ))
wf1.passthrough_combiner.combine("passthrough_splitter.in_arg")
wf1.add(sum_(name="sum_split",
             in_arg=wf1.passthrough_splitter.lzout.out,
             ))
wf1.add(sum_(name="sum_combined",
             in_arg=wf1.passthrough_combiner.lzout.out,
             ))
wf1.set_output([("split_out", wf1.passthrough_splitter.lzout.out),
                ("combined_out", wf1.passthrough_combiner.lzout.out),
                ("summed_split", wf1.sum_split.lzout.out),
                ("summed_combined", wf1.sum_combined.lzout.out)
                ])
Outputs:

> wf1(rerun=True)
Inside `splitfunc`: in_arg == 1
Inside `splitfunc`: in_arg == 3
Inside `splitfunc`: in_arg == 2
Inside `combinefunc`: in_arg == 3
Inside `sum_`: in_arg == 3, out == 3
Inside `combinefunc`: in_arg == 1
Inside `sum_`: in_arg == 2, out == 2
Inside `combinefunc`: in_arg == 2
Inside `sum_`: in_arg == 1, out == 1
Inside `sum_`: in_arg == [1, 2, 3], out == 6
Result(output=Output(split_out=[1, 2, 3], combined_out=[1, 2, 3], summed_split=[1, 2, 3], summed_combined=6), runtime=None, errored=False)

Within a workflow it seems to work as expected. However, I needed to inspect the State objects in order to figure out what to put into .combine(). Would it be possible to mention the available splitter names in the error message PydraStateError: all combiners have to be in the splitter? But yes, using a simple passthrough task to combine a split on a single branch would be a viable solution.

satra commented 2 years ago

it's a bug if it doesn't work without a Workflow. i'll let @djarecka comment on both, since she is much more familiar with the internals.

satra commented 2 years ago

for a combiner or splitter exception, it would be useful to list the names of available states, which should be accessible given the inputs to the task.

djarecka commented 2 years ago

I understood that it works in a workflow (@dafrose please correct me if I'm wrong)

I'm not sure I correctly understand the issue with tasks outside workflows. Do you want to have two tasks, one with a splitter and one with a combiner, and not have them in a workflow? I think this should fail; that is how the combiner was designed. You can only combine fields that are in a splitter (of either a task or a workflow).

I agree, the exception message should be improved, will do it

dafrose commented 2 years ago

thanks @djarecka . Regarding chaining tasks outside a workflow: If I find the time, I'll look into it again and open a separate issue.

djarecka commented 2 years ago

Thank you!