nipype / pydra

Pydra Dataflow Engine
https://nipype.github.io/pydra/
Other
119 stars 57 forks source link

Implicit lazy-field operation tasks #724

Open tclose opened 7 months ago

tclose commented 7 months ago

What would you like changed/added and why?

When connecting a lazy-out field to a downstream node, there are often minor operations to be applied to the field value, e.g. output_prefix = wf.lzin.subject_name + "_" or json_file=wf.lzin.input_niftix.json_file (see #723). When learning how to write Pydra workflows, users often intuitively expect that such operations should work and instead of needing to create custom function task.

I would like Pydra in such cases to automatically create a function task to perform the operation and insert it into the workflow transparently to the user.

What would be the benefit? Does the change make something easier to use?

The changes would be more

effigies commented 7 months ago

I don't think we should create full-on tasks that can be cached. If these operations are non-trivial, people can write tasks.

If you want to do this, I would start with an explicit model before making it implicit. I think we could provide a similar lazy interface as a LazyField that is a partial application of a function on LazyFields and/or values.

class LazyValue(Protocol):
    ...

class LazyField(LazyValue):
    ...

class Partial(LazyValue):
    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def apply(self):
        # Turn any LazyValues into values
        args, kwargs = resolve(self.args, self.kwargs)
        return self.func(*args, **kwargs)

    ...

The above examples would become:

import operator as op

output_prefix = Partial(op.add, wf.lzin.subject_name, "_")
json_file = Partial(op.getattr, wf.lzin.input_niftix, "json_file")

You could start adding to LazyField with something like:

class LazyField:
    def __add__(self, value):
        return Partial(op.add, self, value)

    def __getattr__(self, key):
        return Partial(op.getattr, self, key)

But where do you stop and why? Also, that __getattr__ makes me pretty nervous. And I can just see a linter-driven "cleanup" converting example one to output_prefix = f"{wf.lzin.subject_name}_" and nobody catching it because __str__ is always valid to call, and get output_prefix = "<LazyField at 0xABad1dea>_".

IMO explicit is better than implicit, here, and I think a clear model of these lightweight linking ops could be good. We could even have helpers like:

from functools import partial

Add = partial(Partial, op.add)
GetAttr = partial(Partial, op.getattr)
tclose commented 7 months ago

Yes, I was thinking that it wouldn't need to be a full blown task after I posted this. I like your Partial suggestion.

I was thinking to restrict __getattr__ calls to attributes/properties of the lazy-field type so they can be checked at construction time. Likewise we could only allow operators for field types that implement __add__, __mul__, ....

tclose commented 7 months ago

IMO explicit is better than implicit, here, and I think a clear model of these lightweight linking ops could be good. We could even have helpers like:

from functools import partial

Add = partial(Partial, op.add)
GetAttr = partial(Partial, op.getattr)

Not sure I get this, what is the explicit case in this example, using partial instead of an operator? One of the benefits as I see it is to align with user's expectations, and I have seen multiple new users expect to be able to do some basic operations on lazy fields

tclose commented 7 months ago

I'm not sure whether allowing __getattr__ on methods is a good idea or not. I can see a use case for some string manipulation methods, e.g.

wf.add(
    ATask(
        id=wf.lzin.a_field.replace('-', '_')
        ...

but agree that it will probably open a can of worms

effigies commented 7 months ago
GetAttr = partial(Partial, op.getattr)

Not sure I get this, what is the explicit case in this example, using partial instead of an operator? One of the benefits as I see it is to align with user's expectations, and I have seen multiple new users expect to be able to do some basic operations on lazy fields

This just shortens the boilerplate. If you're going to use this feature a lot, you might not want to use Partial(op.XYZ, ...) all the time.

output_prefix = Add(wf.lzin.subject_name, "_")
json_file = GetAttr(wf.lzin.input_niftix, "json_file")

Incidentally, I now think I'd prefer "Delayed" or something else besides "Partial", since we are waiting for arguments to be ready, not waiting for additional arguments.

I'm not sure whether allowing __getattr__ on methods is a good idea or not. I can see a use case for some string manipulation methods, e.g.

wf.add(
    ATask(
        id=wf.lzin.a_field.replace('-', '_')
        ...

I wonder if you could actually do this 100% by hacking __getattr__ and __call__, and let __add__ and replace come for free. You might end up with Delayed(Delayed(op.getattr, wf.lzin.a_field, replace), '-', '_').

but agree that it will probably open a can of worms

What if instead of doing it implicitly, we at least required people to acknowledge that they're not doing a normal expression: Magic(wf.lzin.a_field).replace('-', '_'). (Could do something like Connect or Delay.)

tclose commented 7 months ago

what about wf.lzin.a_field.delayed_value.replace('-', '_') ?

effigies commented 7 months ago

IMO that's harder to parse and understand where the magic is happening. If you'll accept extra syntax, I don't see why a weird attribute would be better than a weird function call.

tclose commented 7 months ago

Fair enough, would be happy with Delay I think.

This discussion has made me think that calling them "lazy" fields might be slight misnomer, as I would typically expect lazy fields to be populated when I access them, not at some specified later time. Not sure what I would have called them though