Syntax changes to enable static type-checking of workflows

nipype / pydra

Pydra Dataflow Engine

https://nipype.github.io/pydra/

Other

120 stars 59 forks source link

Syntax changes to enable static type-checking of workflows #670

Open tclose opened 1 year ago

tclose commented 1 year ago

What would you like changed/added and why?

Static type-checkers, such as mypy, which analyse code at design time (i.e. before it runs), can be very useful in catching errors while you are writing. However, workflow construction in Pydra is currently too magic for them to be of much use and they tend to throw up lots of false positives. Given strong typing is a key feature of Pydra (particularly if #662 is accepted) it seems like it would be disappointing to users if this typing information isn't able to be utilised by static type-checkers like typical code.

While it won't be possible to get type-checkers to parse Pydra tasks/workflows as they are currently, I think we could get them to work with a few minor-ish adaptations along the lines of what I have already proposed in #655.

What would be the benefit?

Pydra workflows could be type-checked at design time, e.g. by editor-integrated linters, giving designers immediate feedback if they have attached the wrong outputs together or have format mismatches between tools they are linking.

I will expand on the changes I think would be necessary/sufficient in the comments below

tclose commented 1 year ago

Given that inputs are passed to the initialisation of the task in the canonical workflow construction pattern, in order for static type-checkers to follow what is happening, the task inputs would need to be attributes of the object added to the workflow.

Therefore, I have been playing around with the idea of merging the input and output specs into a single class, along with the func (function tasks) or executable (shell tasks) attributes, where the output spec is a nested class. For example, following on from the pattern proposed in #655 I was thinking something like


@shell.task("mycmd")
class MyShellTask(SpecBase["MyShellTask.Out"]):

    in_file: Nifti = shell.arg(argstr="--in", position=0)
    a_flag: bool = shell.arg("--aflag", ...)

    @shell.outputs
    class Out:

        out_file: NiftiGz = shell.out(argstr="--out", file_template="{in_file}_out.nii.gz")
        scalar: float = shell.out(callable=...)

in this case @shell.task calls attrs.define under the hood, likewise shell.(arg|out) call attrs.field. The only wrinkle is having to include the parameterised SpecBase base class, eg SpecBase["MyShellTask.Out"], in addition to the decorator, but we could at least check for it inside the decorator.

Function tasks could still be created conveniently using the existing @mark.task syntax, or dynamically using a constructor function as is currently the case. However, if you wanted to support type checking (or use extended attrs functionality) for a Python function that you are wrapping, you could use a similar extended syntax, e.g.


def myfunc(in_file: Nifti, y: float) -> float:
    return x / y

@func.task(myfunc)
class MyFuncTask(SpecBase["MyFuncTask.Out"]):

    x: int = func.arg(allowed_values=[1, 2, 3])
    y: float = func.arg()

    @func.outputs
    class Out:

        z: float = func.out()

NB: @mark.task and dynamic constructors would actually create classes of this form for consistency with the extended syntax.

Tasks defined in this form are then be added to a workflow using the Workflow.add() method as is the case now, with the slight difference that you would be passing an instantiated spec object rather than a fully-fledged task object. The Task object would then be returned by the Workflow.add() method (instead of the workflow self object as is currently the case) and the lzout of this returned object would be used to connect to subsequent tasks e.g.


wf = Workflow(name="my workflow", input_spec=["a_file"])

my_shell = wf.add(
    MyShellTask(
        in_file=wf.lzin.a_file,
        a_flag=True
    )
)

my_func = wf.add(
    MyFuncTask(
        in_file=wf.lzin.a_file,
        y=my_shell.lzout.scalar
    )
)

The reason you would do it like this is so you can set the signature of the Workflow.add() method to be


    def add(self, spec: T[U]) -> TaskBase[T, U]:
        ...

where T and U are used in the Task base to set the types of inputs and lzout respectively.


class TaskBase(Generic[T, U]):

    @property
    def inputs(self) -> T:
        ...

    @property
    def lzout(self) -> U:
        ...

Note that in this case typing lzout as U is actually a white lie as its attributes will actually be LazyField types, not the actual types used by the task (e.g. int, float, str, File, etc...).

tclose commented 1 year ago

Here is a prototype with mock classes that works with mypy (but not pylance unfortunately)

import typing as ty
from pathlib import Path
import inspect
from typing import Any
from copy import copy
from typing_extensions import dataclass_transform
import attrs
from fileformats.generic import File, Directory
from pydra.engine.task import TaskBase, FunctionTask, ShellCommandTask
from pydra.engine.core import Result

OutSpec = ty.TypeVar("OutSpec")

class SpecBase(ty.Generic[OutSpec]):

    __pydra_task_class__: ty.Type[TaskBase]
    __pydra_cmd__: ty.Union[str, ty.Callable]

    def __call__(self, name: ty.Optional[str] = None, **kwargs) -> TaskBase:
        return self.__pydra_task_class__(self.__pydra_cmd__, name=name, inputs=self, **kwargs)

InSpec = ty.TypeVar("InSpec", bound=SpecBase)

@attrs.define
class LazyOut:
    name: str
    spec: type

    def __getattr___(self, field_name):
        try:
            field = self._fields[field_name]
        except KeyError as e:
            raise AttributeError(
                f"Lazy output interface of {self.name} task does not include "
                f"{field_name}"
            ) from e
        return LazyOutField[field.type](self.name, field_name, field.type)

    @property
    def _fields(self):
        return attrs.fields(self.spec)

@attrs.define
class LazyOutField:
    task_name: str
    field_name: str
    type: type

@attrs.define
class LazyIn:
    name: str
    spec: type

    def __getattr___(self, field_name):
        try:
            field = self._fields[field_name]
        except KeyError as e:
            raise AttributeError(
                f"Lazy output interface of {self.name} task does not include "
                f"{field_name}"
            ) from e
        return LazyInField[field.type](self.name, field_name, field.type)

    @property
    def _fields(self):
        return attrs.fields(self.spec)

@attrs.define
class LazyInField:
    task_name: str
    field_name: str
    type: type

@attrs.define(auto_attribs=False)
class Task(ty.Generic[InSpec, OutSpec]):
    inputs: InSpec = attrs.field()
    name: str = attrs.field()

    @name.default
    def name_default(self):
        return type(self).__name__.lower()

    @property
    def lzout(self) -> OutSpec:
        return ty.cast(OutSpec, LazyOut(self.name, self.inputs.__pydra_output_spec__))

@attrs.define
class Workflow:
    name: str
    tasks: ty.List[Task] = attrs.field(factory=list)
    connections: ty.List[ty.Tuple[str, LazyOutField]] = attrs.field(factory=list)
    input_spec: ty.Dict[str, type] = attrs.field(factory=dict)

    def add(
        self, spec: SpecBase[OutSpec], name: ty.Optional[str] = None
    ) -> Task[SpecBase[OutSpec], OutSpec]:
        task = Task[SpecBase[OutSpec], OutSpec](
            spec, **({"name": name} if name else {})
        )
        self.tasks.append(task)
        return task

    def set_output(self, connection):
        self.connections.append(connection)

    @property
    def lzin(self):
        return LazyIn(self.name, self.input_spec)

def shell_arg(
    default=attrs.NOTHING,
    factory=None,
    argstr="",
    position=None,
    help=None,
    converter=None,
    validator=None,
):
    return attrs.field(
        default=default,
        factory=factory,
        converter=converter,
        validator=validator,
        metadata={"argstr": argstr, "position": position, "help_string": help},
    )

def shell_out(
    default=attrs.NOTHING,
    argstr="",
    position=None,
    help=None,
    filename_template=None,
    callable=None,
    converter=None,
    validator=None,
):
    return attrs.field(
        default=default,
        converter=converter,
        validator=validator,
        metadata={
            "argstr": argstr,
            "position": position,
            "output_file_template": filename_template,
            "help_string": help,
            "callable": callable,
        },
    )

@dataclass_transform(kw_only_default=True, field_specifiers=(shell_arg,))
def shell_task(executable: str):
    def decorator(klass):
        klass.__pydra_cmd__ = executable
        klass.__pydra_task_class__ = ShellCommandTask
        klass.__annotations__["__pydra_executable__"] = str
        return attrs.define(kw_only=True, auto_attrib=True, slots=False)(klass)

    return decorator

@dataclass_transform(kw_only_default=True, field_specifiers=(shell_out,))
def shell_outputs(klass):
    return attrs.define(kw_only=True, auto_attrib=True, slots=False)(klass)

def func_arg(
    default=attrs.NOTHING,
    factory=None,
    help=None,
    converter=None,
    validator=None,
):
    return attrs.field(
        default=default,
        converter=converter,
        validator=validator,
        metadata={"factory": factory, "help_string": help},
    )

@dataclass_transform(kw_only_default=True, field_specifiers=(func_arg,))
def func_task(function: ty.Callable):
    def decorator(klass):
        klass.__pydra_cmd__ = staticmethod(function)
        klass.__pydra_task_class__ = FunctionTask
        nested_out_specs = [
            n for n, s in klass.__dict__.items() if hasattr(s, "__pydra_spec_class__")
        ]
        if not nested_out_specs:
            raise AttributeError(f"No nested output specs found in {klass}")
        elif len(nested_out_specs) > 1:
            raise AttributeError(
                f"More than one output specs found in {klass}: {nested_out_specs}"
            )
        klass.__pydra_output_spec__ = nested_out_specs[0]
        klass.__annotations__["__pydra_cmd__"] = ty.Callable
        return attrs.define(kw_only=True, auto_attrib=False, slots=False)(klass)

    return decorator

def func_out(
    default=attrs.NOTHING,
    help: ty.Optional[str] = None,
    converter=None,
    validator=None,
):
    return attrs.field(
        default=default,
        converter=converter,
        validator=validator,
        metadata={
            "help_string": help,
        },
    )

@dataclass_transform(kw_only_default=True, field_specifiers=(func_out,))
def func_outputs(klass):
    return attrs.define(kw_only=True, auto_attrib=True, slots=False)(klass)

tclose commented 1 year ago

You would then define shell specs as psuedo "dataclasses"


@shell_task("mycmd")
class MyShellCmd(SpecBase["MyShellCmd.Out"]):
    in_file: File = shell_arg(argstr="", position=0)
    an_option: str = shell_arg(
        argstr="--opt",
        position=-1,
        help="an option to flag something",
        converter=str,  # Not necessary, just showing off a potential benefit of defining fields as full attrs.fields
    )
    out_file: ty.Optional[ty.Union[Path, str]] = None  # Not necessary as would be automatically inserted from output spec, only included to allow static full type-checking (but output_file_template fields are not typically used in workflows in any case)

    @shell_outputs
    class Out:
        out_file: File = shell_out(
            filename_template="{in_file.stem}_out{in_file.suffix}"
        )
        out_str: str

or alternatively


@shell_outputs
class MyShellOut:
    out_file: File = shell_out(
        filename_template="{in_file.stem}_out{in_file.suffix}"
    )
    out_str: str

@shell_task
class MyShellCmd(Interface[MyShellOut]):

    executable = "cmd"
    in_file: File
    an_option: str = shell_arg(argstr="--opt")

For function specs you could either use the longer form that enables static type-checking, help-strings and attrs converters/validators, i.e.


def func(in_int: int, in_str: str) -> ty.Tuple[int, str]:
    return in_int, in_str

@func_task(func)
class MyFunc(SpecBase["MyFunc.Out"]):
    in_int: int = func_arg(help="a dummy input int", validator=attrs.validators.gt(0))
    in_str: str = func_arg(help="a dummy input str", converter=str)  # converter isn't necessary here, it just means that all inputs will be attempted to be converted

    @func_outputs
    class Out:
        out_int: int
        out_str: str

or stick with the more concise for simple self-explanatory function tasks


@pydra.mark.task(returns=["out_int", "out_str"])
def func(in_int: int, in_str: str) -> ty.Tuple[int, str]:
    return in_int, in_str

tclose commented 1 year ago

THIS IS OUTDATED, SEE SLIGHTLY UPDATED SYNTAX BELOW, https://github.com/nipype/pydra/issues/670#issuecomment-1663204523

which would be used in a workflow like


wf = Workflow(name="myworkflow", input_spec={"in_file": File})

mytask = wf.add(MyFunc(in_int=1, in_str="hi"))

mytask2 = wf.add(
    MyFunc(
        in_int=mytask.lzout.out_int,  # should be ok
        in_str=mytask.lzout.out_int,  # should show up as a mypy error because `in_str` expects a str not an int
    ),
    name="mytask2",  # Tasks can be optionally given a name to differentiate multiple tasks from the same spec class (otherwise defaults to name of spec class) 
)

mytask3 = wf.add(
    MyShellCmd(
        in_file=wf.lzin.in_file,
        an_option=mytask2.lzout.out_str,
        out_file="myfile.txt",
    )
)

wf.set_output(("out_str", mytask2.lzout.out_str))

Note that Workflow.add() returns a newly added task instead of the workflow object. I remember that @ghisvail was in favour of keeping it return the Workflow object so you can chain add ops, but I couldn't think of a way to get the static type-checking to work with this.

tclose commented 1 year ago

The other main syntax difference to the end user (i.e. non interface designers) is that the function/shell wrappers are separated from the task classes, so they don't return tasks to be run, i.e.

spec = MyFunc(in_int=1, in_str="hi")  #  is not a TaskBase object

So I wasn't quite sure how best to support running tasks outside of a workflow.

Option 1

Would it be best to have SpecBase.__call__() return a Task object, i.e.

spec = MyFunc(in_int=1, in_str="hi")  #  is not a Task object
task = spec(name="mytask")  # is a Task object
result = task(plugin="serial")  # run the task object

which would be conceptually clean but more verbose

Option 2

or automatically do steps 2 and 3 under the hood to keep behaviour close to what it is currently, i.e.

spec = MyFunc(in_int=1, in_str="hi")  #  is not a Task object
result = spec(name="mytask", plugin="serial")  # create and run a Task object in one step

The only issue with this second option is that task object gets hidden and is hard to access afterwards, and the TaskBase.split() and TaskBase.combine() also need to be translated to the SpecBase class in a way that they transparently create a task object and split/combine it and then return (thereby cluttering the spec namespace).

satra commented 1 year ago

@tclose - thank you for all the work you have put into #662 and the thoughts here. this latter issue of wrapping functions is important to pydra, where the functions come from any package out there. one should be able to use an arbitrary function with as much ease as a fully decorated one. this intent was supposed to be true of shell commands as well (see the nextflow syntax of wrapping shell scripts).

a goal of pydra was to provide efficient caching and parallelization to any script in eager mode and to any scripter who can put one together. this addressed one change that we made in nipype at the beginning (the separation of interfaces and nodes). in pydra once a task is defined it can be called with any inputs and any split/combine settings. the other intentional change was that a workflow behaves the same way as a task (again not the case in nipype). both allow "calling" them, running in parallel, and caching. and this brings me to all the back and forth we did between dataclasses and attrs and the main culprit: pickling.

here were the principles/features we were considering (i think there is an early issue out there somewhere)

simplicity of use without workflows (get a non python coder to quickly adopt it - this was very important)
reduce boilerplate whenever possible
any computable object is a task (combines interfaces/node/workflow into one concept)
task object reuse in a script just like one makes multiple calls to a function
universal caching (multiple workflows can share the same cache directory, or use multiple caches)
message passing (also becomes a way to capture provenance)
parallelization (split/combine in it's full syntactic form and at different levels with nested stuff)
distributed execution (where pickling comes in, and still not solved for functions in two similar environments)
workflow construction through assignment rather than the connect syntax of nipype
container support (this has evolved into run in environment X, where a container is an environment)

from my perspective the question is not so much whether static type checking or a dataclass like syntax should be there, but more whether the simplest operations can be done in the simplest manner possible so anyone can use it as they would do most functions in a notebook. thus, while thinking about the syntax of things, it may be useful to see how the pydra tutorial would change given the changes proposed and what it would take a "scripter" not a python coder to use pydra easily.

some comments:

regarding returning task vs workflow with .add it's a matter of object use. that task object cannot be executed at that stage (it has lazy fields), which is why the current syntax of input assignment uses the workflow object to refer to outputs of tasks in a workflow. the current syntax would create a dangling object that's outside the scope of the workflow. it may also make people want to assign tasks defined elsewhere.

on the MyShellCmd class above:

str is used twice (once as type annotation and once in the converter arg)
out_file in Out is explicit, whereas in current pydra this would be implicit. i.e. the template would have been added to out_file when first defined and internally it would generate a File type output as a result.
out_file takes in two possible types ( Path or str ). shouldn't it always be Path-like-str? and related to the second note, this was a nipype related enhancement to make it possible for people to put the output in any location, but also enable an implicit output in the working directory of the task. (this feature was class dependent in nipype).

currently any task once defined can be executed with a function call, or in a distributed execution form. there is a fundamental difference in the way a Task behaves currently and what is being proposed here. another reason to consider the tutorial to see if the simplest things are still possible.

tclose commented 1 year ago

Thanks for the detailed feedback @satra!

@tclose - thank you for all the work you have put into #662 and the thoughts here. this latter issue of wrapping functions is important to pydra, where the functions come from any package out there. one should be able to use an arbitrary function with as much ease as a fully decorated one. this intent was supposed to be true of shell commands as well (see the nextflow syntax of wrapping shell scripts).

I had a look at the nextflow syntax, but wasn't quite sure which part you were referring to. In general, I agree that it is a really important feature, as being forced to create new interfaces in Nipype was a big barrier.

I think there is scope to further streamline the use of arbitrary functions/shell commands through helper functions, which I was planning to include as part of this work.

a goal of pydra was to provide efficient caching and parallelization to any script in eager mode and to any scripter who can put one together. this addressed one change that we made in nipype at the beginning (the separation of interfaces and nodes).

The "eager mode" is the only issue that will be impacted on with what I have proposed above, as addressed in my last post above.

in pydra once a task is defined it can be called with any inputs and any split/combine settings. the other intentional change was that a workflow behaves the same way as a task (again not the case in nipype).

I would agree that the unification of the task and workflow behaviour in combination with split/combine is very important as it enables dynamic iteration patterns that weren't possible with Nipype, this was definitely a big selling point.

both allow "calling" them, running in parallel, and caching. and this brings me to all the back and forth we did between dataclasses and attrs and the main culprit: pickling.

I have found the cloudpickle package you guys settled on to be surprisingly robust, is there a use case where it falls over?

here were the principles/features we were considering (i think there is an early issue out there somewhere)

simplicity of use without workflows (get a non python coder to quickly adopt it - this was very important)

Only the issue raised above

reduce boilerplate whenever possible

I would argue the proposed dataclass syntax for task specs feels like a reduction in boilerplate. However, it is perhaps just a different, more standardised, boilerplate. But for coders familiar with dataclasses (or are interested in becoming so), I think it would result in much more readable code.

any computable object is a task (combines interfaces/node/workflow into one concept)

What I am suggesting would split the interfaces and nodes again, but I'm not sure I'm seeing all the drawbacks with this. I would have thought that as long as nodes and workflows are both unified as "tasks" then you would get the benefits you are looking for with it.

task object reuse in a script just like one makes multiple calls to a function

Is this possible currently? Doesn't each task object have it's own state preventing it from being reused?

Initialising parameterised specs would actually facilitate this if I understand what you mean.

universal caching (multiple workflows can share the same cache directory, or use multiple caches)

message passing (also becomes a way to capture provenance)

parallelization (split/combine in it's full syntactic form and at different levels with nested stuff)

distributed execution (where pickling comes in, and still not solved for functions in two similar environments)

workflow construction through assignment rather than the connect syntax of nipype

container support (this has evolved into run in environment X, where a container is an environment)

These features wouldn't be impacted

from my perspective the question is not so much whether static type checking or a dataclass like syntax should be there, but more whether the simplest operations can be done in the simplest manner possible so anyone can use it as they would do most functions in a notebook. thus, while thinking about the syntax of things, it may be useful to see how the pydra tutorial would change given the changes proposed and what it would take a "scripter" not a python coder to use pydra easily.

For minimal impact on the tutorial you could just change Cell 3 in the 2_intro_functiontask notebook (and similarly Cell 2 in 3_intro_functiontask_state) from

task1 = add_var(a=4, b=5)

task1 = add_var(a=4, b=5)()

or perhaps

task1 = add_var(a=4, b=5)(cache_dir="/home/me/Desktop/pydra-work")

to make it a bit more explicit that the second call is configuring the state whereas the first just parameterises it. Alternatively, if this feels too circuitous if you went with Option 2 you could leave Cell 3 the way it is and just change Cell 8 & 10 to refer to the returned result rather than the result attribute of the task (which would now be hidden).

some comments:

regarding returning task vs workflow with .add it's a matter of object use. that task object cannot be executed at that stage (it has lazy fields), which is why the current syntax of input assignment uses the workflow object to refer to outputs of tasks in a workflow. the current syntax would create a dangling object that's outside the scope of the workflow. it may also make people want to assign tasks defined elsewhere.

Isn't this already the case if a task is initialised before it is added, e.g.


wf = Workflow(name="myworkflow", input_spec={"in_file": File})

mytask = MyShellCmd(in_file=wf.lzin.in_file, an_option="hi")

wf.add(mytask)
...

If you wanted to avoid this you could have the Workflow.add() method return just the Lzout interface, e.g.


wf = Workflow(name="myworkflow", input_spec={"in_file": File})

func_lz = wf.add(MyFunc(in_int=1, in_str="hi"))

func2_lz = wf.add(
    MyFunc(
        in_int=func_lz.out_int,
        in_str=func_lz.out_str, 
    ),
    name="mytask2",
)

...

I would be quite attracted to this syntax, as I'm always forgetting to add the .lzout when referring to upstream outputs.

The only issue would be that you couldn't do the split/combine on the same line as the workflow add, as neither the return value or the arg to add() will be the task object. But I think I would find this a bit cleaner too, because I find the line gets very busy when you specify a split/combine and it would be easier to read on a separate line, e.g.


wf = Workflow(name="myworkflow", input_spec={"in_file": File})

func_lz = wf.add(MyFunc(in_str="hi"), name="mytask")
wf.mytask.split(in_int=[1, 2, 3])

...

on the MyShellCmd class above:

str is used twice (once as type annotation and once in the converter arg)

the str in the converter arg is optional (I included it to demonstrate a benefit of the expanded syntax), and changes the behaviour from raising an error if a non-string is passed to accepting any inputs that can be converted to a str (i.e. anything).

out_file in Out is explicit, whereas in current pydra this would be implicit. i.e. the template would have been added to out_file when first defined and internally it would generate a File type output as a result.

My plan to set it up so that out_file is implicitly added to the input spec from the output spec (i.e. the opposite direction than it is currently). This way the output spec could be of File type (or more specific format e.g. NiftiGz), since it will refer to an existing FS object, and the implicitly input field would have type Path as it will refer to a path to be created.

The only reason to explicitly include out_file in the input spec would be to stop the static type-checker complaining. However, that wouldn't be an issue in most workflows as they would typically just leave it write the output to the default template path.

out_file takes in two possible types ( Path or str ). shouldn't it always be Path-like-str? and related to the second note, this was a nipype related enhancement to make it possible for people to put the output in any location, but also enable an implicit output in the working directory of the task. (this feature was class dependent in nipype).

You are right, Path would be best, because strings would be automatically coerced to paths and fail if not path-like. The static type-checker would complain but as above this is not a big issue.

currently any task once defined can be executed with a function call, or in a distributed execution form. there is a fundamental difference in the way a Task behaves currently and what is being proposed here. another reason to consider the tutorial to see if the simplest things are still possible.

I would argue that the proposed changed aren't fundamental, just that the creation of task object is delayed a step from the parameterisation in some way. I agree that it is not ideal, but feel like it is a bit of trade-off to streamline some things down the line.

tclose commented 1 year ago

EDIT: I realised that by splitting parameterisation from configuration, that workflow construction would be creating a light-weight object and Workflow.add(spec: Spec) -> Node method call could add a Node object, which only gets converted into a Task object once the workflow is configured, something like

@attrs.define
class Node(ty.Generic[In, Out]):
    inputs: In  # contains a reference to the task type to instantiate
    lzout: Out
    _splitter: ty.Union[list, tuple, str]
    _workflow: Workflow

    def split(...):
        ...

    def combine(...):
        ...

which would allow us to get very close to the existing syntax (only the return type of add changes from Workflow -> Node)


wf = Workflow(input_spec={"in_files": list[File], "in_int": int}, output_spec={"out_str": str, "out_files": list[File]})  # defining the output spec is not necessary, but will cause the outputs to be type-checked

myfunc = wf.add(
    MyFunc(in_int=wf.lzin.in_int, in_str="hi")
)
if blah:
    myfunc.inputs.in_str = "hello"

myfunc2 = wf.add(
    MyFunc(
        in_int=myfunc.lzout.out_int,  # should be ok
        in_str=myfunc.lzout.out_int,  # should show up as a mypy error because `in_str` expects a str not an int
    ),
    name="myfunc2",  # Tasks can be optionally given a name to differentiate multiple tasks from the same spec class (otherwise defaults to name of spec class) 
)

myshellcmd = wf.add(
    MyShellCmd(
        an_option=myfunc2.lzout.out_str,
        another_option=wf.myfunc2.lzout.out_int,  # can still access via wf.* if preferred
        out_file="myfile.txt",
    )
).split(in_file=wf.lzin.in_files)  # Note method call on the outer parentheses, not inner like current

wf.lzout.out_str = myfunc2.out_str   # This is an independent proposal I have just thrown in, which could be an alternative to set_output
wf.set_output(("out_files", myshellcmd.lzout.out_file))

This would also address @satra 's concern about having dangling tasks that users would think they should be able to run

tclose commented 1 year ago

EDIT: I'm leaning towards the more streamlined execution paradigm, where the task configuration (i.e. cache-locations, environments, etc...) is combined in the execution step as opposed to the 3-stage solution (noting that in the current syntax the configuration is combined with the "parameterisation" stage). The main issue I had with this paradigm was how to implement the split and combine methods, but I think these would work pretty well as kwargs during the configuration/execution stage (NB: they would still be defined using the Node.split() & Node.combine() methods during workflow construction)

Execution process

# Parameterisation of the interface, only contains parameters
myfunc = MyFunc(in_int=1, in_str="hi")

# Configuration & execution of the interface in a single step
result = myfunc(cache_dir="/path/to/cache", environment=Docker("myfuncdocker:1.0"), plugin="cf")

and

# Workflow definition (typically in a different module)
preprocess_workflow = Workflow(input_spec=["in_files", "in_int"])
preprocess_workflow.add(...)
...
preprocess_workflow.set_output(...)

# Workflow parameterisation step
my_preprocess = preprocess_workflow(in_files=["/path/to/file1", "/path/to/file2"], in_int=2)

# Execution step
result = my_preprocess(cache_dir="/path/to/cache", plugin="cf")

In order to be able to access the task instance (not to be confused with the instantiated Interface, e.g. myfunc), it could be set as an attribute of the result, i.e.

print(result.task.output_dir)

If you wanted to split or combine the task/workflow, you would provide the splitter/combiner as a kwarg at execution

alternative_preprocess_workflow = Workflow(input_spec=["in_file", "threshold"])
...

# Workflow parameterisation step
my_preprocess = alternative_preprocess_workflow(threshold=0.5)

result = my_preprocess(plugin="cf", inputs={"in_file": ["/path/to/file1", "/path/to/file2"]}, split="in_file")

You could also allow the following simplified syntax for simple outer-only splits (which would be 99% of cases for splits at the execution stage I imagine)

result = my_preprocess(plugin="cf", split={"in_file": ["/path/to/file1", "/path/to/file2"]})

Alternative 3-stage process

Note that the alternative I was considering was the 3-step process from initialisation->configuration->execution, but think this is a bit too verbose:

myfunc = MyFunc(in_str="hi")
myfunc_task = myfunc(cache_dir="/path/to/cache", environment=Docker("myfuncdocker:1.0"))
myfunc_task.split(in_int=[1, 2, 3])
result = myfunc_task(plugin="cf")

Note for consistency we would probably want to add another layer to configuring workflows, so the workflow above would be executed as

my_preprocess = preprocess_workflow(in_files=["/path/to/file1", "/path/to/file2"])
preprocess_task = my_preprocess(cache_dir="/path/to/cache").split(in_int=[2, 3, 4])
result = preprocess_task(plugin="cf")

tclose commented 1 year ago

Summary of proposed BW-compatibility-breaking syntax changes

Interface design

Described in detail here

@task decorator would be maintained
All task interfaces would become dataclass-like classes (function tasks are currently functions that return FunctionTask instances)
Dynamic construction of task interfaces would be created from JSON/YAML-like objects

Workflow construction

Described in detail here

Workflow.add(Task) -> Workflow signature would become Workflow.add(Interface) -> Node, where Node is a new light-weight placeholder object (i.e. couldn't be executed as a task)
Interfaces don't accept the name or any other kwarg, just the params
- Node names (relative to the workflow) would default to the name of the interface to avoid having to explicitly name the node in addition to the variable it is assigned to.
- Explicit names would need to be provided if the same interface is used twice, but would be passed as a kwarg to the add() method instead.
Split/combines would be performed on Node objects, not Interface objects, e.g. wf.add(MyNode(...)).split(...) instead of wf.add(MyNode(...).split(...))

Task/workflow execution

Described in detail above

The "configuration" of a task (e.g. cache locations & environments) would be moved from the __init__() (leaving it a pure parameterisation) to the execution stage, i.e. __call__()
- Implicitly created "task" instances can be accessed as attribute of results after the execution, e.g. result = interface(); print(result.task.output_dir)
- splitting and combining of the top level object would become kwargs of the __call__ method
Parameterisation of tool "interfaces" results in an Interface instance rather than a Task. Likewise "parameterisation" of workflows would result in a stateless WorkflowParameterisation instance, which would be interchangeable with tool Interface instances.

Pros

Static type-checking and code-completion.
Separates parameterisation namespace from configuration option namespace
- i.e. name, audit_flags, cache_dir, cache_locations, inputs, cont_dim, messenger_args, messengers, rerun, input_spec, output_spec, propagate_rerun
- Would allow us to freely add new configuration options without potentially breaking workflows, i.e. no risk of clashes between the new option and an interface input name someone has defined.
Instantiating "interface" classes returns "less-surprising" dataclass-style object
Workflows become light-weight stateless objects that can be reused at various points in multiple "meta-workflows"
- e.g. I have taken to writing methods that return workflows in order to artificially delay the construction of the workflow so I can use it at multiple points with separate params/states (e.g. format converter workflows).
- You avoid the slightly confusing syntax options you have now where you can define the input_spec of a workflow and provide inputs in a single step, or split the workflow before it is constructed
Avoid having to use "wf." prefix (although nodes could still be accessed that way)
...

Cons

Obscure the task object from the user as an attribute of the result.
Variable names nodes that are assigned to wouldn't necessarily match name of node in workflow (i.e. wf.node-name)
...

effigies commented 1 year ago

Hi Tom, thanks for this thorough and thoughtful proposal. I have an overall gut reaction which is "How are we going to teach this?" I would really like to preserve (or create, where absent) the ability to progress in complexity, instead of making people learn a bunch of Pydra concepts to get going.

I want to think through how a user might progress from a first task or workflow to a fully-fledged one, targeting the kind of well-defined structures you're proposing, but with minimal changes in syntax as each new idea gets added.

I have predictably not been able to get through all of this in the hour I set myself, and my thoughts haven't coalesced into a direct response, but in case you're interested, I've been writing them up here: https://demo.hedgedoc.org/lu5bWq1mRxm14jSH9Is5bA# I'll need more time to figure out how far these are from what you're doing.

Unfortunately, this is a bit of a scrambly time for me, so I'm not sure how much time I can devote to getting through this before next week. I'll try to find a couple more hours.

tclose commented 1 year ago

Hi Tom, thanks for this thorough and thoughtful proposal. I have an overall gut reaction which is "How are we going to teach this?" I would really like to preserve (or create, where absent) the ability to progress in complexity, instead of making people learn a bunch of Pydra concepts to get going.

This sounds like a good idea, I like how you are laying it out as a tutorial in your thoughts. A big part of my motivation for trying to get static type-checking/code-completion to work is to give users some more guard rails when designing workflows.

I want to think through how a user might progress from a first task or workflow to a fully-fledged one, targeting the kind of well-defined structures you're proposing, but with minimal changes in syntax as each new idea gets added.

👍

I have predictably not been able to get through all of this in the hour I set myself, and my thoughts haven't coalesced into a direct response, but in case you're interested, I've been writing them up here: https://demo.hedgedoc.org/lu5bWq1mRxm14jSH9Is5bA# I'll need more time to figure out how far these are from what you're doing.

I think we probably just need to have a preliminary discussion about it next week. We can walk through my thinking and everyone's first thoughts then maybe agree on some general goals/acceptance criteria of any refactor.

Unfortunately, this is a bit of a scrambly time for me, so I'm not sure how much time I can devote to getting through this before next week. I'll try to find a couple more hours.

No worries, I won't really have any time to implement anything beyond what I have already committed to to finish of the typing/hashing for a while, so we can take our time.

ghisvail commented 1 year ago

I have an overall gut reaction which is "How are we going to teach this?"

Same. Also keeping in mind I am giving a lengthy tutorial in a few weeks and I would be quite upset if most of it turn up deprecated within the next releases :-/

A big part of my motivation for trying to get static type-checking/code-completion to work is to give users some more guard rails when designing workflows.

The more I use typing in Python the more I am convinced this is not where the value of it is. If guard rails were that important, we would be using a type safe language in the first place (so definitely not Python).

Just the fact that you can define a record type with dataclass with proper typing for all fields, and yet instantiate this record with anything, is enough evidence to me that spending too much effort on static typing for the sake of type checking / safety is not worth the tradeoff with user friendliness / ease of use of the surface API.

Now, something type annotations proved to be useful for is code generation. For instance, dataclasses automatic derivation of constructor, pretty printer or value equality. attrs validators, typer command-line parsers, and so on. These are definitely a net gain to users who no longer have to write mechanical pieces of boilerplate.

I would really like to preserve (or create, where absent) the ability to progress in complexity, instead of making people learn a bunch of Pydra concepts to get going.

I would very much favor this user-first approach to design. First thing may be to formalize what we assume our typical user should know before using and / or contributing to Pydra. For instance, I did a quick lunch survey at my lab recently and very few people use, know or even care about typing in Python.

Besides, something we should definitely spend a decent amount of efforts on is the sad path. Error messages like AttributeError: 'Input' object has no attribute 'files_hash' or AttributeError: type object 'Workflow' has no attribute 'state' are quite off-putting, especially when the library lets you construct this workflow as if it were a valid one.

Guiding principles such as "make invalid state unrepresentable" and "avoid primitive obsession" maybe help us reach the appropriate level of abstraction. I believe that's where @tclose is leading us towards, I am just not sure betting everything on type checking is the right angle to it.

Happy to discuss it further in the next meeting if I can make it.

tclose commented 1 year ago

I have an overall gut reaction which is "How are we going to teach this?"

Same. Also keeping in mind I am giving a lengthy tutorial in a few weeks and I would be quite upset if most of it turn up deprecated within the next releases :-/

I understand, although feedback from your tutorial series would be the perfect input into this discussion... ;)

A big part of my motivation for trying to get static type-checking/code-completion to work is to give users some more guard rails when designing workflows.

The more I use typing in Python the more I am convinced this is not where the value of it is. If guard rails were that important, we would be using a type safe language in the first place (so definitely not Python).

Perhaps guard-rails is the wrong term, "guide rails" maybe?

I'm imagining a significant fraction of users who won't necessarily write their own tool interfaces but only put-together/tweak-existing workflows, for who code-completion, if not full type-checking, would be a benefit.

Just the fact that you can define a record type with dataclass with proper typing for all fields, and yet instantiate this record with anything, is enough evidence to me that spending too much effort on static typing for the sake of type checking / safety is not worth the tradeoff with user friendliness / ease of use of the surface API.

Note that with the merge of my typing-PR, this won't be the case for Pydra from the next release. If types are specified for task input/output fields and they don't match what is connected to them, Pydra will raise errors (but not if they aren't specified).

Now, something type annotations proved to be useful for is code generation. For instance, dataclasses automatic derivation of constructor, pretty printer or value equality. attrs validators, typer command-line parsers, and so on. These are definitely a net gain to users who no longer have to write mechanical pieces of boilerplate.

I would really like to preserve (or create, where absent) the ability to progress in complexity, instead of making people learn a bunch of Pydra concepts to get going.

I would very much favor this user-first approach to design. First thing may be to formalize what we assume our typical user should know before using and / or contributing to Pydra.

👍 Not wanting to sound too "agilely" but we probably need to think about multiple personas when considering any changes.

For instance, I did a quick lunch survey at my lab recently and very few people use, know or even care about typing in Python.

Since Pydra utilises the typing system to distinguish between file and non-file types, my thinking is that they will need to become familiar with Python's typing syntax in any case, and once they have gone to that effort that they will want to see some more benefits.

Besides, something we should definitely spend a decent amount of efforts on is the sad path. Error messages like AttributeError: 'Input' object has no attribute 'files_hash' or AttributeError: type object 'Workflow' has no attribute 'state' are quite off-putting, especially when the library lets you construct this workflow as if it were a valid one.

Guiding principles such as "make invalid state unrepresentable" and "avoid primitive obsession" maybe help us reach the appropriate level of abstraction. I believe that's where @tclose is leading us towards, I am just not sure betting everything on type checking is the right angle to it.

Another one I would throw in there that has come up a fair bit in my thoughts about this, and recently in practice when working on the Mrtrix3 auto-gen package, is separating user and core namespaces as much as possible. For example, I was trying to work out why the amp2reponse interface wasn't printing out the input image argument to the command line and finally worked out that because it is called "image" in Mrtrix, and translated as such to the auto-gen Pydra interface, it was clashing with the (container) image keyword configuration argument.

Happy to discuss it further in the next meeting if I can make it.

As I said no rush, we can always delay the discussion to a time that suits everyone.

tclose commented 1 year ago

This is an example


# Split out common arguments into separate attrs classes
@attrs.define(kw_only=True, slots=False)
class RecurseOption:
    recurse: bool = shell_arg(
        help_string="Recursively descend the directory tree",
        argstr="-R",
    )

# Define an interface for the `ls` command
@shell_task
class Ls:
    executable = "ls"

    class Inputs(RecurseOption):
        directory: os.PathLike = shell_arg(
            help_string="the directory to list the contents of",
            argstr="",
            mandatory=True,
            position=-1,
        )
        hidden: bool = shell_arg(
            help_string="display hidden FS objects",
            argstr="-a",
            default=False,
        )

    class Outputs:
        entries: list = shell_out(
            help_string="list of entries returned by ls command",
            callable=list_entries,
        )

or alternatively as a function in order to dynamically generate interfaces

Ls = shell_task(
    "Ls",
    executable="ls",
    input_fields={
        "directory": {
            "type": os.PathLike,
            "help_string": "the directory to list the contents of",
            "argstr": "",
            "mandatory": True,
            "position": -1,
        },
        "hidden": {
            "type": bool,
            "help_string": "display hidden FS objects",
            "argstr": "-a",
        },
    output_fields={
        "entries": {
            "type": list,
            "help_string": "list of entries returned by ls command",
            "callable": list_entries,
        }
    },
    inputs_bases=[RecursiveOptions],
)