nipype / pydra

Pydra Dataflow Engine
https://nipype.github.io/pydra/

Type checking and coercion #662

Closed tclose closed 11 months ago

tclose commented 1 year ago

Types of changes

This PR contains a number of breaking changes.

Summary

Type checking

Adds detailed type checking at both run and construction time (i.e. lazy fields carry the type of the field they point to). This is done by pydra.utils.TypeParser, which works as an attrs converter and is set as the converter for every field in the generated input/output specs by default.

The type parser unwraps arbitrary nested generic types (e.g. List, Tuple and Dict in typing package) and tests that they either match or are a subclass of the declared types.
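
For illustration, here is a rough sketch of how the converter might be invoked in isolation. The constructor/call pattern is inferred from its role as an attrs converter, so treat it as an assumption rather than the documented API:

```python
import typing as ty

from pydra.utils.typing import TypeParser

# sketch only: assumes TypeParser is constructed with the declared type
# and called on candidate values, as an attrs converter would be
parser = TypeParser(ty.List[int])

parser([1, 2, 3])    # matches the declared type, returned (coerced) as-is
parser(["a", "b"])   # expected to raise a TypeError: str is not an int
```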

Type coercion

In addition to the type-checking, selected types will be automatically coerced to their declared type.

Like the type-checking, the coercion is performed between types within arbitrarily nested containers, so it can handle coercions such as:

dict[str, list[str]] -> dict[str, tuple[File]]
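
For instance, under the same assumed constructor/call pattern as the sketch above, strings/paths nested inside containers would be coerced to File objects (the example path is purely illustrative):

```python
import typing as ty

from fileformats.generic import File
from pydra.utils.typing import TypeParser

# sketch only: each str inside the nested container is coerced to a File,
# and the inner list is coerced to the declared tuple type
parser = TypeParser(ty.Dict[str, ty.Tuple[File]])
coerced = parser({"t1w": ["/data/sub-01/anat.nii.gz"]})
# -> {"t1w": (File("/data/sub-01/anat.nii.gz"),)}
```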

Hashing

Hashing is implemented (by @effigies) in a dispatch method, which handles hashes for non-ordered objects such as sets in a process-independent way.
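
The gist of the dispatch approach, sketched with functools.singledispatch (an illustration of the idea only, not the actual pydra.utils.hash implementation):

```python
import hashlib
from functools import singledispatch

@singledispatch
def bytes_repr(obj) -> bytes:
    # fallback: use the object's repr as its byte representation
    return repr(obj).encode()

@bytes_repr.register
def _(obj: set) -> bytes:
    # unordered containers: sort the members' digests so the result is
    # independent of iteration order (and of any per-process hash seed)
    digests = sorted(hashlib.md5(bytes_repr(m)).digest() for m in obj)
    return b"set:" + b"".join(digests)

def hash_object(obj) -> str:
    return hashlib.md5(bytes_repr(obj)).hexdigest()

assert hash_object({1, 2, 3}) == hash_object({3, 2, 1})
```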

Hashing of files is now handled outside of Pydra within the fileformats package. This enables the hashes to be aware of changes in associated files.

Copying files

Copying files (e.g. into working directories) is also handled in fileformats, and includes graceful fallback between leaving the files in-place, symlinking, hardlinking and copying files (in order of preference), taking into account whether the files need to be in the same folder/have the same file-stem.
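
A minimal sketch of that fallback chain (the real logic lives in fileformats and weighs extra constraints such as matching file-stems; copy_into and its arguments are hypothetical names):

```python
import os
import shutil
from pathlib import Path

def copy_into(src: Path, dest_dir: Path, must_be_local: bool = False) -> Path:
    if not must_be_local:
        return src  # first preference: leave the file in place
    dest = dest_dir / src.name
    try:
        dest.symlink_to(src)          # next: symlink
    except OSError:
        try:
            os.link(src, dest)        # next: hardlink (same filesystem only)
        except OSError:
            shutil.copy2(src, dest)   # last resort: full copy
    return dest
```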

Checklist

satra commented 1 year ago

@tclose - thanks for all the effort in bringing these different PRs and ideas together. the one thing that struck me was the introduction of gathered. this particular thing seems to be a requirement of type checking rather than readability of code. python isn't really designed to be that kind of language and it seems we are forcing some elements to users here to satisfy a developer constraint. i'm not for or against this at this stage. just wanted to bring it up so that the user perspective is considered.

a key goal of pydra was to keep it simple for users to not have to do things they don't fully understand. one element of this was the idea that pydra can be used in a jupyter notebook or a script without the need to construct a workflow or define an interface. simply used to parallelize elements of a script.

tclose commented 1 year ago

> @tclose - thanks for all the effort in bringing these different PRs and ideas together. the one thing that struck me was the introduction of gathered. this particular thing seems to be a requirement of type checking rather than readability of code.

Yeah, I tried to avoid the need to add the "gathered" syntax as it seems a bit superfluous, but having to check whether a value is actually a list of that type, or a list of lists of that type, etc., just makes the type-checking impractical (unless anyone has any good ideas). I personally think it makes the code a bit easier to understand if values that are to be split are explicitly marked as such, but it's definitely something that would be good to discuss.

Note that the "type-checking" process is actually as much about coercing strings/paths into File objects so we know how to hash them properly as it is about validation. So I think it will be hard to get away from some form of type-parsing (the existing code this PR replaces already had quite a bit of special-case handling to deal with File objects).

> python isn't really designed to be that kind of language and it seems we are forcing some elements to users here to satisfy a developer constraint. i'm not for or against this at this stage. just wanted to bring it up so that the user perspective is considered.

I would perhaps argue that the direction of travel with Python is towards a stricter, type-checked language. Certainly editors like VSCode push you to write code that way, with type-checked linting switched on by default.

On the topic of linting, I have been thinking about what it would take to facilitate static checking and hinting of pydra workflows. I will try to put together an issue summarising what would be required so we can also debate its merits.

> a key goal of pydra was to keep it simple for users to not have to do things they don't fully understand. one element of this was the idea that pydra can be used in a jupyter notebook or a script without the need to construct a workflow or define an interface. simply used to parallelize elements of a script.

Perhaps if it had a better name than gathered it would be more intuitive. I played around with a few but couldn't come up with one that I really liked. Personally, I reckon that by the time users are thinking about using split, they will already be looking up the docs, and in that case having to use gathered wouldn't be too much of an extra burden. But I agree it isn't quite as clean.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 93.65% and project coverage change: +1.07% :tada:

Comparison is base (426564e) 81.77% compared to head (15666a5) 82.84%. Report is 4 commits behind head on master.

:exclamation: Current head 15666a5 differs from pull request most recent head 292fd3f. Consider uploading reports for the commit 292fd3f to get more accurate results

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #662      +/-   ##
==========================================
+ Coverage   81.77%   82.84%   +1.07%
==========================================
  Files          20       22       +2
  Lines        4400     4845     +445
  Branches     1264        0    -1264
==========================================
+ Hits         3598     4014     +416
- Misses        798      831      +33
+ Partials        4        0       -4
```

| Flag | Coverage Δ |
|---|---|
| unittests | `82.84% <93.65%> (+1.07%)` :arrow_up: |

Flags with carried forward coverage won't be shown.

| Files Changed | Coverage Δ |
|---|---|
| pydra/\_\_init\_\_.py | `89.47% <ø> (-1.44%)` :arrow_down: |
| pydra/engine/helpers\_file.py | `92.34% <91.25%> (+6.24%)` :arrow_up: |
| pydra/engine/core.py | `92.72% <91.37%> (-0.40%)` :arrow_down: |
| pydra/utils/hash.py | `92.85% <92.85%> (ø)` |
| pydra/utils/typing.py | `92.97% <92.97%> (ø)` |
| pydra/engine/task.py | `93.65% <93.75%> (+0.07%)` :arrow_up: |
| pydra/engine/helpers.py | `86.59% <95.40%> (+0.54%)` :arrow_up: |
| pydra/engine/specs.py | `94.72% <96.15%> (-0.08%)` :arrow_down: |
| pydra/engine/audit.py | `96.55% <100.00%> (+0.12%)` :arrow_up: |
| pydra/engine/helpers\_state.py | `95.67% <100.00%> (+0.06%)` :arrow_up: |

... and 1 more file, plus 1 file with indirect coverage changes.


djarecka commented 1 year ago

On one hand I don't like introducing something like gathered (not sure what the best name would be, we can think about it); on the other hand I'm not sure whether it decreases readability or actually improves it... It does explicitly point to the argument that should be modified when moving the task to the splitting option, so it might help with debugging.

effigies commented 1 year ago

Thinking a bit about gathered and trying to find a way out. I'm worried that a special type will add confusion when we do something like A -> B -> C where A is mapped over input, B accepts a list, and C again maps over inputs.

I believe the issue here is in:

a = TaskA(x=[1, 2, 3, 4]).split("x")

When x is passed to TaskA(), we don't know whether it is a state-mapping list or a regular list, so we don't know whether to check compatibility with int or list[int]. Is this correct?

What if we made the syntax instead:

a = TaskA().split("x", x=[1, 2, 3, 4]).combine("x")

In this formulation, x does not get set on a, but on a.state. We can put a runtime check that the inner type of the sequence matches the type of a.inputs.x, but it does not need to be explicitly called as a gathered.

Lazy inputs/outputs are not going to be reasonably type-checkable, but we could still build-time check them, and should be able to tell whether inputs are mapped or not. We could use a StateList NewType or subclass to ensure that we unwrap things correctly. .split() would add a layer, while .combine() would remove one.
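
A sketch of what that could look like (StateList and the helper functions here are hypothetical names, just to show the wrapping and unwrapping):

```python
import typing as ty

T = ty.TypeVar("T")

class StateList(ty.List[T]):
    """Marks a sequence whose elements are mapped over by a task's state."""

def split_layer(values: ty.List[T]) -> "StateList[T]":
    return StateList(values)   # .split() would add a layer like this

def combine_layer(values: "StateList[T]") -> ty.List[T]:
    return list(values)        # .combine() would remove one
```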

tclose commented 1 year ago

> Thinking a bit about gathered and trying to find a way out. I'm worried that a special type will add confusion when we do something like A -> B -> C where A is mapped over input, B accepts a list, and C again maps over inputs.

Mmmm, yes I think you are right. The gathered (or StateList) object gets propagated from mapped states and could be passed into B, but then the output of B would be a plain list again.

> I believe the issue here is in:
>
> a = TaskA(x=[1, 2, 3, 4]).split("x")
>
> When x is passed to TaskA(), we don't know whether it is a state-mapping list or a regular list, so we don't know whether to check compatibility with int or list[int]. Is this correct?

Yes

> What if we made the syntax instead:
>
> a = TaskA().split("x", x=[1, 2, 3, 4]).combine("x")
>
> In this formulation, x does not get set on a, but on a.state. We can put a runtime check that the inner type of the sequence matches the type of a.inputs.x, but it does not need to be explicitly called as a gathered.

I think this could work and is quite clear. With LazyFields, would it look like this?

> a = TaskC().split("x", x=wf.b.lzout.out)

> Lazy inputs/outputs are not going to be reasonably type-checkable,

Do you mean statically by linters?

> but we could still build-time check them, and should be able to tell whether inputs are mapped or not. We could use a StateList NewType or subclass to ensure that we unwrap things correctly. .split() would add a layer, while .combine() would remove one.

I believe the PR, which I'm putting the finishing touches on, now does this correctly.

effigies commented 1 year ago

> With LazyFields, would it look like this?
>
> a = TaskC().split("x", x=wf.b.lzout.out)

That would make sense to me. I think TaskC(x=wf.b.lzout.out).split('x') would be problematic.

For a transition, I think we could easily catch cases in

https://github.com/nipype/pydra/blob/426564eca3efa7d840fdca33bdbef6a1c88e2ffc/pydra/engine/core.py#L169-L172

where the input is a list of the expected type and suggest setting the values in split(). This isn't actually a new syntax, but we would need to change the behavior to track these inputs in .state (or some ._stateful_inputs, if we want to keep the .state object itself clean).

> Lazy inputs/outputs are not going to be reasonably type-checkable,
>
> Do you mean statically by linters?

Yes. And at runtime, by the time we do type checking/coercion it will already have been split along the list, whether that's bare list or a subclass/newtype.

tclose commented 1 year ago

> Thinking a bit about gathered and trying to find a way out. I'm worried that a special type will add confusion when we do something like A -> B -> C where A is mapped over input, B accepts a list, and C again maps over inputs.

I'm trying to write a test for this case, but I'm not sure I understand the splitting/combining syntax properly. Is this a valid workflow?

    wf = Workflow(name="test", input_spec=["x", "y"])

    wf.add( # Split over workflow input "x" on "scalar" input
        list_mult_sum(
            in_list=wf.lzin.x,
            scalar=wf.lzin.x,
            name="A",
        ).split("scalar")
    )

    wf.add(  # Workflow is still split over "x"
        list_mult_sum(
            name="B",
            scalar=wf.A.lzout.sum,
            in_list=wf.A.lzout.products,
        )
    )

    wf.add(  # Workflow is combined over "x"
        list_mult_sum(
            name="C",
            scalar=wf.lzin.y,
            in_list=wf.B.lzout.sum,
        ).combine("A.scalar")
    )

    wf.add(  # Workflow is split again, this time over C.products
        list_mult_sum(
            name="D",
            scalar=wf.C.lzout.products,
            in_list=wf.lzin.x,
        ).split("scalar")
    )

    wf.add(  # Workflow is finally combined again into a single node
        list_mult_sum(name="E", scalar=wf.lzin.y, in_list=wf.D.lzout.sum).combine(
            "D.scalar"
        )
    )

    wf.set_output(
        [
            ("alpha", wf.D.lzout.sum),
            ("beta", wf.D.lzout.products)
        ]
    )

where


    import functools
    import operator
    import typing as ty

    from pydra import mark


    @mark.task
    @mark.annotate({"return": {"sum": int, "products": ty.List[int]}})
    def list_mult_sum(scalar: int, in_list: ty.List[int]) -> ty.Tuple[int, ty.List[int]]:
        products = [scalar * x for x in in_list]
        return functools.reduce(operator.add, products, 0), products

djarecka commented 1 year ago

@tclose - I've read the workflow and I think it should work, but it uses something we called an "inner splitter", since it's not really splitting over an input, but over the output of a node (that could be a list).

Does it fail? I could try to run it later

tclose commented 1 year ago

> @tclose - I've read the workflow and I think it should work, but it uses something we called an "inner splitter", since it's not really splitting over an input, but over the output of a node (that could be a list).
>
> Does it fail? I could try to run it later

I worked out that the error was in the combine call, which was meant to combine the inputs to C and therefore should have been placed on node B.

tclose commented 1 year ago

Note that the coverage actually increases by more than 1%, but codecov is confused.

tclose commented 1 year ago

The errors that have just shown up on Ubuntu with Python >= 3.10 appear to be an intermittent failure where the output fields aren't being set properly (i.e. result.output.* of shell tasks is sometimes attr.NOTHING). I believe I have run across it before but haven't been able to pin it down due to its intermittent nature.

For example, testing on Linux ip-10-0-2-164 5.19.0-1027-aws #28~22.04.1-Ubuntu SMP Wed May 31 18:30:36 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux with Python==3.10.6, I don't get any errors.

It appears to be an issue with multiprocessing, as switching test_shell_cmd_inputsspec_11 to use the "serial" plugin instead of "cf" resolves the issue, and test_shell_cmd_outputspec_7 and test_shell_cmd_outputspec_7a only fail with the "cf" test parameterization.

djarecka commented 1 year ago

thanks so much for doing it! will be reviewing!

I understand that right now you have issues "only" with the Singularity GA, yes? Is your comment from yesterday outdated?

edit: looks like the issue with Singularity was temporary, it works now!

djarecka commented 11 months ago

Thank you again for the PR! I think we should merge it if you don't want to add anything more here! During the OHBM hackathon I will try to work on tutorial updates, so if I have any questions or other suggestions we can handle them in a separate PR.

tclose commented 11 months ago

> Thank you again for the PR! I think we should merge it if you don't want to add anything more here! During the OHBM hackathon I will try to work on tutorial updates, so if I have any questions or other suggestions we can handle them in a separate PR.

Great, thanks!

tclose commented 11 months ago

> This is amazing work, Tom! I hope to get through the rest before we talk, but in general I'm really impressed how this has come together in a way that keeps the basic type-free functionality intact.

Thanks 🙏 I'm really chuffed you think so 😊

On the subject of keeping the basic type-free functionality intact, one alteration I have been considering, which would be good to discuss in the meeting, is relaxing the type-checking at construction time (i.e. for lazy fields) so that base classes can be passed to sub-classed type fields, avoiding the need to add cast statements between loosely-typed and tightly-typed tasks. Stricter type-checking/coercion would still be performed at runtime.

For example, if B is a subclass of A, then currently you can connect a B-typed output to an A-typed input, but to connect an A-typed output to a B-typed input you would need to cast it to B first. While this aligns with how traditional type-checking is done, we might want to be a bit more flexible, so that you could connect generic File-typed fields to Nifti fields, for example. If the upstream and downstream nodes both use specific formats that don't match, e.g. MrtrixImage -> Nifti, then we would still raise an error.
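
A sketch of the relaxed construction-time rule being proposed (connectable is a hypothetical helper for illustration, not pydra API):

```python
# allow a connection when either end's type is a subclass of the other;
# only reject genuinely unrelated types (strict checking/coercion would
# still happen at runtime)
def connectable(source: type, target: type) -> bool:
    return issubclass(source, target) or issubclass(target, source)

class File: ...
class Nifti(File): ...
class MrtrixImage(File): ...

assert connectable(Nifti, File)              # subclass -> base: OK
assert connectable(File, Nifti)              # base -> subclass: now also OK
assert not connectable(MrtrixImage, Nifti)   # unrelated formats: error
```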

It would somewhat reduce the effectiveness of the type-checking, but would probably avoid the (false-positive) cases where the type-checking could be annoying.