nipy / nipype

Workflows and interfaces for neuroimaging packages
https://nipype.readthedocs.org/en/latest/

Simplify connections. #589

Open chrisgorgo opened 11 years ago

chrisgorgo commented 11 years ago

Introduce "natural connections". If there is only one way to connect two nodes (considering the number and types of inputs and outputs), the user should be able to do:

wf.connect(nodeA, nodeB)

This would streamline the definition of simple sequential workflows:

wf.connect([nodeA, nodeB, nodeC, nodeD])
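One way such a chain form could work is to expand the list into pairwise connections behind the scenes. A minimal sketch (the Node and Workflow classes are stand-ins for illustration, not nipype's actual API):

```python
# Hypothetical sketch only: expand a linear chain of nodes into pairwise
# connections. Node, Workflow, and connect_chain are stand-ins, not nipype API.

class Node:
    def __init__(self, name):
        self.name = name

class Workflow:
    def __init__(self):
        self.edges = []  # recorded (source, destination) node names

    def connect_chain(self, nodes):
        # Connect each node to its successor: A->B, B->C, C->D, ...
        for src, dst in zip(nodes, nodes[1:]):
            self.edges.append((src.name, dst.name))

wf = Workflow()
a, b, c, d = (Node(n) for n in "ABCD")
wf.connect_chain([a, b, c, d])
print(wf.edges)  # [('A', 'B'), ('B', 'C'), ('C', 'D')]
```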
satra commented 11 years ago

this is what wheels (the software that daniel built on nipype) did.

https://github.com/FNNDSC/wheels

but in general there are very few use cases where the joining works automatically without decorating the input and output specs further.

take a workflow and try finding the number of times this would be useful - at least in the workflows i use there are few instances.

chrisgorgo commented 11 years ago

@maartenmennes designed a framework out of frustration with how complicated defining workflows was in nipype. Maybe this use case is not so uncommon...

satra commented 11 years ago

i spoke to maarten, and i agree it would be good to take a look at his framework to see what we can learn. but i don't think making connections was the core issue. these were the issues i inferred (maarten can correct me - please do - we want to improve nipype so that you could use it too!):

  • python framework, but maarten is used to shell scripting
  • new input and output names to learn when the flags he knew could be used easily. he had to look up nipype's help every single time.
  • level of parallelism was not necessary for his needs - he could submit a workflow per subject
  • things get hidden when a workflow is created - you cannot read off the set of commands you ran.

i think the following

wf.connect(nodeA, nodeB)

is a tough problem without more extended semantics in the input and output spec.

while i would argue:

wf.connect(nodeA, 'outputname', nodeB, 'inputname')

is easy and explicit but might be considered verbose by some folks.

inati commented 11 years ago

Hi All,

I'm new to nipype, but having spent time with other pipelining frameworks, I would say that the hard part is not really the verbosity of this:

wf.connect(nodeA, 'outputname', nodeB, 'inputname')

For the applications that I'm more familiar with (the image recon side of things) the problems have a different performance curve. For example, if what you are handing off from nodeA to nodeB can fit into memory you are much better off. Ideally you just pass a pointer and then that's super efficient. But then it's hard to go across processes or compute nodes or to and from a GPU because at every node boundary you have to serialize and deserialize. It's ugly.

I think that for the nipype applications the performance is essentially limited by the individual steps in the pipeline. We've basically said that it doesn't make sense to worry about disk space (at least in the scratch sense); we're just going to save all the intermediate results. We're going to use nifti or gifti or whatever at the node boundaries; really, it's up to the pipeline architect to make sure that the output of nodeA is compatible with the input of nodeB. In my opinion this is a good thing. It's a bit harder to write the pipeline because you have to pay attention to these details, but it's way better than having a system that guesses and does it wrong. We're also going to brute-force the parallelization and just find the bits that are "stupidly parallel" and break there. IMHO this is good too.

Anyway, my question is a bit different: given a pipeline, how can I "profile" it to find the bottlenecks?

Cheers, Souheil



maartenmennes commented 11 years ago

Hi Chris,

thanks for starting this discussion. First off, I have to say that my experience with nipype is not from sitting down and trying hard to learn it, but rather from sitting down, staring at it a bit, understanding what it does, but then failing to get to an easy/actual implementation.

Here are a couple of points though:

["reorient","selectvolumes","get_example","motioncorrection_fsl","skullstrip_fsl","grandmeanscaling","get_mask","spatialsmoothing6","bet_examplefunc","bbreg","nuisance_masks", ["highpassfilter", "finalize", "get_mask", "melodic"], ["nr_wm_csf_lin", ["highpassfilter", "finalize", "get_mask", "melodic"]], ["nr_6_wm_csf_lin", ["highpassfilter", "finalize", "get_mask", "melodic"]], ["nr_fw24_wm_csf_lin", ["highpassfilter", ["finalize","get_mask", "melodic"], ["scrubbing_FDJenkinson","finalize","get_mask"]], ["bandpassfilter", ["finalize","get_mask"], ["scrubbing_FDJenkinson","finalize","get_mask"]]], ["nr_fw24_wm_csf_lin_spike", ["highpassfilter", "finalize","get_mask", "melodic"]], ["nr_fr24_wm_csf_lin_spike", ["highpassfilter", "finalize","get_mask"]]]

In theory each node connects to the output of the previous node, but some nodes might get skipped in that chaining process; e.g., the inputs to motioncorrection_fsl are the outputs of selectvolumes and get_example. The output of that workflow is then a bash file that I can easily submit to our local cluster.
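The chaining rule described above (each string consumes the previous node's output, and a sublist branches off the node before it) could be sketched roughly as a recursive expansion. This is an illustrative reading of the spec, not Maarten's actual code:

```python
# Hypothetical sketch of the nested-list chaining rule: strings chain
# sequentially, while a sublist branches off whatever node preceded it.

def expand(spec, parent=None):
    """Return the (source, destination) pairs implied by a nested spec."""
    edges = []
    prev = parent
    for item in spec:
        if isinstance(item, list):
            # A sublist starts a branch from the current chain position.
            edges.extend(expand(item, parent=prev))
        else:
            if prev is not None:
                edges.append((prev, item))
            prev = item
    return edges

print(expand(["reorient", "selectvolumes", ["melodic"], "motioncorrection_fsl"]))
# [('reorient', 'selectvolumes'), ('selectvolumes', 'melodic'),
#  ('selectvolumes', 'motioncorrection_fsl')]
```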

I think we can agree that the above nested list is pretty easy to understand and follow, so yes, I think

wf.connect(nodeA, 'outputname', nodeB, 'inputname')

is too verbose, especially because 'outputname' and 'inputname' are typically variables and thus don't really give you any extra information. You would have to properly trace them to see that you're actually using the correct files.

I think nipype provides uber flexibility - which is great - but in my opinion it might go beyond the point where it's necessary, as in: there is a certain order to processing an image (at least the one we use now). Using some of that intrinsic order might indeed simplify the creation of workflows.

In summary:

  • easier access to the actual commands (when using a complex pipeline, without having to go into each subnode directory) would be points that I would find very attractive
  • easier ways to specify workflows/nodes

Cheers, Maarten


satra commented 11 years ago

thanks @maartenmennes - this is useful.

  • easier access to the actual commands (when using a complex pipeline, without having to go into each subnode directory) would be points that I would find very attractive.

i completely agree with the tracking part. we had hoped other people would build that - but sadly that didn't happen - we even provided a Debug plugin and callbacks on node execution start and stop that could be used for this. this is high priority, but i really want to do this right - our previous attempts, including a crazy web page that could be opened only in firefox, didn't quite work. and various people have created their own.

started a separate thread for this here #590 - and i'll continue there.

  • easier ways to specify workflows/nodes

first to address a misconception - outputname and inputname are not variables, they are named ports. this means they should tell you semantically what output you are connecting to what input. in fact, things are named precisely to avoid checking for the right files.

except in specific cases, i really don't see a way to avoid named connections without creating much richer metadata for inputs and outputs. in your example, i believe you are making very specific assumptions about the processes that you are allowing to be specified in your pipeline and pushing the spec into the process. (btw, in some ways it reminds me of psom :) )

as i pointed to chris earlier, this sounds a bit like the wheels project: https://github.com/FNNDSC/wheels

however, there is a reason why any existing general workflow framework requires named/targeted connections. essentially workflows are graphs not trees, which means nodes may take any number of arbitrary inputs from other nodes. the only way to specify this for n_inputs > 1 is to use both the source node name and the output name (which is why wheels doesn't work if you have a conflict in the output name from two different nodes).
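The ambiguity described here can be made concrete with a toy example (stand-in data structures, not wheels or nipype code): once two upstream nodes expose the same output name, a name-only lookup can no longer identify a unique source, so a connection needs both the node and the port:

```python
# Toy illustration of why name-only joining breaks: two upstream nodes
# expose the same output name, so the port name alone is ambiguous.

upstream = {
    "skullstrip": {"out_file"},
    "smoothing": {"out_file"},  # same output name as skullstrip
}

def candidate_sources(output_name):
    """All upstream nodes that expose the given output name."""
    return [node for node, outs in upstream.items() if output_name in outs]

sources = candidate_sources("out_file")
print(sources)           # ['skullstrip', 'smoothing']
assert len(sources) > 1  # ambiguous: need (node, port), not the port alone
```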

so i'm quite curious to know how in the above spec motioncorrection_fsl knows where to get the two inputs from.

here is a recent connectivity pattern suggested by @ihrke which simplifies the specification while retaining the specificity.

https://github.com/ihrke/nipype_connect_str

i do want to simplify workflow specification (and i do want some amount of verbosity), but i haven't yet found something simpler (other than this new grammar suggested by matthias - which we will probably provide in an upcoming release).

i'm all ears, if anyone has better ideas.

btw, the reduced flexibility approach can be built on top of nipype by creating a linearized spec class - this is what the connectome viewer folks did and in some ways what cpac is doing.

mwaskom commented 11 years ago

"but in general there are very few use cases where the joining works automatically without decorating the input and output specs further."

I'm not sure that's really the case. How many of these lines exist in your workflows?

wf.connect(node_a, "out_file", node_b, "in_file")

It's certainly nice to have flexibility, but we already have a convention for privileged input and output fields. It seems reasonable to extend this to a default connection behavior.

I do think it makes sense not to overload wf.connect() even further. Something like

wf.auto_connect(node_a, node_b)

that expands behind the scenes to

wf.connect(node_a, "out_file", node_b, "in_file")

Seems straightforward. Being a fan of the other syntax for connect, I think this should also be possible:

wf.auto_connect([(node_a, node_b), (node_b, node_c), (node_b, node_d)])
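As a rough sketch of the proposed expansion (stand-in classes, not nipype's implementation; auto_connect here is the hypothetical method under discussion):

```python
# Sketch of the proposed auto_connect: each (a, b) pair expands to an
# explicit ("out_file", "in_file") connection. Stand-in classes only.

class Node:
    def __init__(self, name):
        self.name = name

class Workflow:
    def __init__(self):
        self.connections = []

    def connect(self, src, srcport, dst, dstport):
        self.connections.append((src.name, srcport, dst.name, dstport))

    def auto_connect(self, pairs):
        # Expand behind the scenes using the privileged port names.
        for src, dst in pairs:
            self.connect(src, "out_file", dst, "in_file")

wf = Workflow()
a, b, c = Node("a"), Node("b"), Node("c")
wf.auto_connect([(a, b), (b, c)])
print(wf.connections)
# [('a', 'out_file', 'b', 'in_file'), ('b', 'out_file', 'c', 'in_file')]
```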

PS apologies for accidental closing...hitting tab did not do what I expected it to.

satra commented 11 years ago

I'm not sure that's really the case. How many of these lines exist in your workflows?

wf.connect(node_a, "out_file", node_b, "in_file")

do a grin "out_file" in the examples directory. the fsl-dominant workflows have this pattern - but we support many other packages.

i like auto_connect. but across interfaces, these connections cannot be inferred automatically. i still think the naming convention would need to be cleaned up across interfaces (including fsl) and those privileged inputs and outputs decorated with metadata (primary_output=True, primary_input=True). but even after this the user will need to mix auto_connect and connect.
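A resolution rule along these lines - prefer a port flagged with primary_output/primary_input metadata, otherwise fall back to the out_file/in_file convention - might look roughly like this (plain dicts stand in for nipype trait specs; all names are illustrative):

```python
# Hypothetical port-resolution rule: prefer metadata-flagged ports, fall
# back to the naming convention, otherwise require an explicit connect().

def primary_port(spec, key, fallback):
    """Return the port flagged with key (e.g. 'primary_output'),
    else fall back to a conventional name if present."""
    for name, meta in spec.items():
        if meta.get(key):
            return name
    if fallback in spec:
        return fallback
    raise ValueError("no unambiguous port; use an explicit connect()")

# Illustrative specs: mri_binarize-style output, generic inputs.
outputs = {"binary_file": {"primary_output": True}, "count_file": {}}
inputs = {"in_file": {}, "mask_file": {}}

print(primary_port(outputs, "primary_output", "out_file"))  # binary_file
print(primary_port(inputs, "primary_input", "in_file"))     # in_file
```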

mwaskom commented 11 years ago

+1 for primary_input and primary_output as trait metadata, although fallback to in_file and out_file by default seems reasonable (for things like Function interfaces). Also agree that some effort to move towards standardized in_file and out_file would be good in cases where that makes sense (e.g. mri_binarize -> out_file instead of binary_file). Don't see a huge problem with mixing the two connection styles; if anything it emphasizes the distinction between the main pipe and ancillary ones.

inati commented 11 years ago

It's not just the names. You're making a contract between the nodes as to what the data is going to be. You could say it's a single nifti file. But is it only floats or can I also give you shorts? The node interfaces could be very strongly typed. I'm not a fan of this.

Or maybe I'm confused and you are just saying that you want a convenience function so that "connect" when called with two nodes just hooks up the primary output to the primary input. That seems simple.


mwaskom commented 11 years ago

Your second paragraph is what I meant. I'm not sure I follow the first part, but I guess it doesn't matter.

oesteban commented 10 years ago

The InterfacedWorkflows I'm developing in this PR #882 can work as first suggested.

dmwelch commented 9 years ago

I'd LOVE to see something like this! Just off the cuff, perhaps something as simple as

wf.node2.inputA = wf.node1.outputB

would be more palatable? Admittedly, it doesn't take into account map nodes and iterables... :-(
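For what it's worth, the attribute-assignment syntax could in principle be emulated with __getattr__/__setattr__ hooks. This is an entirely hypothetical sketch, not how nipype nodes actually behave:

```python
# Entirely hypothetical sketch: record a connection when one node's output
# proxy is assigned to another node's input attribute. Not nipype behavior.

class OutputRef:
    def __init__(self, node, port):
        self.node, self.port = node, port

class Workflow:
    def __init__(self):
        self.edges = []  # (src node, src port, dst node, dst port)

class Node:
    def __init__(self, name, wf):
        self.name = name  # plain values pass through __setattr__ untouched
        self.wf = wf

    def __getattr__(self, port):
        # Reading an unknown attribute yields a reference to that output port.
        return OutputRef(self, port)

    def __setattr__(self, port, value):
        if isinstance(value, OutputRef):
            # Assigning an output reference records an edge in the workflow.
            self.wf.edges.append((value.node.name, value.port, self.name, port))
        else:
            object.__setattr__(self, port, value)

wf = Workflow()
node1, node2 = Node("node1", wf), Node("node2", wf)
node2.inputA = node1.outputB
print(wf.edges)  # [('node1', 'outputB', 'node2', 'inputA')]
```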