RFC: Utility tasks built-in or pulled out

nipype / pydra

Pydra Dataflow Engine

https://nipype.github.io/pydra/

RFC: Utility tasks built-in or pulled out #429

Open effigies opened 3 years ago

effigies commented 3 years ago

A lot of the things in nipype.interfaces.io and nipype.interfaces.utility would be useful to have around. Should we be making a task package for that or bundling directly into pydra.tasks?

satra commented 3 years ago

i think it would be good to discuss which ones and if there are alternative approaches in pydra. and then we can discuss where.

for example, i think rename is being built into the spec and we will want to put datasink also into the spec. identityinterface is no longer required.

effigies commented 3 years ago

Data grabbers and data sinks were the main things I was thinking about. But here's a list:

List operations (utility.base)
- Merge
- Select
- Split
CSV Reader (utility.csv)
Data grabbers (io)
- DataFinder
- DataGrabber
- S3DataGrabber
- SSHDataGrabber
- SelectFiles

XNAT, BIDS, etc make sense not to put directly in pydra of course.

satra commented 3 years ago

thanks. these should be relatively easy to move over, since most are just python functions. we should decide where they should go, e.g. pydra.tasks.core.io/utility, so that core is a package that only pydra provides.
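To illustrate the "just python functions" point, here is a minimal sketch of the three list operations as plain functions. The names `merge_lists`, `select_items`, and `split_list` and their signatures are hypothetical, loosely modelled on the nipype interfaces listed above; they are not an existing pydra API.

```python
from typing import Any, List, Sequence


def merge_lists(*inputs: Any) -> List[Any]:
    """Combine inputs into one flat list; non-list inputs are appended as single items."""
    merged: List[Any] = []
    for item in inputs:
        if isinstance(item, (list, tuple)):
            merged.extend(item)
        else:
            merged.append(item)
    return merged


def select_items(inlist: Sequence[Any], index: Sequence[int]) -> List[Any]:
    """Return the elements of `inlist` at the given positions, in the order requested."""
    return [inlist[i] for i in index]


def split_list(inlist: Sequence[Any], splits: Sequence[int]) -> List[List[Any]]:
    """Cut `inlist` into consecutive chunks whose lengths are given by `splits`."""
    if sum(splits) != len(inlist):
        raise ValueError("sum of splits must equal the length of inlist")
    chunks: List[List[Any]] = []
    start = 0
    for size in splits:
        chunks.append(list(inlist[start:start + size]))
        start += size
    return chunks
```

Each of these could presumably be wrapped as a function task; where such wrappers would live is the open question in this thread.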

djarecka commented 3 years ago

I'm debugging a pydra workflow from pydra-glm-example and I'm thinking about the nipype interface SelectFiles. I believe we should discourage using Nipype1Task and instead just create a FunctionTask until we have pydra.SelectFiles, as suggested here.

I should create some examples. Am I right that SelectFiles is mostly used as a connection from infosource with iterables?

effigies commented 3 years ago

I think it could be easily hooked up with iterables, but I don't know that it's "mostly used" that way. I haven't really used it, so I don't know for sure how others use it, and I've generally avoided iterables, so take that for what it's worth.

satra commented 3 years ago

conceptually selectfiles is just a simple interface for getting data; whether that is connected to infosource or not is up to each workflow creator. the reason why infosource/inputnode (both are identityinterfaces) are used is for dataflow purposes, which should not be required in the context of pydra's design (which makes a workflow a task, and splits can be applied to any inputs).

djarecka commented 3 years ago

but do we want to create pydra.SelectFiles? It's very easy to create a python function and just add a splitter.
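As a rough idea of what such a python function might look like, here is a sketch using only the standard library. The name `select_files`, its parameters, and the template syntax are assumptions made for illustration; wrapping it as a pydra function task and adding the splitter is left out, since the exact calls are not shown in this thread.

```python
import glob
import os
from typing import Dict, List


def select_files(
    base_directory: str, templates: Dict[str, str], **fields: str
) -> Dict[str, List[str]]:
    """Fill each template with `fields` and glob for matches under `base_directory`.

    Hypothetical helper: `templates` maps an output name to a str.format-style
    pattern that may also contain glob wildcards.
    """
    found: Dict[str, List[str]] = {}
    for name, template in templates.items():
        pattern = os.path.join(base_directory, template.format(**fields))
        # Sort so the result does not depend on filesystem listing order.
        found[name] = sorted(glob.glob(pattern))
    return found
```

Calling it once per subject ID, e.g. `select_files(data_dir, {"t1": "sub-{subject}/anat/*_T1w.nii.gz"}, subject="01")`, is the part a splitter over `subject` would take care of.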

satra commented 3 years ago

i think we could have a set of utility functions that are general purpose across many use cases, but only if they are clear and prevent recreating the same code in many different workflows. if you put it in pydra, i would label the tasks as experimental in the sense that they could be moved out.

selectfiles is generally a non-cacheable function since it involves taking a look at a folder that could have changed between runs, and we may not necessarily want to hash the input directory. i think we need to be able to at least indicate that, even if we end up not creating the function. so hashability should also be taken into consideration. perhaps think about how users would create/use such a function and see if it's better to provide one that reduces some of these complications.
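To make the caching concern concrete, here is one possible way a user-side function could detect that a folder changed without hashing file contents. This is only a sketch: `directory_fingerprint` is a hypothetical helper, not something pydra provides, and using size and modification time as a stand-in for content is an assumption that will miss some changes.

```python
import hashlib
import os


def directory_fingerprint(path: str) -> str:
    """Hash the relative path, size, and mtime of every file under `path`.

    File contents are never read, so this is cheap, but an edit that
    preserves both size and mtime would go unnoticed.
    """
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # visit subdirectories in a deterministic order
        for name in sorted(files):
            full = os.path.join(root, name)
            stat = os.stat(full)
            rel = os.path.relpath(full, path)
            entry = f"{rel}\0{stat.st_size}\0{stat.st_mtime_ns}\n"
            digest.update(entry.encode("utf-8"))
    return digest.hexdigest()
```

If a fingerprint like this were part of the task inputs, a cached result would be reused only while the folder listing stays the same. Whether that is desirable, or whether such tasks should simply be marked as non-cacheable, is the design question raised above.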