nipype / pydra

Pydra Dataflow Engine
https://nipype.github.io/pydra/

Output paths from reader tasks get moved and rewritten #751

Open ghisvail opened 2 months ago

ghisvail commented 2 months ago

I have implemented two tasks which read files from a BIDS dataset, with the following signatures:

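# Imports needed by the signatures below (Pydra 0.x API)
from pathlib import Path

from pydra import Workflow
from pydra.mark import annotate, task

@task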
@annotate({"return": {
    "dataset_description": dict,
    "participant_ids": list[str],
    "session_ids": list[str],
}})
def read_bids_dataset(dataset_path: Path):
    ...

@task
@annotate({"return": {"files": list[Path]}})
def read_bids_files(
    dataset_path: Path,
    participant_id: str,
    session_id: str,
    datatype: str,
    suffix: str,
    extension: str,
):
    ...

# Build workflow composing the two tasks above
def build_bids_reader(bids_queries: dict, **kwargs) -> Workflow:
    ...

If I sequence both tasks manually, I get the list of BIDS files from the source path as expected.

If I compose them in a workflow, I still get the BIDS files, but they have been moved to the workflow working directory and the returned paths point there.

I have never witnessed that behavior before, and believe this may be a regression compared to versions of Pydra prior to 0.23. In my opinion, the results obtained from sequential task execution and from the workflow should be equivalent. Besides, copying the BIDS files can become a big problem if the dataset is huge in terms of the number of participant / session combinations, or if the queried modality involves large volumes of data, such as DWI.

A quick debug session indicates that this area of the code may be the cause.
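
To make the expected equivalence concrete, a check along the following lines could flag returned paths that no longer live under the dataset root. This is only a sketch: `files_outside_dataset` is a hypothetical helper, not part of Pydra, and it assumes the task output is a plain list of paths.

```python
from pathlib import Path


def files_outside_dataset(files, dataset_path):
    """Return the paths in `files` that do not live under `dataset_path`.

    Hypothetical diagnostic helper, not part of Pydra.
    """
    root = Path(dataset_path).resolve()
    outside = []
    for f in files:
        resolved = Path(f).resolve()
        # A path equal to the root or located below it counts as untouched.
        if resolved != root and root not in resolved.parents:
            outside.append(Path(f))
    return outside
```

Running it on the `files` output of the sequential run and of the workflow run should give an empty list in both cases if the source paths are preserved.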

tclose commented 2 months ago

So is the problem that relative paths are treated as being relative to the internal working directory instead of the working directory the workflow is launched from, or are absolute paths also being treated as relative to the internal directory?

ghisvail commented 2 months ago

> So is the problem that relative paths are treated as being relative to the internal working directory instead of the working directory the workflow is launched from, or are absolute paths also being treated as relative to the internal directory?

The former, I believe. Absolute paths should be left untouched, while relative paths (possibly generated by the task) should be turned into absolute paths using the current mechanism that copies them to the task or workflow directory. This way, files are always passed as absolute paths between tasks or workflows, which avoids potentially expensive copies.
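
As a rough illustration of the policy described above, the path decision could look like the following sketch. `normalize_output_path` and `task_dir` are hypothetical names used for illustration only. This is not how Pydra is implemented, and the sketch covers only the choice of path, not the copy step for relative outputs.

```python
from pathlib import Path


def normalize_output_path(path, task_dir):
    """Keep absolute paths as-is; anchor relative paths to `task_dir`.

    Hypothetical policy sketch, not Pydra's actual implementation.
    """
    path = Path(path)
    if path.is_absolute():
        # Absolute paths pass through untouched, so no copy is implied.
        return path
    # Relative paths are interpreted against the task (or workflow) directory.
    return Path(task_dir) / path
```

Under this policy, an absolute path into the source BIDS dataset would reach downstream tasks unchanged, and only files that a task itself produced with relative paths would end up under its working directory.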