psychoinformatics-de / datalad-hirni

DataLad extension for (semi-)automated, reproducible processing of (medical/neuro)imaging data
http://datalad.org

Experience report: DICOM->GLM Stats #120

Open mih opened 5 years ago

mih commented 5 years ago

Target: do a full GLM analysis with all the great machinery from the command line. Preferably without intermediate scripting.

Bottom line: all the actually tricky parts of the workflow work well, but quite a few minor issues make it needlessly hard to succeed within a reasonable time frame. Here is the protocol, with comments and links to issues:

study raw data

Fresh dataset

datalad create -c hirni raw

Import DICOM (will create subdatasets inside)

datalad hirni-import-dcm -d raw \
   --anon-subject 001 \
  https://github.com/datalad/example-dicom-structural/archive/master.tar.gz \
  acq1
datalad hirni-import-dcm -d raw \
   --anon-subject 001 \
  https://github.com/datalad/example-dicom-functional/archive/master.tar.gz \
  acq2
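
As a quick sanity check, the freshly created DICOM subdatasets can be listed; there should be one dicoms subdataset per acquisition (exact paths may differ):

datalad subdatasets -d raw -r
# expect (roughly) entries for acq1/dicoms and acq2/dicoms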

Add non-DICOM study data

# one-step alternative via DataLad (kept for reference):
#datalad download-url -d raw -O raw/acq2/events.tsv https://github.com/datalad/example-dicom-functional/raw/master/events.tsv
# register the file content by its URL with git-annex directly
git -C raw/acq2 annex addurl \
  https://github.com/datalad/example-dicom-functional/raw/master/events.tsv \
  --file events.tsv
datalad save -d raw -m "Added stimulation protocol for acquisition 2"

Configure a "converter" for the stimulation protocol, to be used for BIDSification

datalad hirni-spec4anything -d raw \
  acq2/events.tsv \
  --properties '{"procedures": {"procedure-name": "copy-converter", "procedure-call": "bash {script} {{location}} {ds}/sub-{{bids-subject}}/func/sub-{{bids-subject}}_task-{{bids-task}}_run-{{bids-run}}_events.tsv"}, "type": "events_file"}'
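
For readability, here is the same --properties payload pretty-printed. As I understand the templating, {script} and {ds} are expanded when the procedure is invoked, while the double-braced {{...}} placeholders are filled with per-snippet spec values (location, BIDS subject/task/run) at conversion time:

{
  "procedures": {
    "procedure-name": "copy-converter",
    "procedure-call": "bash {script} {{location}} {ds}/sub-{{bids-subject}}/func/sub-{{bids-subject}}_task-{{bids-task}}_run-{{bids-run}}_events.tsv"
  },
  "type": "events_file"
}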

BIDSification

The BIDS-compliant dataset is a fresh dataset

datalad create -c bids bids

that has the study raw dataset linked

datalad -C bids install -d . -r -s ../raw sourcedata

Convert everything to BIDS by selecting all desired studyspecs.

cd bids
datalad hirni-spec2bids -d . --anonymize $(find . -name studyspec.json)
cd ..
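
If all went well, the result is a standard BIDS layout, roughly like this (illustrative; the exact anatomical file name depends on the specs):

bids/
├── sourcedata/          # the linked study raw dataset
└── sub-001/
    ├── anat/
    │   └── sub-001_T1w.nii.gz
    └── func/
        ├── sub-001_task-oneback_run-01_bold.nii.gz
        └── sub-001_task-oneback_run-01_events.tsv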

analysis

Any analysis is also a fresh dataset

datalad create -c yoda analysis
cd analysis

that has the BIDSified data linked as input

datalad install -d . -s ../bids inputs/rawdata

Build an FSL analysis container from a recipe (note: building takes only a fraction of the time a download from Singularity Hub would). Analysis in a container == better chances to reproduce.

Get the container build specifications.

datalad download-url \
  -m 'FSL container spec' \
  -O code/singularity.fsl \
  https://github.com/psychoinformatics-de/datalad-hirni/raw/master/datalad_hirni/tests/resources/fsl-simg-spec

Build the actual container (including a permission fix, so that the image file can be tracked in the dataset).

# singularity doesn't create the target dir
mkdir -p .datalad/environments
# build container image
datalad run \
  -i code/singularity.fsl \
  -o .datalad/environments/fsl.simg \
  "sudo bash -x -c 'singularity build {outputs} {inputs} && sudo chown --reference .datalad {outputs}'"
# register container for datalad
datalad containers-add -d . \
  -i .datalad/environments/fsl.simg \
  --call-fmt 'singularity exec {img} {cmd}' \
  fsl
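
To double-check the registration, the datalad-container extension can list all containers known to the dataset:

datalad containers-list
# should report the 'fsl' container and its image path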

Add the remaining pieces to be able to compute a GLM analysis (a script to convert the stimulation log into FSL EV3 files, and a template analysis configuration).

datalad download-url -m 'Analysis template+code' \
  -O code/ \
  https://github.com/psychoinformatics-de/datalad-hirni/raw/master/datalad_hirni/tests/resources/events2ev3.sh \
  https://github.com/psychoinformatics-de/datalad-hirni/raw/master/datalad_hirni/tests/resources/ffa_design.fsf
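
The downloaded events2ev3.sh turns a BIDS events.tsv into FSL EV3 files (custom 3-column format: onset, duration, weight), one per condition. A hypothetical minimal sketch of such a converter (the shipped script may well differ; the tab-separated column order onset/duration/trial_type is an assumption):

#!/bin/bash
# usage: events2ev3.sh <subject> <events.tsv>
sub="$1"
events="$2"
mkdir -p "${sub}/onsets"
# one 3-column onset file per condition found in the trial_type column
for cond in $(tail -n +2 "$events" | cut -f3 | sort -u); do
  tail -n +2 "$events" \
    | awk -F'\t' -v c="$cond" '$3 == c {print $1, $2, 1}' \
    > "${sub}/onsets/${cond}.txt"
done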

Convert the stimulus log into EV3 files and build an actual analysis configuration for sub-001 from the template.

# build onset files for FSL
datalad run -m "Build FSL EV3 design files" \
  -i inputs/rawdata/sub-001/func/sub-001_task-oneback_run-01_events.tsv \
  -o sub-001/onsets \
  'bash code/events2ev3.sh sub-001 {inputs}'

# build analysis config for FSL
datalad run -m "FSL FEAT analysis config script" \
  -i code/ffa_design.fsf \
  -o sub-001/1stlvl_design.fsf \
  "bash -c 'sed -e \"s,##BASEPATH##,{pwd},g\" -e \"s,##SUB##,sub-001,g\" {inputs} > {outputs}'"

Compute the GLM analysis

datalad containers-run -n fsl -m 'sub-001 1st-level GLM' \
  -i sub-001/1stlvl_design.fsf \
  -i sub-001/onsets -i inputs/rawdata/sub-001/func/sub-001_task-oneback_run-01_bold.nii.gz \
  -o sub-001/1stlvl_glm.feat \
  'feat {inputs[0]}'
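
Like every datalad run/containers-run call above, this leaves a machine-readable record in the commit message, which is exactly what the run-provenance metadata extractor feeds on later. Roughly (abbreviated, illustrative):

git log -1 --format=%B
# [DATALAD RUNCMD] sub-001 1st-level GLM
#
# === Do not change lines below ===
# {
#  "cmd": "singularity exec .datalad/environments/fsl.simg feat {inputs[0]}",
#  "inputs": ["sub-001/1stlvl_design.fsf", ...],
#  "outputs": ["sub-001/1stlvl_glm.feat"],
#  ...
# }
# ^^^ Do not change lines above ^^^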

Inform the dataset that it carries FSL FEAT and run-provenance metadata, and then aggregate (pull out) that metadata.

# this one requires https://github.com/psychoinformatics-de/datalad-hirni/pull/102
git config --file .datalad/config --add datalad.metadata.nativetype metalad_fslfeat
git config --file .datalad/config --add datalad.metadata.nativetype metalad_runprov
datalad save
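
After the two git config calls, .datalad/config carries:

[datalad "metadata"]
	nativetype = metalad_fslfeat
	nativetype = metalad_runprov

With that in place, aggregate the metadata: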

datalad meta-aggregate -r

metadata

A few pointers on where we stand in terms of metadata availability at the analysis end. Below is a dump for thresh_zstat1.nii.gz; i.e., this metadata is immediately bound to the path, and accessible without a query.

datalad -f json_pp meta-dump --reporton files /tmp/analysis/sub-001/1stlvl_glm.feat/thresh_zstat1.nii.gz

(in DataLad's internal data structure)

{
  "action": "meta_dump",
  "datalad_version": "0.12.0rc4.dev109",
  "dsid": "033d1216-85e5-11e9-8257-f0d5bf7b5561",
  "metadata": {
    "metalad_core": {
      "@id": "datalad:MD5E-s21450--5d94b19ed98006bc9b4067a657f3eb0f.nii.gz",
      "contentbytesize": 21450
    },
    "metalad_fslfeat": {
      "@id": "datalad:MD5E-s21450--5d94b19ed98006bc9b4067a657f3eb0f.nii.gz",
      "@type": "prov:Entity",
      "crypto:sha512": "64727e1fda4d09cb89493ef3e40211ce03bc1ba17b4dcc1795e41325f2d7472137bbadeb8365089eb6c411fcaac905c5cd89d57b9bb12bff7e1b207b7f625e94",
      "dc:description": {
        "@id": "datalad:MD5E-s163848--f9ed19715cbb7dc2f84a23c5493863c0.png"
      },
      "dct:format": "image/nifti",
      "nidm:NIDM_0000098": {
        "@id": "niiri:502b7430-ec4e-42d8-8e09-327c60b222db"
      },
      "nidm:NIDM_0000104": {
        "@id": "niiri:429eea90-222a-4083-9f8c-424c02fa9653"
      },
      "prov:atLocation": {
        "@type": "xsd:anyURI",
        "@value": "ExcursionSet.nii.gz"
      },
      "prov:wasGeneratedBy": {
        "@id": "niiri:facb12c4-1b14-41cc-a650-7e9dd9c08528"
      },
      "rdfs:label": "Excursion Set Map"
    },
    "metalad_runprov": {
      "@id": "datalad:MD5E-s21450--5d94b19ed98006bc9b4067a657f3eb0f.nii.gz",
      "@type": "entity",
      "prov:wasGeneratedBy": {
        "@id": "14bd68c839c5254764044d88893abef4d9e5e716"
      }
    }
  },
  "parentds": "/tmp/analysis",
  "path": "/tmp/analysis/sub-001/1stlvl_glm.feat/thresh_zstat1.nii.gz",
  "refcommit": "14bd68c839c5254764044d88893abef4d9e5e716",
  "status": "ok",
  "type": "file"
}

It knows that this file was generated by 14bd68c839c5254764044d88893abef4d9e5e716, and this is what we know about that activity:

datalad -f json_pp meta-dump --reporton datasets | jq '.metadata.metalad_runprov."@graph"| .[] | select(."@id" == "14bd68c839c5254764044d88893abef4d9e5e716")'
{
  "@id": "14bd68c839c5254764044d88893abef4d9e5e716",
  "@type": "activity",
  "atTime": "2019-06-03T12:02:07+02:00",
  "prov:wasAssociatedWith": {
    "@id": "ffa915b768c7d3096081265387bdaa4b"
  },
  "rdfs:comment": "[DATALAD RUNCMD] sub-001 1st-level GLM"
}

Ah, this guy:

datalad -f json_pp meta-dump | jq '.metadata.metalad_runprov."@graph"| .[]? | select(."@id" == "ffa915b768c7d3096081265387bdaa4b")' 
{
  "@id": "ffa915b768c7d3096081265387bdaa4b",
  "@type": "agent",
  "email": "michael.hanke@gmail.com",
  "name": "Michael Hanke"
}

What else did he damage:

datalad meta-dump --reporton jsonld | jq '.[]."@graph"| .[]? | select(."@type" == "activity" and ."prov:wasAssociatedWith"."@id" == "ffa915b768c7d3096081265387bdaa4b")'
{
  "@id": "14bd68c839c5254764044d88893abef4d9e5e716",
  "@type": "activity",
  "atTime": "2019-06-03T12:02:07+02:00",
  "prov:wasAssociatedWith": {
    "@id": "ffa915b768c7d3096081265387bdaa4b"
  },
  "rdfs:comment": "[DATALAD RUNCMD] sub-001 1st-level GLM"
}
{
  "@id": "1dbd4a49474116a6bb0498db5111359f02f1ced9",
  "@type": "activity",
  "atTime": "2019-06-03T11:55:37+02:00",
  "prov:wasAssociatedWith": {
    "@id": "ffa915b768c7d3096081265387bdaa4b"
  },
  "rdfs:comment": "[DATALAD RUNCMD] sudo bash -x -c 'singularity build .data..."
}
{
  "@id": "4b15d5b395af61911a2108c5d0220cca4dd988d6",
  "@type": "activity",
  "atTime": "2019-06-03T11:56:37+02:00",
  "prov:wasAssociatedWith": {
    "@id": "ffa915b768c7d3096081265387bdaa4b"
  },
  "rdfs:comment": "[DATALAD RUNCMD] Build FSL EV3 design files"
}
{
  "@id": "4bee9bbcfee183250c5ac38a42b7722e83f89698",
  "@type": "activity",
  "atTime": "2019-06-03T11:56:43+02:00",
  "prov:wasAssociatedWith": {
    "@id": "ffa915b768c7d3096081265387bdaa4b"
  },
  "rdfs:comment": "[DATALAD RUNCMD] FSL FEAT analysis config script"
}

Needless to say, this is not how querying should work in the end, but the information is in the beast, and it is consolidated across three different metadata sources (datalad-core, PROV on DataLad's own command capture, and NIDM-R). By enriching the information in each one of them alone, one could achieve substantial information-retrieval capabilities.
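
For instance, a compact listing of everything a given agent was associated with is already possible with a throwaway jq query (a sketch, not the envisioned query interface):

datalad meta-dump --reporton jsonld \
  | jq '[.[]."@graph" | .[]? | select(."@type" == "activity" and ."prov:wasAssociatedWith"."@id" == "ffa915b768c7d3096081265387bdaa4b")] | unique_by(."@id")[] | ."rdfs:comment"'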

mih commented 5 years ago

@cmaumet @jbpoline which of the metadata-related bits (last section) should appear on the poster, or rather, how to tweak them to become relevant?

@dkeator NIDM-E would be next on my TODO list, in order to bring proper metadata on the analysis inputs into the picture ... likely post-OHBM