psychoinformatics-de / datalad-hirni

DataLad extension for (semi-)automated, reproducible processing of (medical/neuro)imaging data
http://datalad.org

FR: batch import dicoms of multiple acquisitions #148

Open pvavra opened 4 years ago

pvavra commented 4 years ago

When importing multiple tarballs (e.g., one per subject), it would be convenient to have a "batch mode" for calling hirni-import-dcm.

I guess how exactly to specify the batch will vary substantially between setups, but even then a simple helper-script template would be convenient.
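Something like the following is what I have in mind - a minimal sketch, assuming one tarball per subject under sourcedata/ and an acquisition ID that can be derived from the file name (both are assumptions; adjust to your naming scheme):

```sh
# Minimal batch-import helper. Assumes one tarball per subject in
# sourcedata/ and derives the acquisition ID from the file name --
# both assumptions, not something hirni prescribes.
for tarball in sourcedata/*.tar.gz; do
    acq=$(basename "$tarball" .tar.gz)
    datalad hirni-import-dcm "$tarball" "$acq"
done
```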

pvavra commented 4 years ago

I've written a simple procedure that tries to achieve the above. It is not particularly robust, but it seems to work for our specific use case.
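For reference, the wiring looks roughly like this (the procedure name is ours, i.e. hypothetical; the script body is essentially the loop from my previous comment):

```sh
# The procedure script lives inside the dataset, e.g. at
# .datalad/procedures/hirni_batch_import.sh (hypothetical name).
# run-procedure hands the dataset path to the script as its first
# argument; the tarballs follow as extra arguments.
datalad run-procedure hirni_batch_import sourcedata/*.tar.gz
```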

pvavra commented 4 years ago

@bpoldrack I also have a conceptual question: running the imports in parallel results in "interleaved" commits (each import seems to generate three separate commits: one for the DICOMs, one for the specs, and one for the updated metadata).

Do you foresee any issues we could run into doing this? Maybe during the metadata aggregation step?

pvavra commented 4 years ago

So, running several imports in parallel doesn't seem to work well.

I noticed two main issues:

- commit messages are "mixed", as ds.save() calls do not use path=..
- it failed to create all studyspec.json files for all acquisitions - this is a major issue

Using a structure like the datalad run --explicit call should make this work in parallel, assuming that no two imports target the same DICOM folder; see the sketch right below.
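To illustrate the semantics I mean (not a literal replacement for hirni-import-dcm; paths and the unpack command are placeholders):

```sh
# With --explicit, datalad run saves only the declared outputs instead
# of the whole dataset state, so parallel invocations cannot sweep up
# each other's intermediate changes. Placeholder paths and command.
datalad run --explicit \
    --input sourcedata/sub-01.tar.gz \
    --output sub-01 \
    "mkdir -p sub-01 && tar -xzf {inputs} -C sub-01"
```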

To ensure the latter, it would be good to have the hirni-import-dcm call handle the submission of jobs to condor itself, instead of relying on the --pbs-runner condor argument. That way, some basic sanity checks could be run over the whole set of imports. Then, the ds.save could use the aforementioned path=.. argument to make sure only the relevant files get added. See the sketch below.
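For the sanity-check part, even something as simple as this would help (pure sketch; again assumes the acquisition ID is derived from the tarball name):

```sh
# Pure sketch: abort before submitting anything if two tarballs map to
# the same target acquisition.
acqs=$(for t in sourcedata/*.tar.gz; do basename "$t" .tar.gz; done)
dupes=$(echo "$acqs" | sort | uniq -d)
if [ -n "$dupes" ]; then
    echo "duplicate target acquisitions: $dupes" >&2
    exit 1
fi

# Later, per finished import, save with an explicit path so unrelated
# parallel changes never end up in the commit:
datalad save -d . -m "import acquisition sub-01" sub-01
```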

bpoldrack commented 4 years ago

> commit messages are "mixed", as ds.save() calls do not use path=..

Agreed. save calls - particularly in the superdataset - should do that. Otherwise they could commit intermediate states of other imports running in parallel.

> Do you foresee any issues we could run into doing this? Maybe during the metadata aggregation step?

Metadata aggregation could run into very similar issues as those save calls. It should "fix itself" with the last run, but I guess it's safer to properly account for that in hirni-import-dcm.
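That is, restricting aggregation to the acquisition that was just imported, roughly like this (sketch; whether paths can be passed this way depends on the datalad version in use):

```sh
# Sketch: aggregate metadata only for the freshly imported acquisition
# instead of the whole superdataset, so parallel imports do not step on
# each other. Option details depend on the datalad version.
datalad aggregate-metadata -d . sub-01
```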

> failed to create all studyspec.json files for all acquisitions - this is a major issue

That's interesting, as I don't immediately see where this issue is emerging from.

Generally, I agree that importing should be easier to parallelize. While we are at it, addressing this should also include allowing the import of several archives into the same acquisition, and supporting the update of an already imported archive (which is currently doable only with more low-level tools).

I'm not quite sure about the condor-related part yet. There might be a better way that makes use of https://github.com/datalad/datalad-htcondor. I need to think that through. Ideally, we come up with something that generalizes beyond condor.
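Just to sketch the direction for the scheduler part (purely hypothetical, nothing decided; submit-file details and the one-job-per-archive mapping are assumptions):

```sh
# Hypothetical: submit one HTCondor job per archive ourselves, rather
# than funnelling everything through --pbs-runner condor. All file
# names are made up.
for tarball in sourcedata/*.tar.gz; do
    acq=$(basename "$tarball" .tar.gz)
    cat > "import_${acq}.sub" <<EOF
executable = $(command -v datalad)
arguments  = hirni-import-dcm $tarball $acq
log        = import_${acq}.log
output     = import_${acq}.out
error      = import_${acq}.err
queue
EOF
    condor_submit "import_${acq}.sub"
done
```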