psychoinformatics-de / datalad-hirni

DataLad extension for (semi-)automated, reproducible processing of (medical/neuro)imaging data
http://datalad.org

repeated call of spec4anything for a file creates duplicate entries #179

Open pvavra opened 3 years ago

pvavra commented 3 years ago

As intended, calling datalad hirni-spec4anything to add a new file (a behavioral logfile) to an existing spec creates a new entry of type events_file.

However, simply repeating the same call creates another entry with identical values, i.e. a duplicate.

Instead, it should overwrite the existing settings, like a repeated call to dicom2spec does.

I guess that a simple search by {{location}} should be enough to identify whether an entry already exists in the spec or not, but I am not sure about that.

bpoldrack commented 3 years ago

You can overwrite, but a few things need to be considered.

The problem is that hirni-spec4anything is intended to create not an entire spec, but a single entry (unlike dicom2spec). So, if you have a specfile with multiple entries already, this command needs to somehow figure out when to append a new entry and when to overwrite (and which one to overwrite).

I guess that a simple search by {{location}} should be enough to identify whether an entry already exists in the spec or not, but I am not sure about that.

I don't think location is sufficient. You may want to feed one location into multiple conversion routines.

For the same reason, adding type might be insufficient. So I went with id in addition. Note that, in principle, you are free to choose id values for your entries. Filling it with the SeriesNumber of a DICOM image series via the default rules is just that: a default.

An alternative could be to change the --replace switch to take a value that indicates how to identify the entries to be overwritten, like: datalad hirni-spec4anything ... --replace "field, names, tomatch". What do you think, @pvavra, would something like this make sense to you (and be useful)?

Edit:

In addition, with such a change the default value for --replace could be the current type, location, and id, so that by default spec4anything would overwrite whenever those three match an existing entry.
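For illustration, here is a minimal sketch of that idea (not current hirni behaviour; upsert_snippet is a hypothetical helper name, and only the json_py helpers that hirni's own scripts use are assumed): decide per entry whether to overwrite or append, based on a configurable set of match fields defaulting to type, location and id.

#!/usr/bin/env python

import os.path as op
from datalad.support import json_py

def upsert_snippet(spec_path, new_snippet, match_fields=('type', 'location', 'id')):
    # load existing entries (one JSON dict per line in the specfile), if any
    spec = list(json_py.load_stream(spec_path)) if op.exists(spec_path) else []

    def key(snippet):
        # spec values are either plain or {'approved': ..., 'value': ...} dicts
        vals = []
        for f in match_fields:
            v = snippet.get(f)
            vals.append(v.get('value') if isinstance(v, dict) else v)
        return tuple(vals)

    new_key = key(new_snippet)
    for i, snippet in enumerate(spec):
        if key(snippet) == new_key:
            # an entry with matching fields exists -> overwrite it
            spec[i] = new_snippet
            break
    else:
        # no match -> append as a new entry
        spec.append(new_snippet)

    json_py.dump2stream(spec, spec_path)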

pvavra commented 3 years ago

I missed the --replace flag. I'll look into whether it solves the issue for me.

However, I don't think the current (implicit) design is a good idea, despite seeing where it's coming from. Specifically, for the following reason:

In the current version, hirni does not seem to have an intended workflow for sessions in which no imaging data was collected. This might, for example, happen in multi-session studies where MRI data are acquired only on some days, while additional behavioral data are collected on other days (I'm actively involved in at least one such project at the moment).

Also for those kinds of projects, I would want to first dump all the raw data into sourcedata and keep each acquisition (session) independent of the others by using a new acquisition directory - as a user, this is the way I organize my thinking about the data, namely chronologically, based on acquisition sessions.

However, since no dicom-import can be used to create the skeleton specfile, which would then be amended by spec4anything, this does not work (well) for sessions without MRI data, as far as I can tell. As a workaround, we have previously simply added the behavioral files (in a zipfile) to a later MRI-acquisition session (in addition to that session's behavioral logfiles). Our converter was then able to tell, based on the zipfile (its name or content), which session the data actually came from, and to reformat it according to BIDS.

But this goes against the "one folder per acquisition/session" mentality, which I implicitly assumed when populating sourcedata. My hunch is that the session is also the dominant organizational level at which other researchers tend to think about their (raw) data. Indeed, for the sessions with DICOMs, the implied workflow is to dump all non-DICOM files inside the acquisition folder, parallel to the DICOMs, and to use one specfile for everything.

Now, I understand that one file/location might need to be processed by multiple converters. However, from what I can see, no such default converters currently exist in hirni. That is, none are able to selectively process a portion of a location (but I haven't looked closely), except for the obvious DICOM converter. For all custom converters I write, I can simply create a wrapper which I hook up in the specfile. That wrapper could then simply call the different converters I need in turn.

[In a sense, this is what I already do for the behavioral-logfile converter in my other two-session project: I have a zipfile for both sessions, and my script checks its content and triggers one or the other converter of the Matlab logfiles, depending on what the zipfile contains (e.g. 10 runs on day 1, but only 4 on day 2). But I call only one converter from the specs.]
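A minimal sketch of such a dispatching wrapper, assuming the workflow described above (the script names, run counts, and file extension are placeholders, not actual hirni components):

#!/usr/bin/env python

import sys
import zipfile
import subprocess

def dispatch(zip_path):
    # count the Matlab logfiles in the per-session zipfile
    with zipfile.ZipFile(zip_path) as zf:
        n_runs = sum(1 for name in zf.namelist() if name.endswith('.mat'))
    # e.g. 10 runs on day 1, only 4 runs on day 2
    converter = 'convert_day1.py' if n_runs == 10 else 'convert_day2.py'
    subprocess.run([sys.executable, converter, zip_path], check=True)

if __name__ == '__main__':
    dispatch(sys.argv[1])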

Or have I misunderstood something fundamentally here?

Related to the intended workflow: is anyone actually using spec4anything with multiple converters for one location?

In any case, since the import-dicom call does replace all specs, maybe spec4anything should also default to actively replacing, and add a flag for not replacing instead.

bpoldrack commented 3 years ago

However, since no dicom-import can be used to create the skeleton specfile, which would then be amended by spec4anything, this does not work (well) for sessions without MRI data, as far as I can tell.

It is not clear to me what isn't working (well) for you in such a case. Trying with just a dummy doesn't show any issue to me:

(hirni-dev) ben@tree in ~
❱ datalad create -c hirni some
[INFO   ] Creating a new annex repo at /home/ben/some 
[INFO   ] Scanning for unlocked files (this may take some time) 
[INFO   ] Running procedure cfg_hirni 
[INFO   ] == Command start (output follows) ===== 
[WARNING] Found no aggregated metadata info file /home/ben/some/.datalad/metadata/aggregate_v1.json. You will likely need to either update the dataset from its original location or reaggregate metadata locally.                                                                                
[WARNING] Dataset at . contains no aggregated metadata on this path [metadata(/home/ben/some)] 
[WARNING] Could not obtain dataset metadata, proceeding without 
[INFO   ] Scanning for unlocked files (this may take some time)                                                                                  
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore                                                                            
[INFO   ] == Command exit (modification check follows) =====                                                                                     
Created dataset at /home/ben/some.                                                                                                               
datalad create -c hirni some  3.89s user 2.41s system 68% cpu 9.243 total
(hirni-dev) ben@tree in ~
❱ cd some 
(hirni-dev) ben@tree in ~/some on git:master
❱ ll
total 40
drwxr-xr-x  5 ben ben 4096 Mar 17 10:36 .
drwxr-xr-x 55 ben ben 4096 Mar 17 10:36 ..
drwxr-xr-x  3 ben ben 4096 Mar 17 10:36 code
drwxr-xr-x  2 ben ben 4096 Mar 17 10:36 .datalad
-rw-r--r--  1 ben ben  225 Mar 17 10:36 dataset_description.json
drwxr-xr-x  9 ben ben 4096 Mar 17 10:36 .git
-rw-r--r--  1 ben ben  242 Mar 17 10:36 .gitattributes
-rw-r--r--  1 ben ben  249 Mar 17 10:36 .gitmodules
-rw-r--r--  1 ben ben  249 Mar 17 10:36 README
-rw-r--r--  1 ben ben  830 Mar 17 10:36 studyspec.json
(hirni-dev) ben@tree in ~/some on git:master
❱ mkdir no-dicom-acq
(hirni-dev) ben@tree in ~/some on git:master
❱ touch no-dicom-acq/somefile
(hirni-dev) ben@tree in ~/some on git:master
❱ datalad hirni-spec4anything --properties '{"type": "behavioral", "id": 1}' no-dicom-acq/somefile 
(hirni-dev) ben@tree in ~/some on git:master
❱ cd no-dicom-acq 
(hirni-dev) ben@tree in ~/some/no-dicom-acq on git:master
❱ ll
total 12
drwxr-xr-x 2 ben ben 4096 Mar 17 10:38 .
drwxr-xr-x 6 ben ben 4096 Mar 17 10:37 ..
-rw-r--r-- 1 ben ben    0 Mar 17 10:37 somefile
-rw-r--r-- 1 ben ben  277 Mar 17 10:38 studyspec.json
(hirni-dev) ben@tree in ~/some/no-dicom-acq on git:master
❱ cat studyspec.json 
{"comment":{"approved":false,"value":""},"dataset-id":"21be1eef-79cb-4314-ad76-4da505049c81","dataset-refcommit":"b69e6edf0a13693697a15fba26d1f91d69095d52","id":{"approved":true,"value":1},"location":"somefile","procedures":{"approved":false,"value":null},"type":"behavioral"}
(hirni-dev) ben@tree in ~/some/no-dicom-acq on git:master
❱ 

In any case, since the import-dicom call does replace all specs, maybe spec4anything should also default to actively replacing, and add a flag for not replacing instead.

Need to double-check, but to me that sounds more like import-dicom shouldn't replace everything. After all, one could already have valid specs for other data modalities in the acquisition, and import-dicom shouldn't kill them.

bpoldrack commented 3 years ago

Also for those kinds of projects, I would want to [...] keep each acquisition (session) independent of the others by using a new acquisition directory - as a user, this is the way I organize my thinking about the data, namely chronologically, based on acquisition sessions.

Agree - this is the intended way of thinking about it.

bpoldrack commented 3 years ago

Overall it seems to me that the trouble arises from the (perceived?) need to create a skeleton specfile. There may well be issues with spec4anything and/or import-dicom/dicom2spec, but it's not quite clear to me yet what they are. There should be no need for a "skeleton". A specfile is just a collection of dicts; there's no need for it to contain skeleton pieces. Hence, understanding your trouble with non-DICOM acquisitions seems key to me.
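For example (a minimal illustration using the json_py helper that hirni itself relies on; the path is the one from the transcript above), the entries of a specfile can be read as plain Python dicts:

from datalad.support import json_py

# each line of a studyspec.json is one JSON dict describing one entry
for entry in json_py.load_stream('no-dicom-acq/studyspec.json'):
    print(entry['type'], entry['location'])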

pvavra commented 3 years ago

Ok, I see your "skeleton spec" point. That was a misconception on my side.

I guess the issue I have stems from the following: dicom-tarfiles and any other raw files are treated completely differently by hirni.

Importantly, when the dict of the events file is added, it inherits some parts (e.g. subject) from the preceding entries in the spec.

However, no CLI arguments of spec4anything exist to add those for a "standalone" file. I know that I can forward them with the --properties argument, but I already use that to specify the behavioral converter. From what I understood, I cannot specify multiple JSONs/dicts via that argument (i.e. one constant for all subjects, one mutable one containing the subject id); I would need to stitch them together beforehand. Now, an anything-import function analogous to the import-dicoms one could:

  1. create an acquisition directory, or append to it if already there
  2. let me specify subject and anon-subject fields, in addition to the --properties field
  3. add to an existing spec or create a new one, as needed.

And returning to the original issue of replacing entries:

  1. If an entry for that imported file already existed, it would overwrite that entry, like import-dicom does (i.e. in a smart way).

Then spec4anything could keep doing what it is doing, i.e. not replacing spec entries but appending them. I see the potential use of adding multiple entries for one folder, even though I personally would not use it.

Note that in my case there could be a simple check whether the proposed new entry matches any existing one (the lines are identical in my case, I think, or at least their deterministic parts are), and it would simply not be added in that case.

bpoldrack commented 3 years ago

Ok - I see your convenience points and the relation to what import-dicoms does. I would probably go for an additional command (something like import-anything) instead of fundamentally changing spec4anything, but I need to think that through.

From what I understood, I cannot specify multiple JSONs/dicts via that argument (i.e. one constant for all subjects, one mutable one containing the subject id); I would need to stitch them together beforehand.

Yes, you would need to put the entire JSON dict together instead of specifying single properties separately. I agree that this is not particularly convenient via the command line (although/because it is more flexible).
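For instance, a minimal sketch of such stitching (the field names are taken from this discussion, the values are placeholders):

import json

# constant properties, shared by all subjects (placeholder values)
constant_props = {"type": "events_file"}
# per-subject properties
per_subject = {"subject": "02", "anon-subject": "001", "id": 1}

# merge and print, so the result can be passed to --properties
# (as in the transcript above), e.g. via command substitution in a shell script
print(json.dumps(dict(constant_props, **per_subject)))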

Note, however, that you can generate that dict by any means you have and write it to a file to be passed to --properties, which might be a lot more readable than explicitly passing a JSON structure with all the required quoting on the command line. Also: you don't need to go through spec4anything. You can use anything you want to create/modify such a JSON dict and append it to the specfile. For example, I tend to use Python scripts for editing existing specs:

Here I want to edit specfiles in order not to convert image series that aren't motion corrected (for studyforrest):

$ cat ignore_non_moco.py 
#!/usr/bin/env python

from datalad.support import json_py
import posixpath
from datalad_hirni.support.spec_helpers import (
    has_specval,
    get_specval
)
import sys

def mark_ignored(path):

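    # load existing spec entries (one JSON dict per line), or start with an empty list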
    spec = [r for r in json_py.load_stream(path)] if posixpath.exists(path) else list()

    # ignore non-MoCo series (except angiography => ToF):
    for snippet in spec:
        if snippet["type"] == "dicomseries" and \
          has_specval(snippet, 'description') and \
          "MoCoSeries" not in get_specval(snippet, 'description') and \
          "ToF" not in get_specval(snippet, 'description'):
            print("ignore %s" % get_specval(snippet, 'description'))
            if 'tags' in snippet.keys():
                snippet['tags'].append('hirni-dicom-converter-ignore')
            else:
                snippet['tags'] = ['hirni-dicom-converter-ignore']

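    # write the (possibly modified) entries back to the specfile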
    json_py.dump2stream(spec, path)

if __name__ == '__main__':
    mark_ignored(sys.argv[1])

This is then called from within a shell script (that does more than that, of course) running in the study dataset:

for spec in */studyspec.json; do
    ignore_non_moco.py ${spec}
done
datalad save */studyspec.json -m "Don't convert non-motion corrected dicomseries and aborted ones"

So, if you use spec4anything just to create base snippets specifying location, type, and maybe an id, that kind of further editing may be easier to write, read, and change.

However, as I said above: I see the point and I'll have a closer look at how to change or add a command that covers what you outlined.