psychoinformatics-de / datalad-hirni

DataLad extension for (semi-)automated, reproducible processing of (medical/neuro)imaging data
http://datalad.org
Other
5 stars 8 forks source link

RF: prevent storage of metadata in dicom subdatasets #170

Open loj opened 3 years ago

loj commented 3 years ago

I am attempting to remove any metadata aggregation done in the dicom subdatasets. The goal is to keep them "pristine" and store any aggregated metadata in the top-level superdataset.

Currently, there are 2 places where hirni-import dicom aggregates metadata. Once in the function _guess_acquisition_and_move, in order to determine the session/acquisition if not provided (here the metadata is stored in the dicom subdataset), and then again right before calling dicom2spec (here the metadata is stored it in the top-level superdataset).

So far, this PR replaces metadata_aggregate done in _guess_acquisition_and_move with metadata_extract in order to avoid storage of metadata in the dicom subdataset.

Something I haven't figured out yet, is how to avoid the configuration setup done in the dicom subdataset. Currently, when the dicom subdataset is created, hirni configures the dicom metadata extractor.

importds.config.add(                                                        
    var="datalad.metadata.nativetype",                                      
    value="dicom",                                                          
    where="dataset"                                                         
 )                                                                           
importds.config.add(                                                        
    var="datalad.metadata.aggregate-content-dicom",                         
    value='false',                                                          
    where="dataset")                                                        
# TODO: file an issue: config.add can't convert False to 'false' on its own 
# (But vice versa while reading IIRC)                                       

importds.config.add(                                                        
    var="datalad.metadata.maxfieldsize",                                    
    value='10000000',                                                       
    where="dataset")                                                        
importds.save(op.join(".datalad", "config"),                                
              message="[HIRNI] initial config for DICOM metadata")

Ideally, we wouldn't need this configuration and could keep the dicom subdataset "untouched" after the dicom data is extracted. metadata_extract allows me to get around this configuration by declaring the extractor to be executed through sources. But, metadata_aggregate doesn't have an option to declare the extractor type in the command call, and rather requires the extractor to be enabled via the configuration file.

This PR is a step forward as is in that it stores metadata in the top-level superdataset only, but hirni-import-dicom still "dirties" the dicom subdatasets via the metadata extractor configuration. Thoughts @mih and @bpoldrack?

related #117

bpoldrack commented 3 years ago

Thoughts @mih and @bpoldrack?

Technically that change is fine. However, while it's nice if you don't want metadata in the dicom datasets, it's now preventing it altogether. So anyone who does want it there, would need to extract metadata again. Thinking of a setup where those datasets are part of other superdatasets, too (say: collections of everything acquired at a particular site). Now the extraction would need to be done for every superdataset instead of just the aggregation. Extraction can be rather expensive, so if you want to aggregate (or use the metadata as in the acquisition guessing) you wouldn't want to run it several times. So, we need to turn this idea into something that comes with a switch or a configuration.

There'd be a different approach for you, of course. You could not use hirni-import-dcm at all, but create those subdatasets yourself, use add-archive-content to import the dicoms, aggregate metadata to top and finally run hirni-dicom2spec on them.

Ultimately need to split the import command into several steps, that can be called either individually or as a whole via the import command.