psychoinformatics-de / studyforrest-data

DataLad superdataset of all studyforrest.org project dataset components
https://studyforrest.org

Establish studyforrest-data-raw #34

Open mih opened 3 years ago

mih commented 3 years ago

Aiming to be a superdataset for targeted subdatasets for each "study". These studies were internally called

These names correspond to folders in the original datastructure on the cluster. They contain the pristine data artifacts and can never be made public, due to data protection regulations.

There are at least two more "raw" datasets (multires3T and multires7T), but their DICOM data are not readily accessible ATM.

bpoldrack commented 3 years ago

I'm working on building this dataset with subdatasets 7T_ad, pandora, anatomy for now. It is not entirely clear whether and how we want to reflect the notion of phase1. Three options:

loj commented 3 years ago

I would lean towards option 2

at the level of converted, anonymized BIDS datasets only as a partial conversion of studyforrest-data-raw (BIDS dataset would then have those three subdatasets under sourcedata)

At this level, then, we could also maintain the data representation as described in papers in a separate branch. #5

bpoldrack commented 3 years ago

At this level, then, we could also maintain the data representation as described in papers in a separate branch. #5

True, but that is independent of how we reference the raw data at the level of a notion like phase1.

The "issue" with 2) would be dataset level files like README, dataset_description.json and so on. Current approach would be to have them in the raw dataset and use a "copy-converter" for the respective BIDS dataset. If we don't have a phase1-raw location (1 or 3), where would those things live? They could, of course, be created/added at the BIDS level only. Not sure whether there are things at the phase1 abstraction, where this wouldn't work (b/c anonymization or whatever), though.
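To make the "copy-converter" idea concrete, here is a minimal sketch of what such a step could look like: it copies dataset-level files verbatim from a raw dataset into the corresponding BIDS dataset. This is not the actual converter used here. The function name `copy_dataset_level_files` and the `DATASET_LEVEL_FILES` tuple are hypothetical, and only README and dataset_description.json are taken from the discussion.

```python
import shutil
from pathlib import Path

# Hypothetical default: only these two file names are mentioned in the thread.
DATASET_LEVEL_FILES = ("README", "dataset_description.json")


def copy_dataset_level_files(raw_root, bids_root, names=DATASET_LEVEL_FILES):
    """Copy dataset-level files verbatim from raw_root into bids_root.

    Returns the names that were actually copied. Files that do not
    exist in raw_root are skipped.
    """
    raw_root, bids_root = Path(raw_root), Path(bids_root)
    bids_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = raw_root / name
        if src.is_file():
            # copy2 also preserves timestamps and permission bits
            shutil.copy2(src, bids_root / name)
            copied.append(name)
    return copied
```

With option 2 there would be no phase1-level raw location to pass as `raw_root`, which is exactly the open question: the files would have to be authored at the BIDS level instead of being copied.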

Approach 1 would be a special case for phase1, since other, possibly overlapping superdatasets can't be addressed the same way. So, I lean towards 3) as the most flexible thing that seems likely to generalize as an approach for other subsamples of studyforrest-data-raw. WDYT, @mih ?

bpoldrack commented 3 years ago

Adapted the scripts/approach to build this.

First trial of building the (sub)datasets finished:

/data/project/studyforrest_phase1/pandora
/data/project/studyforrest_phase1/anatomy
/data/project/studyforrest_phase1/7T_ad

Initial setup of them was done by /data/project/studyforrest_phase1/build-forrest/studyforrest-data-raw-sh. Actual data import + spec editing was done by their respective build script in each dataset's code/creation.

bpoldrack commented 3 years ago

The three datasets pandora, 7T_ad and anatomy need to be verified as being what we want them to be. That is: they are supposed to capture all relevant raw data of those "studies" (independent of what should be converted in what context). This requires knowledge of what exactly that means. How do we approach this, @mih?

bpoldrack commented 3 years ago

Additionally, I have now created /data/project/studyforrest_phase1/scientific-data-2014-raw, which contains those three as subdatasets, since we wanted publications to be the targets for converted datasets. The first conversion run based on this dataset is currently running in /data/project/studyforrest_phase1/scientific-data-2014-bids.

Adjusting the specs and checking what may be missing from the converted dataset will require some kind of target definition to compare to. Is this supposed to be the release_openfmri1 tag in anondata or is there something else to base the adjustments on, @mih?
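Once a target is agreed on, the comparison itself could be a plain file-listing diff. The following is a rough sketch, assuming both the target and the converted dataset are checked out as directory trees. The names `list_files`, `compare_trees` and `IGNORED_DIRS` are made up for illustration, and comparing by relative path only (not by content) is a simplification.

```python
from pathlib import Path

# Assumption: metadata directories are not part of the comparison.
IGNORED_DIRS = {".git", ".datalad"}


def list_files(root):
    """Return the set of relative file paths (POSIX style) below root."""
    root = Path(root)
    files = set()
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if IGNORED_DIRS.intersection(rel.parts):
            continue
        # Symlinks are counted even if broken, so annexed files whose
        # content is not present locally still show up in the listing.
        if path.is_file() or path.is_symlink():
            files.add(rel.as_posix())
    return files


def compare_trees(target_root, converted_root):
    """Return (missing, extra) relative paths of converted vs. target."""
    target = list_files(target_root)
    converted = list_files(converted_root)
    return sorted(target - converted), sorted(converted - target)
```

Here `missing` lists files present in the target but absent from the conversion, and `extra` lists the reverse. Both lists would feed back into adjusting the specs.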

bpoldrack commented 3 years ago

Re raw data capturing:

mih commented 3 years ago

OK, I made a first push into this project. It contains the majority of the pieces that are needed to build studyforrest-data-raw or hirni or whatever the name will be -- in the artifact/ directory.

mih commented 3 years ago

@bpoldrack can you please post the link to the generated raw datasets?

bpoldrack commented 3 years ago

@mih

/data/project/studyforrest_phase1/pandora
/data/project/studyforrest_phase1/anatomy
/data/project/studyforrest_phase1/7T_ad