Establish studyforrest-data-raw

mih commented 3 years ago

This aims to be a superdataset with a targeted subdataset for each "study". Internally, these studies were called:

7T_ad
pandora
anatomy
fg_eyegaze_raw
3T_av_et
3T_visloc

These names correspond to folders in the original data structure on the cluster. They contain the pristine data artifacts and can never be made public due to data protection regulations.

There are at least two more "raw" datasets (multires3T and multires7T), but their DICOM data are not readily accessible ATM.

bpoldrack commented 3 years ago

I'm working on building this dataset with the subdatasets 7T_ad, pandora and anatomy for now. It is not entirely clear whether and how we want to reflect the notion of phase1. Three options:

1. within this dataset (just a directory?), so it's clear from the hierarchy that 7T_ad, pandora and anatomy are its parts
2. at the level of converted, anonymized BIDS datasets only, as a partial conversion of studyforrest-data-raw (the BIDS dataset would then have those three subdatasets under sourcedata)
3. an intermediate dataset that would then be converted and would be at the "same level" as studyforrest-data-raw, referencing a subset of its subdatasets
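
To make the three options easier to compare, here is a rough sketch of the resulting hierarchies as nested Python dictionaries. The names `phase1` and `phase1-bids` are placeholders invented for illustration, not agreed names; only `sourcedata` and the three subdataset names come from the list above, and the other subdatasets of studyforrest-data-raw are left out for brevity.

```python
# Hypothetical layouts for the three options. "phase1" and "phase1-bids"
# are placeholder names; other subdatasets of the raw superdataset are omitted.
SUBDATASETS = ["7T_ad", "pandora", "anatomy"]


def leaves(names):
    """Turn a list of names into childless tree nodes."""
    return {name: {} for name in names}


OPTIONS = {
    # 1: phase1 is a plain directory inside the raw superdataset
    1: {"studyforrest-data-raw": {"phase1": leaves(SUBDATASETS)}},
    # 2: phase1 only exists as a BIDS dataset with the raw parts under sourcedata
    2: {
        "studyforrest-data-raw": leaves(SUBDATASETS),
        "phase1-bids": {"sourcedata": leaves(SUBDATASETS)},
    },
    # 3: an intermediate raw dataset next to the raw superdataset
    3: {
        "studyforrest-data-raw": leaves(SUBDATASETS),
        "phase1": leaves(SUBDATASETS),
    },
}


def render(tree, indent=0):
    """Return the tree as a list of indented lines, one per node."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * indent + name + "/")
        lines.extend(render(children, indent + 1))
    return lines


if __name__ == "__main__":
    for number, layout in OPTIONS.items():
        print(f"Option {number}")
        print("\n".join(render(layout, 1)))
```

In this picture, options 2 and 3 leave studyforrest-data-raw itself flat, and differ only in whether the phase1 grouping first appears on the raw side or on the BIDS side.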

loj commented 3 years ago

I would lean towards option 2

> at the level of converted, anonymized BIDS datasets only as a partial conversion of studyforrest-data-raw (BIDS dataset would then have those three subdatasets under sourcedata)

At this level, then, we could also maintain the data representation as described in papers in a separate branch. #5

bpoldrack commented 3 years ago

> At this level, then, we could also maintain the data representation as described in papers in a separate branch. #5

True, but that is independent of how we reference the raw data at the level of a notion like phase1.

The "issue" with 2) would be dataset-level files like README, dataset_description.json and so on. The current approach would be to have them in the raw dataset and use a "copy-converter" for the respective BIDS dataset. If we don't have a phase1-raw location (1 or 3), where would those things live? They could, of course, be created/added at the BIDS level only. I'm not sure, though, whether there are things at the phase1 abstraction where this wouldn't work (b/c of anonymization or whatever).

Approach 1 would be a special case for phase1, since other, possibly overlapping superdatasets can't be addressed the same way. So, I lean towards 3) as the most flexible thing that seems likely to generalize as an approach for other subsamples of studyforrest-data-raw. WDYT, @mih ?

bpoldrack commented 3 years ago

Adapted the scripts/approach to build this.

First trial of building the (sub)datasets finished:

/data/project/studyforrest_phase1/pandora
/data/project/studyforrest_phase1/anatomy
/data/project/studyforrest_phase1/7T_ad

Their initial setup was done by /data/project/studyforrest_phase1/build-forrest/studyforrest-data-raw-sh. The actual data import and spec editing were done by the respective build script in each dataset's code/creation.

bpoldrack commented 3 years ago

The three datasets pandora, 7T_ad and anatomy need to be verified as being what we want them to be. That is, they are supposed to capture all relevant raw data of those "studies" (independent of what should be converted in what context). This requires knowledge of what exactly that means. How do we approach this, @mih?

bpoldrack commented 3 years ago

Additionally, I have now created /data/project/studyforrest_phase1/scientific-data-2014-raw, which contains those three as subdatasets, since we wanted publications to be the targets for converted datasets. The first conversion run based on this dataset is currently running in /data/project/studyforrest_phase1/scientific-data-2014-bids.

Adjusting the specs and checking what may be missing from the converted dataset will require some kind of target definition to compare to. Is this supposed to be the release_openfmri1 tag in anondata, or is there something else to base the adjustments on, @mih?
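
If a git tag such as release_openfmri1 does turn out to be the target, one possible way to do the comparison is to diff the file listing of that tag against the listing of the converted dataset. The following is only a sketch under that assumption: the function names are made up for illustration, and it relies on nothing beyond `git ls-tree -r --name-only`. File names with unusual characters would need `-z` handling, which is omitted here.

```python
import subprocess


def list_tree_files(repo_path, treeish):
    """Return the set of file paths recorded in `treeish` (a tag, branch or
    commit) of the git repository at `repo_path`."""
    result = subprocess.run(
        ["git", "-C", repo_path, "ls-tree", "-r", "--name-only", treeish],
        check=True,
        capture_output=True,
        text=True,
    )
    return set(result.stdout.splitlines())


def compare_listings(target, converted):
    """Report paths that are only in the target or only in the conversion."""
    target, converted = set(target), set(converted)
    return {
        "missing": sorted(target - converted),  # in target, not converted
        "extra": sorted(converted - target),    # converted, not in target
    }
```

Usage would be along the lines of `compare_listings(list_tree_files(anondata_path, "release_openfmri1"), list_tree_files(bids_path, "HEAD"))`, where `anondata_path` and `bids_path` stand for the local clones of the two repositories. A purely name-based diff like this says nothing about file content, so it can only flag candidates for spec adjustments.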

bpoldrack commented 3 years ago

Re raw data capturing:

- anatomy looks good as far as I can tell, except for two directories: under /data/project/studyforrest/anatomy/data, two subjects have an orig folder in addition to raw/dicom. The content looks like a conversion result, but I'm not sure. Does this need to be captured, @mih?
- pandora: /data/project/studyforrest/pandora contains logs, pmc.tar.gz and swaroop, which aren't currently captured. What are those, @mih, and are they associated with certain acquisitions? I have an old TODO note claiming I need logs and logs/raw somehow. Not sure what to make of this distinction.
- 7T_ad:
  - The data folder in /data/project/studyforrest/7T_ad has behav subdirectories. I guess they need to be pulled in. Do they require some kind of conversion? Are they just copied into the converted dataset? If so, where? Old note on the issue that I can't fully decode ATM:

    import behav data into first acq per subject from /data/project/studyforrest/7T_ad/ad_data/${sub}* => the same as behav/; Two files are copied to behav/ + two more per subject.
  - Additionally, there's ad_data. What about that?
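
A check like the ones above (spotting entries in an original study folder that the build scripts do not import) could be scripted roughly as follows. This is a minimal sketch, not part of the actual build scripts: the function name is invented, and which names count as "captured" has to be supplied by hand.

```python
from pathlib import Path


def uncaptured_entries(original_dir, captured_names):
    """Return the names directly under `original_dir` that are not listed
    in `captured_names` (top level only, no recursion)."""
    present = {entry.name for entry in Path(original_dir).iterdir()}
    return sorted(present - set(captured_names))


if __name__ == "__main__":
    # Assumption for illustration: only "data" is imported so far.
    for name in uncaptured_entries("/data/project/studyforrest/pandora", {"data"}):
        print(name)
```

This only looks at the top level of the folder, which matches how the leftovers were found here; nested cases such as the orig folders under anatomy/data would need a recursive variant.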

mih commented 3 years ago

OK, I made a first push into this project. It contains the majority of the pieces that are needed to build studyforrest-data-raw or hirni or whatever the name will be -- in the artifact/ directory.

mih commented 3 years ago

@bpoldrack can you please post the link to the generated raw datasets?

bpoldrack commented 3 years ago

@mih

/data/project/studyforrest_phase1/pandora
/data/project/studyforrest_phase1/anatomy
/data/project/studyforrest_phase1/7T_ad

psychoinformatics-de / studyforrest-data

Establish studyforrest-data-raw #34