Current state of the OpenNeuro dataset

This issue is meant to record all aspects related to the current state of the dataset available at OpenNeuro: https://openneuro.org/datasets/ds000113/versions/1.3.0

The dataset has been obtained with:

datalad install https://github.com/OpenNeuroDatasets/ds000113.git

The local copy of the dataset is stored at: /data/project/studyforrest/openneuro.

FTR: With respect to re-converting phase1 and keeping it as close as possible to OpenNeuro, it would be good to double-check the task labels, for example. /data/project/studyforrest/anondata/task_key.txt lists numbered tasks, while previous attempts to re-build phase1 used the labels aomovie and pandora. Probably good to settle on common terms?

Our local copy of the OpenNeuro dataset lives here: /data/project/studyforrest/openneuro/ds000113

Based on the disk usage inspection, I assume that the download was successful:

cynamon@juseless in /data/project/studyforrest/openneuro/ds000113 on git:master
❱ du -h -s
423G    .

To check whether the "S3 bucket error" resulted in any missing content, I ran datalad get again, but this produced no output (suggesting that no content is missing). The "S3 bucket error" is still present and has been reported to the OpenNeuro people (thx, @adswa!).

The goal now is to compare the current state of the OpenNeuro dataset against the two datasets that we are primarily interested in getting back into shape (i.e. the so-called "phase1" and "phase2" datasets).

The location of the OpenNeuro data: /data/project/studyforrest/openneuro/ds000113

The location of the "phase1" data: /data/project/studyforrest/anondata

The location of the "phase2" data: /data/project/studyforrest/phase2

We want to generate lists of what's common and lists of what's unique:

openneuro and anondata common content
openneuro and phase2 common content
content that is in anondata, but not in openneuro (and vice versa)
content that is in phase2, but not in openneuro (and vice versa)

My approach to the problem would be to compare the sha signatures:

For each of the three datasets, generate a sorted sha list, e.g.:

cynamon@juseless in /data/project/studyforrest/openneuro/ds000113 on git:master
❱ find . -type f -print0 | xargs -0 sha1sum | sort > /home/cynamon/openneuro-sha-sorted

Make sure that there are no sha duplicates (unlikely) in any of the three files.
To obtain the common content of any two of the compared datasets, I would use join, e.g.:

join openneuro-sha-sorted anondata-sha-sorted

That would result in a list of the following structure:

sha1 <location1> <location2>

0007c0b19bfafa1c2a731b56b60821b8d3c857b7 ./.git/annex/objects/3K/GF/MD5E-s14734--dbbd13a005405696670bda336f96fc99.txt/MD5E-s14734--dbbd13a005405696670bda336f96fc99.txt ./.git/annex/objects/2z/0w/SHA256E-s14734--0204eaef07f3b1658af1fdc7427e1d1863bd50d164dc07a678220bfaa2338ed2.txt/SHA256E-s14734--0204eaef07f3b1658af1fdc7427e1d1863bd50d164dc07a678220bfaa2338ed2.txt
004f0b7de99745faf9a8feecce336df579ec47a7 ./.git/annex/objects/Xw/G1/MD5E-s176248186--31fc064ca9db4b4205d6b0dcc5d98ad4.nii.gz/MD5E-s176248186--31fc064ca9db4b4205d6b0dcc5d98ad4.nii.gz ./.git/annex/objects/5z/xp/SHA256E-s176248186--9dc89b01eaf861b077f3aa3bfa5718114ba6286aa853d39fc18059f3b0fc971e.nii.gz/SHA256E-s176248186--9dc89b01eaf861b077f3aa3bfa5718114ba6286aa853d39fc18059f3b0fc971e.nii.gz
0067621cd20d3579f5a472dfa5acff0a9648ab32 ./.git/annex/objects/V6/qF/MD5E-s625777885--f8fbb1b93a4b808fe43a7c788e60e345.nii.gz/MD5E-s625777885--f8fbb1b93a4b808fe43a7c788e60e345.nii.gz ./.git/annex/objects/w9/f5/SHA256E-s625777885--8c69b107ed9ff6a3bf710d050fa34c47ef6a6412a2e0bd3d8e8bcd6fcde1f863.nii.gz/SHA256E-s625777885--8c69b107ed9ff6a3bf710d050fa34c47ef6a6412a2e0bd3d8e8bcd6fcde1f863.nii.gz

To obtain the unique content for any dataset compared against another dataset, I would use comm. I would probably need to cut off the file locations first, compare the sha signatures only, and then glue the file locations back on.
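The join/comm idea can also be sketched in Python. This is only a minimal sketch, assuming that each input holds sha1sum-style lines (checksum, whitespace, path); the names load_checksums and compare are made up for illustration. Because every checksum keeps the full list of paths that carry it, duplicate content within one dataset does not have to be ruled out first.

```python
from collections import defaultdict


def load_checksums(lines):
    """Map each checksum to the list of paths that carry it.

    Assumes sha1sum-style lines: '<checksum>  <path>'.
    """
    by_sum = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first run of whitespace; the rest is the path.
        checksum, path = line.split(None, 1)
        by_sum[checksum].append(path)
    return by_sum


def compare(a, b):
    """Return (common, only_in_a, only_in_b) as sorted lists of checksums."""
    common = sorted(a.keys() & b.keys())
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    return common, only_a, only_b
```

With the sorted lists from above, one would call load_checksums on the opened openneuro-sha-sorted and anondata-sha-sorted files and look up the paths of each common checksum in both mappings to get the "sha1 <location1> <location2>" structure.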

Looks sane to me :)

And I'm very much interested in this:

That would result in a list of the following structure: sha1 <location1> <location2>

for comparing to fresh conversion.

I noticed something confusing to me:

❱ cat /data/project/studyforrest/openneuro/ds000113/recording-cardresp_physio.json
{
    "SamplingFrequency": 500.0, 
    "StartTime": 0.0, 
    "Columns": [
        "trigger", 
        "cardiac", 
        "respiratory"
    ], 
    "ContentDescription": "Activity recorded with a pulse oximeter (cardiac) and respiration belt."
}

This is the only sidecar file I could find that refers to recording-cardresp_physio. cardresp is the label "we" used for phase1. However, all I have there are sampling frequencies of either 100 or 200, not 500. What's up with that, @mih?
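To double-check which physio sidecars exist and what sampling frequency each one declares, something along these lines might help. It is a sketch under assumptions: the function name physio_sampling_frequencies is hypothetical, sidecars are assumed to be named *_physio.json, and their content must be present locally (not a dangling annex link).

```python
import json
from pathlib import Path


def physio_sampling_frequencies(dataset_root):
    """Map each *_physio.json sidecar (relative path) to its SamplingFrequency."""
    root = Path(dataset_root)
    found = {}
    for sidecar in sorted(root.rglob("*_physio.json")):
        rel = sidecar.relative_to(root)
        if ".git" in rel.parts:
            continue  # skip git/git-annex internals
        with open(sidecar) as fh:
            meta = json.load(fh)
        # None means the sidecar does not declare a sampling frequency.
        found[str(rel)] = meta.get("SamplingFrequency")
    return found
```

Running it on /data/project/studyforrest/openneuro/ds000113 would list every sidecar found together with its declared frequency, which makes it easy to see whether the top-level file is really the only one.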

The final approach (proposed by @mih) to compare the datasets in a more human-readable fashion is the following.

We first generate a list of md5 sums in the following way:

md5sum $(git annex find) | tee /tmp/openneuro.md5
md5sum $(git annex find --branch release_openfmri1) | tee /tmp/openfmri.md5

We then compare the resulting output files with a Python script /data/project/studyforrest/openneuro/match.py, created by @mih:

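import sys

# Map each checksum to the list of "<md5 file>::<path>" records that carry it.
d = {}

for f in sys.argv[1:]:
    with open(f) as md5_list:
        for line in md5_list:
            line = line.rstrip('\n')
            # md5sum puts two characters between checksum and path, so splitting
            # on the first space leaves one leading character on the path.
            checksum, path = line.split(" ", 1)
            d.setdefault(checksum, []).append('{}::{}'.format(f, path))

# One line per checksum, listing every place the content was found.
for k, v in d.items():
    print('{}: {}'.format(k, v))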

To run the script: python match.py openneuro.md5 openfmri.md5

The resulting output looks as follows:

cynamon@juseless in /data/project/studyforrest/openneuro
❱ head -3 latestopenneuro_vs_openfmrirelease1.txt
a4ca2772ab82a4d422afac2898f34af6: ['/tmp/openfmri.md5:: acquisition_protocols/04-sT1W_3D_TFE_TR2300_TI900_0_7iso_FS.txt', '/tmp/openneuro.md5:: sourcedata/acquisition_protocols/04-sT1W_3D_TFE_TR2300_TI900_0_7iso_FS.txt']
fc4bee45d4ae95fe65e7add53413eda4: ['/tmp/openfmri.md5:: acquisition_protocols/05-sT2W_3D_TSE_32chSHC_0_7iso.txt', '/tmp/openneuro.md5:: sourcedata/acquisition_protocols/05-sT2W_3D_TSE_32chSHC_0_7iso.txt']
d54a120f35975605801f8a1d0dddca2d: ['/tmp/openfmri.md5:: acquisition_protocols/06-VEN_BOLD_HR_32chSHC.txt', '/tmp/openneuro.md5:: sourcedata/acquisition_protocols/06-VEN_BOLD_HR_32chSHC.txt']

The output can then be searched for md5 entries that:

are present in both compared datasets:

❱ cat latestopenneuro_vs_openfmrirelease1.txt | grep openfmri | grep openneuro | wc -l
1391

are present only in one dataset (e.g. openfmri), but not in the other (e.g. openneuro):

❱ cat latestopenneuro_vs_openfmrirelease1.txt | grep openfmri | grep -v openneuro | wc -l   
7035
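The same counts could also be taken from the dictionary itself instead of grepping the printed text. A minimal sketch, assuming the record structure that match.py builds in d (checksum mapped to a list of '<listfile>::<path>' strings); the function name classify is made up. Since grep matches the dataset name anywhere in a line, including inside a path, counting by list file is a bit more robust.

```python
from collections import Counter


def classify(records):
    """Count checksums by the set of md5 list files they occur in.

    `records` maps checksum -> list of '<listfile>::<path>' strings.
    """
    counts = Counter()
    for entries in records.values():
        # Everything before the first '::' is the list file the entry came from.
        sources = frozenset(entry.split("::", 1)[0] for entry in entries)
        counts[sources] += 1
    return counts
```

The count for frozenset({'/tmp/openfmri.md5', '/tmp/openneuro.md5'}) then corresponds to content present in both datasets, and the count for frozenset({'/tmp/openfmri.md5'}) to content present in openfmri only.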

There are some files missing in the OpenNeuro dataset: subject 5 has only 2 (instead of 8) physio files for the auditoryperception/pandora task:

/data/project/studyforrest/openneuro/ds000113/sub-05/ses-auditoryperception/func on git:master
❱ ls
sub-05_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz    sub-05_ses-auditoryperception_task-auditoryperception_run-03_events.tsv   sub-05_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-05_ses-auditoryperception_task-auditoryperception_run-01_events.tsv     sub-05_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz  sub-05_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-05_ses-auditoryperception_task-auditoryperception_run-01_physio.tsv.gz  sub-05_ses-auditoryperception_task-auditoryperception_run-04_events.tsv   sub-05_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-05_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz    sub-05_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz  sub-05_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-05_ses-auditoryperception_task-auditoryperception_run-02_events.tsv     sub-05_ses-auditoryperception_task-auditoryperception_run-05_events.tsv   
sub-05_ses-auditoryperception_task-auditoryperception_run-02_physio.tsv.gz  sub-05_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz
sub-05_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz    sub-05_ses-auditoryperception_task-auditoryperception_run-06_events.tsv

There are no physio files for subject 7:

adina@juseless in /data/project/studyforrest/openneuro/ds000113/sub-07/ses-auditoryperception/func on git:master
❱ ls
sub-07_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz  sub-07_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz  sub-07_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-07_ses-auditoryperception_task-auditoryperception_run-01_events.tsv   sub-07_ses-auditoryperception_task-auditoryperception_run-04_events.tsv   sub-07_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-07_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz  sub-07_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz  sub-07_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-07_ses-auditoryperception_task-auditoryperception_run-02_events.tsv   sub-07_ses-auditoryperception_task-auditoryperception_run-05_events.tsv   sub-07_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-07_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz  sub-07_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz 
sub-07_ses-auditoryperception_task-auditoryperception_run-03_events.tsv   sub-07_ses-auditoryperception_task-auditoryperception_run-06_events.tsv

There are no physio files for subject 18:

adina@juseless in /data/project/studyforrest/openneuro/ds000113/sub-18/ses-auditoryperception/func on git:master
❱ ls
sub-18_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz  sub-18_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz  sub-18_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-18_ses-auditoryperception_task-auditoryperception_run-01_events.tsv   sub-18_ses-auditoryperception_task-auditoryperception_run-04_events.tsv   sub-18_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-18_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz  sub-18_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz  sub-18_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-18_ses-auditoryperception_task-auditoryperception_run-02_events.tsv   sub-18_ses-auditoryperception_task-auditoryperception_run-05_events.tsv   sub-18_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-18_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz  sub-18_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz 
sub-18_ses-auditoryperception_task-auditoryperception_run-03_events.tsv   sub-18_ses-auditoryperception_task-auditoryperception_run-06_events.tsv

and subject 19:

adina@juseless in /data/project/studyforrest/openneuro/ds000113/sub-19/ses-auditoryperception/func on git:master
❱ ls
sub-19_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz  sub-19_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz  sub-19_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-19_ses-auditoryperception_task-auditoryperception_run-01_events.tsv   sub-19_ses-auditoryperception_task-auditoryperception_run-04_events.tsv   sub-19_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-19_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz  sub-19_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz  sub-19_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-19_ses-auditoryperception_task-auditoryperception_run-02_events.tsv   sub-19_ses-auditoryperception_task-auditoryperception_run-05_events.tsv   sub-19_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-19_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz  sub-19_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz 
sub-19_ses-auditoryperception_task-auditoryperception_run-03_events.tsv   sub-19_ses-auditoryperception_task-auditoryperception_run-06_events.tsv
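Rather than listing each directory by hand, the physio files could be counted per subject. This is a sketch, assuming the file naming seen in the listings above and 8 expected runs per subject; the function names and the regular expression are illustrative only.

```python
import re
from collections import Counter
from pathlib import Path

# Matches e.g. sub-05_ses-auditoryperception_task-auditoryperception_run-01_physio.tsv.gz
PHYSIO_RE = re.compile(r"^(sub-\d+)_ses-auditoryperception_.*_run-\d+_physio\.tsv\.gz$")


def count_physio(filenames):
    """Count physio recordings per subject from an iterable of file names."""
    counts = Counter()
    for name in filenames:
        match = PHYSIO_RE.match(name)
        if match:
            counts[match.group(1)] += 1
    return counts


def incomplete_subjects(dataset_root, expected=8):
    """Return {subject: n_physio} for subjects with fewer than `expected` files."""
    result = {}
    for func_dir in sorted(Path(dataset_root).glob("sub-*/ses-auditoryperception/func")):
        subject = func_dir.parts[-3]  # .../sub-XX/ses-auditoryperception/func
        n = count_physio(p.name for p in func_dir.iterdir())[subject]
        if n < expected:
            result[subject] = n
    return result
```

Only file names are inspected, so this works on annexed files whether or not their content has been retrieved.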

The pandora/auditoryperception session on OpenNeuro has wrong stimulus file names and does not ship the audio stimulus files.

The events.tsv files of the pandora session on OpenNeuro also have the run and run_id association messed up:

from OpenNeuro:

cat sub-01_ses-auditoryperception_task-auditoryperception_run-01_events.tsv 
onset   duration    trial_type  run run_id  volume  run_volume  stim    genre   delay   catch   sound_soa   trigger_ts
0.01    6.0 rocknroll   1   6   0   0   rocknroll_002.wav   rocknroll   6   0   0.007200000000011642    1233.5005
12.0    6.0 symphonic   1   6   6   6   symphonic_003.wav   symphonic   6   0   0.002899999999954161    1245.4996
24.0    6.0 rocknroll   1   6   12  12  rocknroll_001.wav   rocknroll   6   0   0.002499999999827196    1257.4997
36.0    6.0 metal   1   6   18  18  metal_004.wav   metal   6   0   0.002600000000029468    1269.5
48.01   6.0 symphonic   1   6   24  24  symphonic_002.wav   symphonic   8   1   0.013300000000072032    1281.5003
62.0    6.0 country 1   6   31  31  country_003.wav country 6   0   0.0027000000000043656   1295.4996
74.0    6.0 country 1   6   37  37  country_002.wav country 6   0   0.0025000000000545697   1307.4993
86.0    6.0 ambient 1   6   43  43  ambient_001.wav ambient 6   0   0.002900000000181535    1319.4994
98.01   6.0 ambient 1   6   49  49  ambient_004.wav ambient 8   1   0.007800000000088403    1331.499
112.0   6.0 country 1   6   56  56  country_000.wav country 4   0   0.004100000000107684    1345.4985
122.0   6.0 symphonic   1   6   61  61  symphonic_001.wav   symphonic   6   0   0.0025000000000545697   1355.499
134.0   6.0 symphonic   1   6   67  67  symphonic_004.wav   symphonic   4   0   0.003600000000005821    1367.4987
144.0   6.0 ambient 1   6   72  72  ambient_003.wav ambient 6   0   0.0035000000000309233   1377.4986
156.0   6.0 metal   1   6   78  78  metal_003.wav   metal   4   0   0.0025000000000545697   1389.4985
166.0   6.0 metal   1   6   83  83  metal_000.wav   metal   6   0   0.0025000000000545697   1399.4991
178.0   6.0 ambient 1   6   89  89  ambient_000.wav ambient 6   0   0.002600000000029468    1411.4989
190.02  6.0 rocknroll   1   6   95  95  rocknroll_003.wav   rocknroll   8   1   0.01510000000007494 1423.4995
204.0   6.0 country 1   6   102 102 country_004.wav country 6   0   0.0027000000000043656   1437.4986
216.0   6.0 rocknroll   1   6   108 108 rocknroll_004.wav   rocknroll   4   0   0.002600000000029468    1449.4994
226.0   6.0 ambient 1   6   113 113 ambient_002.wav ambient 4   0   0.002600000000029468    1459.4995
236.0   6.0 symphonic   1   6   118 118 symphonic_000.wav   symphonic   6   0   0.0025000000000545697   1469.4986
248.01  6.0 metal   1   6   124 124 metal_002.wav   metal   8   1   0.00920000000019172 1481.4986
262.01  6.0 country 1   6   131 131 country_001.wav country 8   1   0.011099999999942156    1495.4987
276.0   6.0 metal   1   6   138 138 metal_001.wav   metal   6   0   0.003600000000005821    1509.4989
288.0   6.0 rocknroll   1   6   144 144 rocknroll_000.wav   rocknroll   6   0   0.0025000000000545697   1521.4994

from us:

cat /data/project/studyforrest/anondata/sub001/behav/task002_run001/behavdata.txt
"run","run_id","volume","run_volume","stim","genre","delay","catch","sound_soa","trigger_ts"
1,6,0,0,"rocknroll_002.wav","rocknroll",6,0,0.0072000000000116415,1233.5005
1,6,6,6,"symphonic_003.wav","symphonic",6,0,0.0028999999999541615,1245.4996
1,6,12,12,"rocknroll_001.wav","rocknroll",6,0,0.002499999999827196,1257.4997
1,6,18,18,"metal_004.wav","metal",6,0,0.0026000000000294676,1269.5
1,6,24,24,"symphonic_002.wav","symphonic",8,1,0.013300000000072032,1281.5003
1,6,31,31,"country_003.wav","country",6,0,0.0027000000000043656,1295.4996
1,6,37,37,"country_002.wav","country",6,0,0.0025000000000545697,1307.4993
1,6,43,43,"ambient_001.wav","ambient",6,0,0.002900000000181535,1319.4994
1,6,49,49,"ambient_004.wav","ambient",8,1,0.007800000000088403,1331.499
1,6,56,56,"country_000.wav","country",4,0,0.004100000000107684,1345.4985
1,6,61,61,"symphonic_001.wav","symphonic",6,0,0.0025000000000545697,1355.499
1,6,67,67,"symphonic_004.wav","symphonic",4,0,0.0036000000000058208,1367.4987
1,6,72,72,"ambient_003.wav","ambient",6,0,0.003500000000030923,1377.4986
1,6,78,78,"metal_003.wav","metal",4,0,0.0025000000000545697,1389.4985
1,6,83,83,"metal_000.wav","metal",6,0,0.0025000000000545697,1399.4991
1,6,89,89,"ambient_000.wav","ambient",6,0,0.0026000000000294676,1411.4989
1,6,95,95,"rocknroll_003.wav","rocknroll",8,1,0.015100000000074942,1423.4995
1,6,102,102,"country_004.wav","country",6,0,0.0027000000000043656,1437.4986
1,6,108,108,"rocknroll_004.wav","rocknroll",4,0,0.0026000000000294676,1449.4994
1,6,113,113,"ambient_002.wav","ambient",4,0,0.0026000000000294676,1459.4995
1,6,118,118,"symphonic_000.wav","symphonic",6,0,0.0025000000000545697,1469.4986
1,6,124,124,"metal_002.wav","metal",8,1,0.009200000000191721,1481.4986
1,6,131,131,"country_001.wav","country",8,1,0.011099999999942156,1495.4987
1,6,138,138,"metal_001.wav","metal",6,0,0.0036000000000058208,1509.4989
1,6,144,144,"rocknroll_000.wav","rocknroll",6,0,0.0025000000000545697,1521.4994
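To compare the two files column by column instead of by eye, a sketch like the following could be used. It assumes that events.tsv is tab-separated and behavdata.txt is comma-separated, and it compares only the columns both files share; all function names are made up. Numeric cells are compared with a small tolerance, because the sound_soa values in the two excerpts differ in their last printed digits.

```python
import csv
import io
import math


def read_table(text, delimiter):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def values_match(a, b, tol=1e-9):
    """Compare two cells numerically when possible, else as strings."""
    try:
        return math.isclose(float(a), float(b), rel_tol=0.0, abs_tol=tol)
    except (TypeError, ValueError):
        return a == b


def diff_shared_columns(events_rows, behav_rows):
    """Return (row_index, column, events_value, behav_value) for each mismatch."""
    if len(events_rows) != len(behav_rows):
        raise ValueError("row counts differ: {} vs {}".format(len(events_rows), len(behav_rows)))
    mismatches = []
    for i, (ev, bh) in enumerate(zip(events_rows, behav_rows)):
        for column in bh:
            # Columns that exist only in one file (e.g. onset) are skipped.
            if column in ev and not values_match(ev[column], bh[column]):
                mismatches.append((i, column, ev[column], bh[column]))
    return mismatches
```

An empty result means that the shared columns agree row by row within the tolerance.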

@bpoldrack, I'm not sure if this is going to be useful in any way, but you can find some very simple scripts that use nibabel to compare two NIfTI files here: /data/project/studyforrest/openneuro/unittests-comparison

The nibabel-header-compare.py script compares the header information reported by nibabel:

python3 nibabel-header-compare.py <nii1> <nii2>

The nibabel-img-compare.py script compares the image data:

python3 nibabel-img-compare.py <nii1> <nii2>

Obviously, you need to know what to compare against what. If you have any thoughts on what could be improved for this to be useful, just let me know (pinging @mih).

Just FTR: with regard to the presumed fslreorient2std issue, I have run and checked that for the subject we've been looking at previously.

anondata path: /data/project/studyforrest/anondata/sub001/BOLD/task001_run004/bold.nii.gz

openneuro path: /data/project/studyforrest/openneuro/ds000113/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-04_bold.nii.gz

I have run fslreorient2std again on the anondata file:

cynamon@juseless in /data/project/studyforrest/openneuro/fslhd-comparison
❱ fslreorient2std /data/project/studyforrest/anondata/sub001/BOLD/task001_run004/bold.nii.gz ./reoriented.nii.gz

Now, the FSL header information has been obtained with fslhd for all three files (anondata, openneuro, and the newly created reoriented.nii.gz):

cynamon@juseless in /data/project/studyforrest/openneuro/fslhd-comparison
❱ ls *hd.txt
newhd.txt  oldhd.txt  reorientedhd.txt

The conclusion is that fslreorient2std doesn't seem to have caused the problem:

❱ diff oldhd.txt reorientedhd.txt                      
1c1
< filename       ../anondata/sub001/BOLD/task001_run004/bold.nii.gz
---
> filename       reoriented.nii.gz
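The same check can be scripted so that the filename field, which is expected to differ, is ignored. A minimal sketch, assuming the fslhd text dumps have one field per line with the field name first; parse_fslhd and header_diff are hypothetical names.

```python
def parse_fslhd(text):
    """Parse fslhd-style 'key value' lines into a dict (value may be empty)."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return fields


def header_diff(a_text, b_text, ignore=("filename",)):
    """Return {field: (a_value, b_value)} for fields that differ, minus ignored ones."""
    a, b = parse_fslhd(a_text), parse_fslhd(b_text)
    keys = sorted((a.keys() | b.keys()) - set(ignore))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Applied to the contents of oldhd.txt and reorientedhd.txt, an empty dictionary would correspond to the diff shown above, where only the filename line differs.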

Cool, that is good to know. Hence there is no point in trying to implement something like this in the conversion. Also, the response to my original issue was along the lines of "unclear why". I'd say we stick with the output of the more modern converter that @bpoldrack is using.

psychoinformatics-de / studyforrest-data
