Open m-wierzba opened 3 years ago
FTR: With respect to re-converting phase1 and keeping it as close as possible to openneuro, it would be good to double-check the task labels, for example. /data/project/studyforrest/anondata/task_key.txt lists numbered tasks, and previous attempts to re-build phase1 use the labels aomovie and pandora. Probably good to settle on common terms?
Our local copy of the OpenNeuro dataset lives here:
/data/project/studyforrest/openneuro/ds000113
Based on the disk usage inspection, I assume that the download was successful:
cynamon@juseless in /data/project/studyforrest/openneuro/ds000113 on git:master
❱ du -h -s
423G .
To check whether the "S3 bucket error" resulted in any content missing, I have run datalad get again, but this produced no output (suggesting that no content is missing). The "S3 bucket error" is still present and has been reported to the OpenNeuro people (thx, @adswa!).
The goal now is to compare the current state of the OpenNeuro dataset against the two datasets that we are primarily interested in putting back into shape (i.e. the so-called "phase1" and "phase2" datasets).
The location of the OpenNeuro data:
/data/project/studyforrest/openneuro/ds000113
The location of the "phase1" data:
/data/project/studyforrest/anondata
The location of the "phase2" data:
/data/project/studyforrest/phase2
We want to generate lists of what's common and lists of what's unique:
- openneuro and anondata common content
- openneuro and phase2 common content
- content in anondata, but not in openneuro (and vice versa)
- content in phase2, but not in openneuro (and vice versa)
My approach to the problem would be to compare the sha signatures:
For each of the three datasets, generate a sorted sha list, e.g.:
cynamon@juseless in /data/project/studyforrest/openneuro/ds000113 on git:master
❱ find . -type f -print0 | xargs -0 sha1sum | sort > /home/cynamon/openneuro-sha-sorted
Make sure that there are no sha duplicates (unlikely) in any of the three files.
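The duplicate check could be done with a few lines of Python. This is only a minimal sketch, assuming the listings have the usual sha1sum layout (checksum, a separator, a one-character mode flag, then the path); the function names parse_listing and find_duplicates are made up for illustration. Keep in mind that files with identical content legitimately share a checksum, so a duplicate is not necessarily an error.

```python
from collections import defaultdict


def parse_listing(lines):
    """Yield (checksum, path) pairs from sha1sum-style lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # sha1sum prints "<checksum>  <path>" (text mode) or
        # "<checksum> *<path>" (binary mode): one separating space
        # followed by a one-character mode flag.
        checksum, rest = line.split(" ", 1)
        yield checksum, rest[1:]


def find_duplicates(lines):
    """Return {checksum: [paths]} for checksums that occur more than once."""
    paths_by_checksum = defaultdict(list)
    for checksum, path in parse_listing(lines):
        paths_by_checksum[checksum].append(path)
    return {c: p for c, p in paths_by_checksum.items() if len(p) > 1}


# Made-up example data, not taken from the real datasets.
example = [
    "aaa  ./a.txt\n",
    "bbb  ./b.txt\n",
    "aaa  ./copy_of_a.txt\n",
]
print(find_duplicates(example))
```

For a real listing, an open file object can be passed instead of the example list, e.g. find_duplicates(open("openneuro-sha-sorted")).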
To obtain common content for any two of the compared datasets, I would use join, e.g.:
join openneuro-sha-sorted anondata-sha-sorted
That would result in a list of the following structure:
sha1 <location1> <location2>
0007c0b19bfafa1c2a731b56b60821b8d3c857b7 ./.git/annex/objects/3K/GF/MD5E-s14734--dbbd13a005405696670bda336f96fc99.txt/MD5E-s14734--dbbd13a005405696670bda336f96fc99.txt ./.git/annex/objects/2z/0w/SHA256E-s14734--0204eaef07f3b1658af1fdc7427e1d1863bd50d164dc07a678220bfaa2338ed2.txt/SHA256E-s14734--0204eaef07f3b1658af1fdc7427e1d1863bd50d164dc07a678220bfaa2338ed2.txt
004f0b7de99745faf9a8feecce336df579ec47a7 ./.git/annex/objects/Xw/G1/MD5E-s176248186--31fc064ca9db4b4205d6b0dcc5d98ad4.nii.gz/MD5E-s176248186--31fc064ca9db4b4205d6b0dcc5d98ad4.nii.gz ./.git/annex/objects/5z/xp/SHA256E-s176248186--9dc89b01eaf861b077f3aa3bfa5718114ba6286aa853d39fc18059f3b0fc971e.nii.gz/SHA256E-s176248186--9dc89b01eaf861b077f3aa3bfa5718114ba6286aa853d39fc18059f3b0fc971e.nii.gz
0067621cd20d3579f5a472dfa5acff0a9648ab32 ./.git/annex/objects/V6/qF/MD5E-s625777885--f8fbb1b93a4b808fe43a7c788e60e345.nii.gz/MD5E-s625777885--f8fbb1b93a4b808fe43a7c788e60e345.nii.gz ./.git/annex/objects/w9/f5/SHA256E-s625777885--8c69b107ed9ff6a3bf710d050fa34c47ef6a6412a2e0bd3d8e8bcd6fcde1f863.nii.gz/SHA256E-s625777885--8c69b107ed9ff6a3bf710d050fa34c47ef6a6412a2e0bd3d8e8bcd6fcde1f863.nii.gz
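A rough Python equivalent of the join step, which keeps the file locations attached to each checksum; the same index also yields the content found on one side only. This is a sketch under the assumption that both inputs are sha1sum-style listings; index_by_checksum and compare are hypothetical helper names, and the example lines are invented.

```python
def index_by_checksum(lines):
    """Map checksum -> list of paths for sha1sum-style lines."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        checksum, rest = line.split(" ", 1)
        # Drop the one-character mode flag (" " or "*") before the path.
        index.setdefault(checksum, []).append(rest[1:])
    return index


def compare(lines_a, lines_b):
    """Return (common, only_a, only_b), each keyed by checksum."""
    a = index_by_checksum(lines_a)
    b = index_by_checksum(lines_b)
    common = {c: (a[c], b[c]) for c in a.keys() & b.keys()}
    only_a = {c: a[c] for c in a.keys() - b.keys()}
    only_b = {c: b[c] for c in b.keys() - a.keys()}
    return common, only_a, only_b


# Made-up listings standing in for the openneuro and anondata files.
openneuro = ["111  ./x.nii.gz\n", "222  ./y.tsv\n"]
anondata = ["111  ./sub001/x.nii.gz\n", "333  ./z.txt\n"]
common, only_openneuro, only_anondata = compare(openneuro, anondata)
for checksum, (paths_a, paths_b) in sorted(common.items()):
    print(checksum, paths_a, paths_b)
```

Because the paths travel with the checksum, there is no need to strip the locations before comparing and re-attach them afterwards, and the input does not have to be sorted.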
To obtain the unique content, I would use comm. Probably, I would need to cut the file locations first, compare the sha signatures only, and then glue the file locations back.
Looks sane to me :)
And I'm very much interested in this:
That would result in a list of the following structure:
sha1 <location1> <location2>
for comparing to fresh conversion.
I noticed something confusing to me:
❱ cat /data/project/studyforrest/openneuro/ds000113/recording-cardresp_physio.json
{
"SamplingFrequency": 500.0,
"StartTime": 0.0,
"Columns": [
"trigger",
"cardiac",
"respiratory"
],
"ContentDescription": "Activity recorded with a pulse oximeter (cardiac) and respiration belt."
}
This is the only sidecar file I could find that refers to recording-cardresp_physio. cardresp is the label "we" used for phase1. However, all I have there are sampling frequencies of either 100 or 200, not 500. What's up with that, @mih?
The final approach (proposed by @mih) to compare the datasets in a more human-readable fashion is the following.
We first generate a list of md5 sums in the following way:
md5sum $(git annex find) | tee /tmp/openneuro.md5
md5sum $(git annex find --branch release_openfmri1) | tee /tmp/openfmri.md5
We then compare the resulting output files with a Python script, /data/project/studyforrest/openneuro/match.py, created by @mih:
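import sys

# Map each checksum to the list of "<listing>::<path>" records that carry it.
d = {}
for f in sys.argv[1:]:
    with open(f) as listing:
        for line in listing:
            line = line.rstrip('\n')
            # md5sum writes "<checksum>  <path>"; splitting on the first
            # space leaves one leading space on the path.
            checksum, path = line.split(" ", 1)
            rec = d.get(checksum, [])
            rec.append('{}::{}'.format(f, path))
            d[checksum] = rec

# One line per checksum, listing every file that has this content.
for k, v in d.items():
    print('{}: {}'.format(k, v))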
To run the script:
python match.py openneuro.md5 openfmri.md5
cynamon@juseless in /data/project/studyforrest/openneuro
❱ head -3 latestopenneuro_vs_openfmrirelease1.txt
a4ca2772ab82a4d422afac2898f34af6: ['/tmp/openfmri.md5:: acquisition_protocols/04-sT1W_3D_TFE_TR2300_TI900_0_7iso_FS.txt', '/tmp/openneuro.md5:: sourcedata/acquisition_protocols/04-sT1W_3D_TFE_TR2300_TI900_0_7iso_FS.txt']
fc4bee45d4ae95fe65e7add53413eda4: ['/tmp/openfmri.md5:: acquisition_protocols/05-sT2W_3D_TSE_32chSHC_0_7iso.txt', '/tmp/openneuro.md5:: sourcedata/acquisition_protocols/05-sT2W_3D_TSE_32chSHC_0_7iso.txt']
d54a120f35975605801f8a1d0dddca2d: ['/tmp/openfmri.md5:: acquisition_protocols/06-VEN_BOLD_HR_32chSHC.txt', '/tmp/openneuro.md5:: sourcedata/acquisition_protocols/06-VEN_BOLD_HR_32chSHC.txt']
The number of md5 entries that are present in both datasets:
❱ cat latestopenneuro_vs_openfmrirelease1.txt | grep openfmri | grep openneuro | wc -l
1391
The number of md5 entries that are present in one dataset (e.g. openfmri), but not in the other (e.g. openneuro):
❱ cat latestopenneuro_vs_openfmrirelease1.txt | grep openfmri | grep -v openneuro | wc -l
7035
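One caveat with the grep-based counts: grep matches anywhere on the line, so a file whose path happens to contain "openfmri" or "openneuro" would be counted for the wrong dataset. A small sketch that classifies each line by the listing name in front of the "::" separator instead; it assumes the output format of match.py shown above, and count_by_sources is a made-up name.

```python
import ast
from collections import Counter


def count_by_sources(lines):
    """Count checksums per combination of listing files they occur in."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Each line is "<checksum>: <Python list of 'listing::path' strings>".
        _checksum, _, records = line.partition(": ")
        sources = {rec.split("::", 1)[0] for rec in ast.literal_eval(records)}
        counts[tuple(sorted(sources))] += 1
    return counts


# Made-up lines in the output format of match.py. The last path contains
# "openfmri" although the file is listed for openneuro only.
example = [
    "aaa: ['/tmp/openfmri.md5:: a.txt', '/tmp/openneuro.md5:: sourcedata/a.txt']",
    "bbb: ['/tmp/openfmri.md5:: b.txt']",
    "ccc: ['/tmp/openneuro.md5:: openfmri_notes.txt']",
]
for sources, n in sorted(count_by_sources(example).items()):
    print(n, sources)
```

On the real output this would be called with the open results file, e.g. count_by_sources(open("latestopenneuro_vs_openfmrirelease1.txt")).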
There are some files missing in the OpenNeuro dataset: subject 5 has only 2 (instead of 8) physio files for the auditoryperception/pandora task:
/data/project/studyforrest/openneuro/ds000113/sub-05/ses-auditoryperception/func on git:master
❱ ls
sub-05_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz sub-05_ses-auditoryperception_task-auditoryperception_run-03_events.tsv sub-05_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-05_ses-auditoryperception_task-auditoryperception_run-01_events.tsv sub-05_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz sub-05_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-05_ses-auditoryperception_task-auditoryperception_run-01_physio.tsv.gz sub-05_ses-auditoryperception_task-auditoryperception_run-04_events.tsv sub-05_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-05_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz sub-05_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz sub-05_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-05_ses-auditoryperception_task-auditoryperception_run-02_events.tsv sub-05_ses-auditoryperception_task-auditoryperception_run-05_events.tsv
sub-05_ses-auditoryperception_task-auditoryperception_run-02_physio.tsv.gz sub-05_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz
sub-05_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz sub-05_ses-auditoryperception_task-auditoryperception_run-06_events.tsv
There are no physio files for subject 7:
adina@juseless in /data/project/studyforrest/openneuro/ds000113/sub-07/ses-auditoryperception/func on git:master
❱ ls
sub-07_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz sub-07_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz sub-07_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-07_ses-auditoryperception_task-auditoryperception_run-01_events.tsv sub-07_ses-auditoryperception_task-auditoryperception_run-04_events.tsv sub-07_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-07_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz sub-07_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz sub-07_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-07_ses-auditoryperception_task-auditoryperception_run-02_events.tsv sub-07_ses-auditoryperception_task-auditoryperception_run-05_events.tsv sub-07_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-07_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz sub-07_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz
sub-07_ses-auditoryperception_task-auditoryperception_run-03_events.tsv sub-07_ses-auditoryperception_task-auditoryperception_run-06_events.tsv
There are no physio files for subject 18:
adina@juseless in /data/project/studyforrest/openneuro/ds000113/sub-18/ses-auditoryperception/func on git:master
❱ ls
sub-18_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz sub-18_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz sub-18_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-18_ses-auditoryperception_task-auditoryperception_run-01_events.tsv sub-18_ses-auditoryperception_task-auditoryperception_run-04_events.tsv sub-18_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-18_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz sub-18_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz sub-18_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-18_ses-auditoryperception_task-auditoryperception_run-02_events.tsv sub-18_ses-auditoryperception_task-auditoryperception_run-05_events.tsv sub-18_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-18_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz sub-18_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz
sub-18_ses-auditoryperception_task-auditoryperception_run-03_events.tsv sub-18_ses-auditoryperception_task-auditoryperception_run-06_events.tsv
and none for subject 19:
adina@juseless in /data/project/studyforrest/openneuro/ds000113/sub-19/ses-auditoryperception/func on git:master
❱ ls
sub-19_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz sub-19_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz sub-19_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-19_ses-auditoryperception_task-auditoryperception_run-01_events.tsv sub-19_ses-auditoryperception_task-auditoryperception_run-04_events.tsv sub-19_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-19_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz sub-19_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz sub-19_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-19_ses-auditoryperception_task-auditoryperception_run-02_events.tsv sub-19_ses-auditoryperception_task-auditoryperception_run-05_events.tsv sub-19_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-19_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz sub-19_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz
sub-19_ses-auditoryperception_task-auditoryperception_run-03_events.tsv sub-19_ses-auditoryperception_task-auditoryperception_run-06_events.tsv
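Rather than eyeballing ls output per subject, the gaps could be listed automatically. A minimal sketch, assuming the file naming seen in the listings above (sub-XX_..._run-YY_bold/events/physio); runs_without_physio is a hypothetical helper and the example list is a shortened stand-in for a real directory listing.

```python
import re
from collections import defaultdict

# Matches names such as
# "sub-05_ses-auditoryperception_task-auditoryperception_run-01_physio.tsv.gz".
PATTERN = re.compile(r"^(sub-\d+)_.*_run-(\d+)_(bold|events|physio)\.")


def runs_without_physio(filenames):
    """Return {subject: sorted runs that have a bold file but no physio file}."""
    seen = defaultdict(lambda: defaultdict(set))
    for name in filenames:
        match = PATTERN.match(name)
        if match:
            subject, run, suffix = match.groups()
            seen[subject][run].add(suffix)
    missing = {}
    for subject, runs in seen.items():
        gaps = sorted(
            run for run, kinds in runs.items()
            if "bold" in kinds and "physio" not in kinds
        )
        if gaps:
            missing[subject] = gaps
    return missing


# Shortened stand-in for a directory listing.
prefix = "_ses-auditoryperception_task-auditoryperception_"
example = [
    "sub-05" + prefix + "run-01_bold.nii.gz",
    "sub-05" + prefix + "run-01_physio.tsv.gz",
    "sub-05" + prefix + "run-03_bold.nii.gz",
    "sub-07" + prefix + "run-01_bold.nii.gz",
]
print(runs_without_physio(example))
```

For the real dataset, the file names could be collected with something like pathlib.Path(root).glob("sub-*/ses-auditoryperception/func/*") and passing each entry's .name.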
The pandora/auditoryperception session on OpenNeuro has wrong stimulus file names and does not ship the audio stimulus files.
The events.tsv files of the pandora session on OpenNeuro also mess up the run and run_id association:
From OpenNeuro:
cat sub-01_ses-auditoryperception_task-auditoryperception_run-01_events.tsv
onset duration trial_type run run_id volume run_volume stim genre delay catch sound_soa trigger_ts
0.01 6.0 rocknroll 1 6 0 0 rocknroll_002.wav rocknroll 6 0 0.007200000000011642 1233.5005
12.0 6.0 symphonic 1 6 6 6 symphonic_003.wav symphonic 6 0 0.002899999999954161 1245.4996
24.0 6.0 rocknroll 1 6 12 12 rocknroll_001.wav rocknroll 6 0 0.002499999999827196 1257.4997
36.0 6.0 metal 1 6 18 18 metal_004.wav metal 6 0 0.002600000000029468 1269.5
48.01 6.0 symphonic 1 6 24 24 symphonic_002.wav symphonic 8 1 0.013300000000072032 1281.5003
62.0 6.0 country 1 6 31 31 country_003.wav country 6 0 0.0027000000000043656 1295.4996
74.0 6.0 country 1 6 37 37 country_002.wav country 6 0 0.0025000000000545697 1307.4993
86.0 6.0 ambient 1 6 43 43 ambient_001.wav ambient 6 0 0.002900000000181535 1319.4994
98.01 6.0 ambient 1 6 49 49 ambient_004.wav ambient 8 1 0.007800000000088403 1331.499
112.0 6.0 country 1 6 56 56 country_000.wav country 4 0 0.004100000000107684 1345.4985
122.0 6.0 symphonic 1 6 61 61 symphonic_001.wav symphonic 6 0 0.0025000000000545697 1355.499
134.0 6.0 symphonic 1 6 67 67 symphonic_004.wav symphonic 4 0 0.003600000000005821 1367.4987
144.0 6.0 ambient 1 6 72 72 ambient_003.wav ambient 6 0 0.0035000000000309233 1377.4986
156.0 6.0 metal 1 6 78 78 metal_003.wav metal 4 0 0.0025000000000545697 1389.4985
166.0 6.0 metal 1 6 83 83 metal_000.wav metal 6 0 0.0025000000000545697 1399.4991
178.0 6.0 ambient 1 6 89 89 ambient_000.wav ambient 6 0 0.002600000000029468 1411.4989
190.02 6.0 rocknroll 1 6 95 95 rocknroll_003.wav rocknroll 8 1 0.01510000000007494 1423.4995
204.0 6.0 country 1 6 102 102 country_004.wav country 6 0 0.0027000000000043656 1437.4986
216.0 6.0 rocknroll 1 6 108 108 rocknroll_004.wav rocknroll 4 0 0.002600000000029468 1449.4994
226.0 6.0 ambient 1 6 113 113 ambient_002.wav ambient 4 0 0.002600000000029468 1459.4995
236.0 6.0 symphonic 1 6 118 118 symphonic_000.wav symphonic 6 0 0.0025000000000545697 1469.4986
248.01 6.0 metal 1 6 124 124 metal_002.wav metal 8 1 0.00920000000019172 1481.4986
262.01 6.0 country 1 6 131 131 country_001.wav country 8 1 0.011099999999942156 1495.4987
276.0 6.0 metal 1 6 138 138 metal_001.wav metal 6 0 0.003600000000005821 1509.4989
288.0 6.0 rocknroll 1 6 144 144 rocknroll_000.wav rocknroll 6 0 0.0025000000000545697 1521.4994
from us:
cat /data/project/studyforrest/anondata/sub001/behav/task002_run001/behavdata.txt
"run","run_id","volume","run_volume","stim","genre","delay","catch","sound_soa","trigger_ts"
1,6,0,0,"rocknroll_002.wav","rocknroll",6,0,0.0072000000000116415,1233.5005
1,6,6,6,"symphonic_003.wav","symphonic",6,0,0.0028999999999541615,1245.4996
1,6,12,12,"rocknroll_001.wav","rocknroll",6,0,0.002499999999827196,1257.4997
1,6,18,18,"metal_004.wav","metal",6,0,0.0026000000000294676,1269.5
1,6,24,24,"symphonic_002.wav","symphonic",8,1,0.013300000000072032,1281.5003
1,6,31,31,"country_003.wav","country",6,0,0.0027000000000043656,1295.4996
1,6,37,37,"country_002.wav","country",6,0,0.0025000000000545697,1307.4993
1,6,43,43,"ambient_001.wav","ambient",6,0,0.002900000000181535,1319.4994
1,6,49,49,"ambient_004.wav","ambient",8,1,0.007800000000088403,1331.499
1,6,56,56,"country_000.wav","country",4,0,0.004100000000107684,1345.4985
1,6,61,61,"symphonic_001.wav","symphonic",6,0,0.0025000000000545697,1355.499
1,6,67,67,"symphonic_004.wav","symphonic",4,0,0.0036000000000058208,1367.4987
1,6,72,72,"ambient_003.wav","ambient",6,0,0.003500000000030923,1377.4986
1,6,78,78,"metal_003.wav","metal",4,0,0.0025000000000545697,1389.4985
1,6,83,83,"metal_000.wav","metal",6,0,0.0025000000000545697,1399.4991
1,6,89,89,"ambient_000.wav","ambient",6,0,0.0026000000000294676,1411.4989
1,6,95,95,"rocknroll_003.wav","rocknroll",8,1,0.015100000000074942,1423.4995
1,6,102,102,"country_004.wav","country",6,0,0.0027000000000043656,1437.4986
1,6,108,108,"rocknroll_004.wav","rocknroll",4,0,0.0026000000000294676,1449.4994
1,6,113,113,"ambient_002.wav","ambient",4,0,0.0026000000000294676,1459.4995
1,6,118,118,"symphonic_000.wav","symphonic",6,0,0.0025000000000545697,1469.4986
1,6,124,124,"metal_002.wav","metal",8,1,0.009200000000191721,1481.4986
1,6,131,131,"country_001.wav","country",8,1,0.011099999999942156,1495.4987
1,6,138,138,"metal_001.wav","metal",6,0,0.0036000000000058208,1509.4989
1,6,144,144,"rocknroll_000.wav","rocknroll",6,0,0.0025000000000545697,1521.4994
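To pinpoint which columns actually differ between the two files, the shared columns could be compared row by row. This is a sketch only: it assumes the OpenNeuro file is tab-separated (the spaces in the paste above are taken to be rendered tabs) and that the rows are in the same order; same_value and mismatches are made-up names. Numeric cells are compared as floats so that differences in the number of printed digits do not count.

```python
import csv
import io
import math

# Columns that appear in both files (taken from the two headers above).
SHARED = ["run", "run_id", "volume", "run_volume", "stim", "genre",
          "delay", "catch", "sound_soa", "trigger_ts"]


def same_value(a, b):
    """Compare two cells numerically when possible, else as strings."""
    try:
        return math.isclose(float(a), float(b), rel_tol=1e-9, abs_tol=1e-12)
    except ValueError:
        return a == b


def mismatches(events_file, behav_file, columns=SHARED):
    """Return (row_index, column, events_value, behav_value) per difference."""
    events = list(csv.DictReader(events_file, delimiter="\t"))
    behav = list(csv.DictReader(behav_file))
    found = []
    for i, (e, b) in enumerate(zip(events, behav)):
        for col in columns:
            if not same_value(e[col], b[col]):
                found.append((i, col, e[col], b[col]))
    if len(events) != len(behav):
        shorter = min(len(events), len(behav))
        found.append((shorter, "<row count>", str(len(events)), str(len(behav))))
    return found


# First data row of each file as pasted above; tabs are assumed for the TSV.
events_text = (
    "onset\tduration\ttrial_type\trun\trun_id\tvolume\trun_volume\tstim\t"
    "genre\tdelay\tcatch\tsound_soa\ttrigger_ts\n"
    "0.01\t6.0\trocknroll\t1\t6\t0\t0\trocknroll_002.wav\trocknroll\t6\t0\t"
    "0.007200000000011642\t1233.5005\n"
)
behav_text = (
    '"run","run_id","volume","run_volume","stim","genre","delay","catch",'
    '"sound_soa","trigger_ts"\n'
    '1,6,0,0,"rocknroll_002.wav","rocknroll",6,0,0.0072000000000116415,1233.5005\n'
)
print(mismatches(io.StringIO(events_text), io.StringIO(behav_text)))
```

With real files, open file objects can be passed in place of the io.StringIO wrappers.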
@bpoldrack, I'm not sure if this is going to be useful in any way, but you can find some very simple scripts that use nibabel to compare two NIfTI files here: /data/project/studyforrest/openneuro/unittests-comparison
nibabel-header-compare.py compares the header information reported by nibabel:
python3 nibabel-header-compare.py <nii1> <nii2>
nibabel-img-compare.py compares the image data:
python3 nibabel-img-compare.py <nii1> <nii2>
Obviously, you need to know what to compare against what. If you have any thoughts on what could be improved for this to be useful, just let me know (pinging @mih).
Just FTR: with regard to the presumed fslreorient2std issue, I have run and checked that for the subject we've been looking at previously.
anondata path:
/data/project/studyforrest/anondata/sub001/BOLD/task001_run004/bold.nii.gz
openneuro path:
/data/project/studyforrest/openneuro/ds000113/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-04_bold.nii.gz
I have run fslreorient2std again on the anondata file:
cynamon@juseless in /data/project/studyforrest/openneuro/fslhd-comparison
❱ fslreorient2std /data/project/studyforrest/anondata/sub001/BOLD/task001_run004/bold.nii.gz ./reoriented.nii.gz
Now, the FSL header information has been obtained with fslhd for all three files (anondata, openneuro, and the newly created reoriented.nii.gz):
cynamon@juseless in /data/project/studyforrest/openneuro/fslhd-comparison
❱ ls *hd.txt
newhd.txt oldhd.txt reorientedhd.txt
The conclusion is that fslreorient2std doesn't seem to have caused the problem:
❱ diff oldhd.txt reorientedhd.txt
1c1
< filename ../anondata/sub001/BOLD/task001_run004/bold.nii.gz
---
> filename reoriented.nii.gz
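Since the filename field always differs between two dumps, it can be filtered out before diffing so that an empty result really means "identical headers". A minimal sketch using Python's difflib, assuming each fslhd line starts with the field name followed by whitespace; header_diff is a hypothetical helper and the field values in the example are invented.

```python
import difflib


def header_diff(old_lines, new_lines, ignore=("filename",)):
    """Unified diff of two fslhd dumps, skipping fields named in `ignore`."""
    def keep(line):
        fields = line.split()
        return not fields or fields[0] not in ignore

    old = [line for line in old_lines if keep(line)]
    new = [line for line in new_lines if keep(line)]
    return list(difflib.unified_diff(old, new, "old", "new", lineterm=""))


# Invented header dumps; only the filename differs.
old_hd = ["filename ../anondata/sub001/BOLD/task001_run004/bold.nii.gz", "dim1 160"]
reoriented_hd = ["filename reoriented.nii.gz", "dim1 160"]
print(header_diff(old_hd, reoriented_hd))
```

For the dumps on disk, the lines could be read with open("oldhd.txt").read().splitlines() and likewise for the other files.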
Cool, that is good to know. Hence there is no point in trying to implement something like this in the conversion. Also, the response to my original issues was along the lines of "unclear why". I'd say we stick with the output of the more modern converter that @bpoldrack is using.
This issue is meant to record all aspects related to the current state of the dataset available at OpenNeuro: https://openneuro.org/datasets/ds000113/versions/1.3.0
The dataset has been obtained with:
The local copy of the dataset is stored at: /data/project/studyforrest/openneuro.