kousu closed this issue 4 years ago
While logged into my admin account on the dataset server (a convenient, ethics-secure place with a fast network for doing the migrations), follow the instructions at https://www.neuro.polymtl.ca/internal_resources/list_of_computers#duke to get access to this data:
nguenther@data:~$ sudo mount -t cifs //duke.neuro.polymtl.ca/temp /mnt/duke/temp -o username=u[REDACTED],noexec
[sudo] password for nguenther:
🔐 Password for u[REDACTED]@//132.207.65.200/temp: (no echo)
nguenther@data:~$ ls /mnt/duke/temp/
alfoi andreanne charley EvaAlonsoOrtiz harris jcohen lrouhier mariehbourget matlab olivier qc_feedback sebeda uk_biobank_BIDS zougloub
nguenther@data:~$ du -hs /mnt/duke/temp/uk_biobank_BIDS/
9.6G /mnt/duke/temp/uk_biobank_BIDS/
Import it to local storage, because we suspect using git/git-annex/datalad over smb is glitchy (#12, #15?, #16):
nguenther@data:~$ time rsync -a /mnt/duke/temp/uk_biobank_BIDS/ datasets/uk-biobank
real 2m31.438s
user 0m8.327s
sys 0m23.644s
nguenther@data:~$ du -hs datasets/uk-biobank/
9.6G datasets/uk-biobank/
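Matching `du` totals are only a coarse check that the import worked. A stricter comparison, as a minimal sketch: `verify_copy` is a hypothetical helper (not something used in the migration above) that wraps `diff -rq`. It has to run before `git init`, since afterwards the destination contains a `.git` directory the source lacks, and it re-reads everything over SMB, so expect it to take at least as long as the rsync did.

```shell
# Hypothetical helper: byte-for-byte comparison of two directory trees.
# Prints "identical" and returns 0 when they match; otherwise diff lists
# the differing or missing files and the function returns 1.
verify_copy() {
    src=$1
    dst=$2
    if diff -rq -- "$src" "$dst"; then
        echo "identical"
    else
        echo "differences found" >&2
        return 1
    fi
}

# Intended usage with the paths from the transcript above:
#   verify_copy /mnt/duke/temp/uk_biobank_BIDS datasets/uk-biobank
```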
Now annex it. I don't have proper docs on this yet; I just keep referring back to the last time I annexed a dataset. I have these to work from:
First, save a ton of local space (#23); this risks losing data, but only if you don't sync to the server eagerly, which is not, I think, going to be a problem in our usage:
git config --global annex.thin true
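Note that `--global` changes the setting for every repository this account touches. If that is broader than wanted, the same option can be set per repository instead. This is a sketch of that alternative, not what was done here; `set_thin_for_repo` is a made-up name, and it has to run after `git init` because the repository must exist first.

```shell
# Hypothetical helper: enable annex.thin for a single repository only.
# $1 is the repository directory (defaults to the current directory).
set_thin_for_repo() {
    git -C "${1:-.}" config --local annex.thin true
}

# Intended usage, from inside the freshly initialised dataset:
#   set_thin_for_repo .
```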
Then run this:
git init
cat > .gitattributes <<EOF
*.nii filter=annex annex.largefiles=anything
*.nii.gz filter=annex annex.largefiles=anything
EOF
git add README.txt && git commit -m "Initial commit"
git annex init
git add .gitattributes && git commit -m "Configure git-annex"
time git add .
git commit -m "Migrate dataset from smb://duke.neuro.polymtl.ca/temp/uk_biobank_BIDS to git-annex."
git annex dead here # this is *not* a source for others to download from
git remote add origin git@data.neuro.polymtl.ca:datasets/uk-biobank
git annex sync --content origin
git annex sync --content # do it a second time because the first one glitched
It's weird to have to run `sync` twice, but that's the beans: gitolite needs a push to create the repo, but git-annex tries to start by syncing from an existing repo to make sure it's up to date. I guess I should have started with `git push origin`.
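A sketch of that push-first ordering, untested against gitolite and resting on the guess above that a plain push is enough to create the repo. `first_upload` and `run` are hypothetical helpers, and `master` is assumed as the branch name because that is what the transcripts in this thread show. Setting `DRY_RUN=1` prints the commands instead of running them.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: push first so the server-side repo exists,
# then let git-annex sync branches and file content to it.
first_upload() {
    remote=${1:-origin}
    branch=${2:-master}   # assumed branch name, taken from the transcripts
    run git push "$remote" "$branch" &&
    run git annex sync --content "$remote"
}

# Intended usage, after `git remote add origin ...`:
#   first_upload origin master
```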
Checking the work:
nguenther@data:~/datasets/uk-biobank$ git annex whereis sub-1047359/
whereis sub-1047359/anat/sub-1047359_T1w.nii.gz (1 copy)
ea71146c-9360-4041-8378-87ab0dfff167 -- git@data.neuro.polymtl.ca:~/repositories/datasets/uk-biobank.git [origin]
ok
whereis sub-1047359/anat/sub-1047359_T2w.nii.gz (1 copy)
ea71146c-9360-4041-8378-87ab0dfff167 -- git@data.neuro.polymtl.ca:~/repositories/datasets/uk-biobank.git [origin]
ok
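Spot-checking one subject works, but the same output can be scanned for the whole dataset. A minimal sketch, assuming the `whereis` text looks like the transcript above (a `whereis FILE (N copy)` line followed by one line per location, tagged `[origin]` for the server) and that filenames contain no spaces. `missing_on_remote` is a hypothetical helper.

```shell
# Hypothetical helper: read `git annex whereis` text on stdin and print
# every file that has no location line tagged with the given remote name.
missing_on_remote() {
    remote=${1:-origin}
    awk -v tag="[$remote]" '
        # Print the previous file if none of its locations matched.
        function report() { if (started && !found) print file }
        $1 == "whereis" { report(); file = $2; found = 0; started = 1; next }
        index($0, tag)  { found = 1 }
        END             { report() }
    '
}

# Intended usage from the dataset root; no output means every file
# has a copy on origin:
#   git annex whereis | missing_on_remote origin
```

git-annex's own matching options (for example `git annex find --not --in origin`) are likely a sturdier way to ask the same question, since they avoid parsing human-readable output.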
@mpompolas reported that accessing the data looks good, but it's missing the `derivatives/` folder.
We went on a bit of a hunt and Sandrine Bedard was able to provide some of these from her work. I'm unsure if these are all the derivatives that should be attached to this dataset, but they're a good start.
nguenther@data:~/datasets/uk-biobank$ git annex sync --content
commit
On branch master
nothing to commit, working tree clean
ok
pull origin
Enter passphrase for key '/home/nguenther/.ssh/id_ed25519':
ok
Import the extra files:
nguenther@data:~/datasets/uk-biobank$ sudo mount -t cifs //duke.neuro.polymtl.ca/temp /mnt/duke/temp -o username=u[REDACTED],noexec
[...]
nguenther@data:~/datasets/uk-biobank$ rsync -av /mnt/duke/temp/sebeda/test1_30_sub/data_BIDS/derivatives .
sending incremental file list
derivatives/
derivatives/labels/
derivatives/labels/sub-1000032/
derivatives/labels/sub-1000032/anat/
derivatives/labels/sub-1000032/anat/sub-1000032_T1w_RPI_r_gradcorr_seg-manual.json
derivatives/labels/sub-1000032/anat/sub-1000032_T1w_RPI_r_gradcorr_seg-manual.nii.gz
derivatives/labels/sub-1002358/
derivatives/labels/sub-1002358/anat/
derivatives/labels/sub-1002358/anat/sub-1002358_T1w_RPI_r_gradcorr_labels-manual.json
derivatives/labels/sub-1002358/anat/sub-1002358_T1w_RPI_r_gradcorr_labels-manual.nii.gz
derivatives/labels/sub-1002702/
derivatives/labels/sub-1002702/anat/
derivatives/labels/sub-1002702/anat/sub-1002702_T1w_RPI_r_gradcorr_labels-manual.json
derivatives/labels/sub-1002702/anat/sub-1002702_T1w_RPI_r_gradcorr_labels-manual.nii.gz
derivatives/labels/sub-1004614/
derivatives/labels/sub-1004614/anat/
derivatives/labels/sub-1004614/anat/sub-1004614_T1w_RPI_r_gradcorr_labels-manual.json
derivatives/labels/sub-1004614/anat/sub-1004614_T1w_RPI_r_gradcorr_labels-manual.nii.gz
sent 778,293 bytes received 224 bytes 519,011.33 bytes/sec
total size is 777,041 speedup is 1.00
nguenther@data:~/datasets/uk-biobank$ git status
On branch master
Untracked files:
(use "git add <file>..." to include in what will be committed)
derivatives/
nothing added to commit but untracked files present (use "git add" to track)
nguenther@data:~/datasets/uk-biobank$ git add derivatives/
nguenther@data:~/datasets/uk-biobank$ git commit -m "Import some derivatives created by Sandrine Bedard.
>
> These include cord segmentation on T1w and T2 FLAIR & C2-C3 label on the T1w scan, maybe others?
> I don't know what that jargon means, it's what Julien thinks it should be.
> "
(recording state in git...)
[master a7ee1a7] Import some derivatives created by Sandrine Bedard.
8 files changed, 20 insertions(+)
create mode 100755 derivatives/labels/sub-1000032/anat/sub-1000032_T1w_RPI_r_gradcorr_seg-manual.json
create mode 100755 derivatives/labels/sub-1000032/anat/sub-1000032_T1w_RPI_r_gradcorr_seg-manual.nii.gz
create mode 100755 derivatives/labels/sub-1002358/anat/sub-1002358_T1w_RPI_r_gradcorr_labels-manual.json
create mode 100755 derivatives/labels/sub-1002358/anat/sub-1002358_T1w_RPI_r_gradcorr_labels-manual.nii.gz
create mode 100755 derivatives/labels/sub-1002702/anat/sub-1002702_T1w_RPI_r_gradcorr_labels-manual.json
create mode 100755 derivatives/labels/sub-1002702/anat/sub-1002702_T1w_RPI_r_gradcorr_labels-manual.nii.gz
create mode 100755 derivatives/labels/sub-1004614/anat/sub-1004614_T1w_RPI_r_gradcorr_labels-manual.json
create mode 100755 derivatives/labels/sub-1004614/anat/sub-1004614_T1w_RPI_r_gradcorr_labels-manual.nii.gz
Double-check that git-annex annexed the .nii.gz files by looking at what git thinks is there: it should be a single line representing the annex pointer.
nguenther@data:~/datasets/uk-biobank$ git show derivatives/labels/sub-1002358/anat/sub-1002358_T1w_RPI_r_gradcorr_labels-manual.nii.gz
commit a7ee1a7b4f0e53ef79b8d95c730e070856364095 (HEAD -> master)
Author: Nick Guenther <nick.guenther@polymtl.ca>
Date: Fri Dec 11 20:16:00 2020 -0500
Import some derivatives created by Sandrine Bedard.
These include cord segmentation on T1w and T2 FLAIR & C2-C3 label on the T1w scan, maybe others?
I don't know what that jargon means, it's what Julien thinks it should be.
diff --git a/derivatives/labels/sub-1002358/anat/sub-1002358_T1w_RPI_r_gradcorr_labels-manual.nii.gz b/derivatives/labels/sub-1002358/anat/sub-1002358_T1w_RPI_r_gradcorr_labels-manual.nii.gz
new file mode 100755
index 0000000..d948070
--- /dev/null
+++ b/derivatives/labels/sub-1002358/anat/sub-1002358_T1w_RPI_r_gradcorr_labels-manual.nii.gz
@@ -0,0 +1 @@
+/annex/objects/SHA256E-s238169--ddd6504851f1947ac0306347a7c2a3a9d8d0bcd6c7bae705ef17d6d583bf2b02.nii.gz
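The same check can be run over every committed `.nii.gz` rather than one file. A minimal sketch, assuming pointers always begin with `/annex/objects/` as in the diff above, that it runs from the repository root, and that filenames are plain enough for `git ls-files` not to quote them. `is_annex_pointer` and `list_unannexed` are hypothetical helpers.

```shell
# Hypothetical helper: succeed if the blob committed at HEAD for path $1
# starts with the annex pointer prefix (15 bytes: "/annex/objects/").
is_annex_pointer() {
    git cat-file -p "HEAD:$1" </dev/null | head -c 15 |
        LC_ALL=C grep -q '^/annex/objects/'
}

# Hypothetical helper: print tracked files matching a pattern whose
# committed content is NOT an annex pointer, i.e. files that went into
# git itself instead of the annex.
list_unannexed() {
    pattern=${1:-*.nii.gz}
    git ls-files -- "$pattern" |
    while IFS= read -r path; do
        is_annex_pointer "$path" || printf '%s\n' "$path"
    done
}

# Intended usage from the dataset root; no output means every match
# is a pointer:
#   list_unannexed '*.nii.gz'
```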
Good. Upload it:
nguenther@data:~/datasets/uk-biobank$ git annex sync --content
commit
On branch master
nothing to commit, working tree clean
ok
pull origin
Enter passphrase for key '/home/nguenther/.ssh/id_ed25519':
ok
copy derivatives/labels/sub-1000032/anat/sub-1000032_T1w_RPI_r_gradcorr_seg-manual.nii.gz (to origin...)
ok
copy derivatives/labels/sub-1002358/anat/sub-1002358_T1w_RPI_r_gradcorr_labels-manual.nii.gz (to origin...)
ok
copy derivatives/labels/sub-1002702/anat/sub-1002702_T1w_RPI_r_gradcorr_labels-manual.nii.gz (to origin...)
ok
copy derivatives/labels/sub-1004614/anat/sub-1004614_T1w_RPI_r_gradcorr_labels-manual.nii.gz (to origin...)
ok
pull origin
ok
(recording state in git...)
push origin
Enumerating objects: 51, done.
Counting objects: 100% (51/51), done.
Compressing objects: 100% (37/37), done.
Writing objects: 100% (48/48), 3.85 KiB | 3.85 MiB/s, done.
Total 48 (delta 14), reused 0 (delta 0), pack-reused 0
remote: Checking connectivity: 48, done.
To data.neuro.polymtl.ca:datasets/uk-biobank
ec74c49..1749d94 git-annex -> synced/git-annex
b1893db..a7ee1a7 master -> synced/master
ok
Done for now! If we find more derivatives I'll reopen.
alexfoias has made a minimal viable dataset out of our copy of the UK Biobank data for doing some experiments on. It is currently sitting on our internal server at `smb://duke/tmp/uk_biobank_BIDS`. Julien wants it to be named `uk-biobank`. I can do that along the way.