@jcohenadad @kousu As per discussion:
data-multi-subject.spine-generic.neuropoly
http://data-multi-subject.spine-generic.neuropoly.s3.amazonaws.com
I'll take care of this. I'll look around to see if someone's written a git annex migrate, and if not I'll read git lfs migrate. I assume it involves some clever git filter-branch command.
I've hit a snag. git-annex-remote-s3 hasn't been maintained, and now it's not compatible with Amazon locations outside of the US.
There's a note on https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3api/create-bucket.html:
> If you send your create bucket request to the s3.amazonaws.com endpoint, the request goes to the us-east-1 Region. Accordingly, the signature calculations in Signature Version 4 must use us-east-1 as the Region, even if the location constraint in the request specifies another Region where the bucket is to be created. If you create a bucket in a Region other than US East (N. Virginia), your application must be able to handle 307 redirect. For more information, see Virtual Hosting of Buckets.
which sounds like this problem, though I haven't proven it to myself yet.
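EDIT: for the record, the awscli workaround I tried looked something like this (a sketch; it creates the bucket manually in the right region first, reusing the test bucket name from below, so that git-annex wouldn't have to create it):

# create the bucket against the regional endpoint, so Signature V4 is computed for ca-central-1
aws s3api create-bucket \
    --bucket test015.bash.neuropoly \
    --region ca-central-1 \
    --create-bucket-configuration LocationConstraint=ca-central-1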
I can probably patch git-annex locally to get uploading working -- the bug is only on the uploading side; downloads should be unaffected, but I'll have to dig into that. I don't know Haskell.
This only affects bucket creation, right? For our use cases, buckets could be created manually in the proper region and then used by the git-annex remotes.
Surprisingly, I cannot find documentation for creating an S3 remote pointing to an existing bucket; it seems initremote type=S3 will attempt to create the bucket no matter what? That can't be.
@Drulex I edited my post with an attempt at working around it with awscli; it didn't work.
This is reinforcing my impression that git-annex is a buggy, sprawling codebase.
You can look through the code in
git clone --depth 1 git://git-annex.branchable.com/ git-annex
cd git-annex/Remote/
vi S3.hs
and try to figure it out with me.
Maybe it's not a bug in git-annex after all; they're using https://hackage.haskell.org/package/aws, which must be responsible for doing the signature computations.
Thanks to @bpinsard, we got this working; the trick was to force signature=v4:
$ git annex initremote s315 type=S3 bucket="test015.bash.neuropoly" host="s3.ca-central-1.amazonaws.com" encryption=none public=yes publicurl="http://test015.bash.neuropoly.s3.ca-central-1.amazonaws.com" datacenter="ca-central-1" signature=v4
initremote s315 (checking bucket...) (creating bucket in ca-central-1...) ok
(recording state in git...)
Note: sample from @Drulex at https://github.com/Drulex/git-annex-s3-public-test.
Prototype upload:
# in a fresh directory, with this repo checked out nearby
$ cp -pr ../data-multi-subject/{sub-stanford04,.{gitignore,travis.yml,bidsignore}} .
$ git init
Initialized empty Git repository in /home/kousu/src/neuropoly/datalad/data-multi-subject-2/.git/
$ cat > .gitattributes <<EOF
> * filter=annex annex.largefiles=nothing
> *.nii annex.largefiles=anything
> *.nii.gz annex.largefiles=anything
> EOF
$ git annex init
init (scanning for unlocked files...)
ok
(recording state in git...)
$ git add .
$ git commit -m "initial commit"
(recording state in git...)
[master (root-commit) 8680c82] initial commit
20 files changed, 353 insertions(+)
create mode 100644 .bidsignore
create mode 100644 .gitattributes
create mode 100644 .gitignore
create mode 100644 .travis.yml
create mode 100644 sub-stanford04/anat/sub-stanford04_T1w.json
create mode 100644 sub-stanford04/anat/sub-stanford04_T1w.nii.gz
create mode 100644 sub-stanford04/anat/sub-stanford04_T2star.json
create mode 100644 sub-stanford04/anat/sub-stanford04_T2star.nii.gz
create mode 100644 sub-stanford04/anat/sub-stanford04_T2w.json
create mode 100644 sub-stanford04/anat/sub-stanford04_T2w.nii.gz
create mode 100644 sub-stanford04/anat/sub-stanford04_acq-MToff_MTS.json
create mode 100644 sub-stanford04/anat/sub-stanford04_acq-MToff_MTS.nii.gz
create mode 100644 sub-stanford04/anat/sub-stanford04_acq-MTon_MTS.json
create mode 100644 sub-stanford04/anat/sub-stanford04_acq-MTon_MTS.nii.gz
create mode 100644 sub-stanford04/anat/sub-stanford04_acq-T1w_MTS.json
create mode 100644 sub-stanford04/anat/sub-stanford04_acq-T1w_MTS.nii.gz
create mode 100644 sub-stanford04/dwi/sub-stanford04_dwi.bval
create mode 100644 sub-stanford04/dwi/sub-stanford04_dwi.bvec
create mode 100644 sub-stanford04/dwi/sub-stanford04_dwi.json
create mode 100644 sub-stanford04/dwi/sub-stanford04_dwi.nii.gz
$ git annex initremote amazon type=S3 bucket="test003.kousu.neuropoly" host="s3.ca-central-1.amazonaws.com" datacenter="ca-central-1" signature="v4" encryption="none" public="yes" publicurl="http://test003.kousu.neuropoly.s3.ca-central-1.amazonaws.com"
initremote amazon (checking bucket...) (creating bucket in ca-central-1...) ok
(recording state in git...)
$ git annex sync --content
commit
On branch master
nothing to commit, working tree clean
ok
copy sub-stanford04/anat/sub-stanford04_T1w.nii.gz (checking amazon...) (to amazon...)
ok
copy sub-stanford04/anat/sub-stanford04_T2star.nii.gz (checking amazon...) (to amazon...)
ok
copy sub-stanford04/anat/sub-stanford04_T2w.nii.gz (checking amazon...) (to amazon...)
ok
copy sub-stanford04/anat/sub-stanford04_acq-MToff_MTS.nii.gz (checking amazon...) (to amazon...)
ok
copy sub-stanford04/anat/sub-stanford04_acq-MTon_MTS.nii.gz (checking amazon...) (to amazon...)
ok
copy sub-stanford04/anat/sub-stanford04_acq-T1w_MTS.nii.gz (checking amazon...) (to amazon...)
ok
copy sub-stanford04/dwi/sub-stanford04_dwi.nii.gz (checking amazon...) (to amazon...)
ok
(recording state in git...)
$ git annex whereis
whereis sub-stanford04/anat/sub-stanford04_T1w.nii.gz (2 copies)
a38383f4-4d12-4319-921f-16867d42e18e -- kousu@requiem:~/src/neuropoly/datalad/data-multi-subject-2 [here]
fd841f65-3ed2-4866-bfc2-1e89bf1640b7 -- [amazon]
amazon: http://test003.kousu.neuropoly.s3.ca-central-1.amazonaws.com/SHA256E-s26601365--96160ffb56794291335842b66284e50bad48690fcfeb901495567579e9af394a.nii.gz
ok
whereis sub-stanford04/anat/sub-stanford04_T2star.nii.gz (2 copies)
a38383f4-4d12-4319-921f-16867d42e18e -- kousu@requiem:~/src/neuropoly/datalad/data-multi-subject-2 [here]
fd841f65-3ed2-4866-bfc2-1e89bf1640b7 -- [amazon]
amazon: http://test003.kousu.neuropoly.s3.ca-central-1.amazonaws.com/SHA256E-s3335765--a27be1a65368768c9a067bbd23821c1c1869579ed576a9367e4c3fc65b4f96f5.nii.gz
ok
whereis sub-stanford04/anat/sub-stanford04_T2w.nii.gz (2 copies)
a38383f4-4d12-4319-921f-16867d42e18e -- kousu@requiem:~/src/neuropoly/datalad/data-multi-subject-2 [here]
fd841f65-3ed2-4866-bfc2-1e89bf1640b7 -- [amazon]
amazon: http://test003.kousu.neuropoly.s3.ca-central-1.amazonaws.com/SHA256E-s7728510--59cf6600acbfd789a7273a83c7e2c035d53e5ee0db9154890f6d26803fb06f1d.nii.gz
ok
whereis sub-stanford04/anat/sub-stanford04_acq-MToff_MTS.nii.gz (2 copies)
a38383f4-4d12-4319-921f-16867d42e18e -- kousu@requiem:~/src/neuropoly/datalad/data-multi-subject-2 [here]
fd841f65-3ed2-4866-bfc2-1e89bf1640b7 -- [amazon]
amazon: http://test003.kousu.neuropoly.s3.ca-central-1.amazonaws.com/SHA256E-s1169659--1173fcf48fe28896ffa395a3aaa7dd935c84688d2b0cab527efc87d4420e0181.nii.gz
ok
whereis sub-stanford04/anat/sub-stanford04_acq-MTon_MTS.nii.gz (2 copies)
a38383f4-4d12-4319-921f-16867d42e18e -- kousu@requiem:~/src/neuropoly/datalad/data-multi-subject-2 [here]
fd841f65-3ed2-4866-bfc2-1e89bf1640b7 -- [amazon]
amazon: http://test003.kousu.neuropoly.s3.ca-central-1.amazonaws.com/SHA256E-s1099297--13b43f18c00f0bc7908766fb437e5c80238810fea016e40a79417e62d0187075.nii.gz
ok
whereis sub-stanford04/anat/sub-stanford04_acq-T1w_MTS.nii.gz (2 copies)
a38383f4-4d12-4319-921f-16867d42e18e -- kousu@requiem:~/src/neuropoly/datalad/data-multi-subject-2 [here]
fd841f65-3ed2-4866-bfc2-1e89bf1640b7 -- [amazon]
amazon: http://test003.kousu.neuropoly.s3.ca-central-1.amazonaws.com/SHA256E-s1072679--c6e9f9e20eb11905155e3cf05a1fe3a59a1065026204c8d38da55ec687ce8f9d.nii.gz
ok
whereis sub-stanford04/dwi/sub-stanford04_dwi.nii.gz (2 copies)
a38383f4-4d12-4319-921f-16867d42e18e -- kousu@requiem:~/src/neuropoly/datalad/data-multi-subject-2 [here]
fd841f65-3ed2-4866-bfc2-1e89bf1640b7 -- [amazon]
amazon: http://test003.kousu.neuropoly.s3.ca-central-1.amazonaws.com/SHA256E-s3407685--34668274ff76c256aba916139c531a2dd71b0b9ec6352b19ad92622d94f1b95f.nii.gz
ok
You can try: http://test003.kousu.neuropoly.s3.ca-central-1.amazonaws.com/SHA256E-s3407685--34668274ff76c256aba916139c531a2dd71b0b9ec6352b19ad92622d94f1b95f.nii.gz and the rest are available online.
I'm changing the naming convention to data-multi-subject---spine-generic---neuropoly because when I tried to use https:// I found Amazon's cert is for *.s3.ca-central-1.amazonaws.com, which means only one level of subdomain is supported.
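You can see the limitation with a quick check (hypothetical curl probe; a wildcard cert only matches a single DNS label):

# dotted bucket name: extra labels, so *.s3.ca-central-1.amazonaws.com can't match -> TLS error
curl -I https://data-multi-subject.spine-generic.neuropoly.s3.ca-central-1.amazonaws.com/
# dashed bucket name: a single label, so the certificate matches
curl -I https://data-multi-subject---spine-generic---neuropoly.s3.ca-central-1.amazonaws.com/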
Alright, it's migrated to the annex and uploaded to Amazon. It's at https://github.com/kousu/spine-generic-data-multi-subject/tree/master-annexed. I didn't make a fork for this because I want to rewrite the history completely, so no PR; if it looks good @jcohenadad you're going to have to do some command line shenanigans to get it in (and then we'll have to walk everyone contributing to the repo through switching).
To test it:
git clone https://github.com/kousu/spine-generic-data-multi-subject data-multi-subject-annexed
cd data-multi-subject-annexed
git annex init
git annex get .
To overwrite the current branch:
cd data-multi-subject # go back to this repo
git remote add kousu https://github.com/kousu/spine-generic-data-multi-subject.git
git fetch kousu
git checkout git-annex
git reset --hard kousu/git-annex
git checkout master
git reset --hard kousu/master-annexed
Then to publish, unprotect master in the project settings, and do
git push -f origin master
git push -f origin git-annex
then reprotect master.
> Then to publish, unprotect master in the project settings, and do
@kousu will it be possible for people to contribute via PR (as is currently the case)? The advantage is that it enables me to go through the PR and look at the proposed changes before pushing to master. What I understand from the procedure above is that to update master, we do it directly by git push (instead of going through a PR)? Or maybe people can push to a branch (linked to a PR), but then how do we do the merge to master?
@kousu do i need to "get" everything in order to test how things work (30GB is big and videotron is not being very supportive...)? e.g.: can i just get a few files, modify them, then upload? and/or upload new files?
> @kousu do i need to "get" everything in order to test how things work (30GB is big and videotron is not being very supportive...)? e.g.: can i just get a few files, modify them, then upload? and/or upload new files?
No, you can simply clone the repo, enable the remote and get the file/dataset you want.
git clone git@github.com:kousu/spine-generic-data-multi-subject.git
cd spine-generic-data-multi-subject/
git annex init
git annex info # shows you info about the remotes, in this case we see that it's named amazon
git annex enableremote amazon # enable the remote
git annex list # optional, list files tracked via git-annex
git annex get sub-amu01 # get sub-amu01
@kousu how come the files don't show up as symlinks? In the test repo I had set up, the files tracked via git-annex store the location in this form: .git/annex/objects/..., whereas in this repo they show up as /annex/objects/....
See https://github.com/Drulex/git-annex-s3-public-test/blob/master/file_a.file vs https://github.com/kousu/spine-generic-data-multi-subject/blob/master-annexed/sub-amu01/anat/sub-amu01_T1w.nii.gz for example comparison.
I suspect something is wrong, because the annexed files should show up as symlinks, and when resolved should have permissions preventing writing to them until they are unlocked (see https://git-annex.branchable.com/walkthrough/modifying_annexed_files/); this is not the case here.
I can't even modify a file.
$ git annex unlock sub-amu02_acq-b0_dwi.json
$ echo "test" >> sub-amu02_acq-b0_dwi.json
$ git commit sub-amu02_acq-b0_dwi.json -m "modified file"
error: file write error: No space left on device
fatal: unable to write loose object file
I think something went wrong during the migration.
EDIT: disregard the no space left on device issue, that's my bad I was working in tmpfs. The other concern remains though.
so, this is what i did:
git clone https://github.com/kousu/spine-generic-data-multi-subject data-multi-subject-annexed
cd data-multi-subject-annexed/
git annex init
git annex get
--> worked, files are being downloaded. I stopped after 13GB (when i saw https://github.com/spine-generic/data-multi-subject/issues/20#issuecomment-669328273)
then, i tested downloading a single file:
git annex get sub-unf07/anat/sub-unf07_T1w*
--> worked
note: i didn't do the following command:
git annex enableremote amazon # enable the remote
@Drulex is that a necessary command? or is it just needed in case there is no default remote set up?
then, i checked the status:
# with git
git status
# and git-annex
git annex status
Both commands show me that all files are "modified":
…
M sub-pavia06/dwi/sub-pavia06_dwi.nii.gz
M sub-perform01/anat/sub-perform01_T1w.nii.gz
M sub-perform01/anat/sub-perform01_T2star.nii.gz
M sub-perform01/anat/sub-perform01_T2w.nii.gz
M sub-perform01/anat/sub-perform01_acq-MToff_MTS.nii.gz
M sub-perform01/anat/sub-perform01_acq-MTon_MTS.nii.gz
Is that expected, given that i did not modify these files?
When i do a git diff, i don't see any change.
> @Drulex is that a necessary command? or is it just needed in case there is no default remote set up?
From my understanding it looks like enableremote only needs to be run once by the person who created the remote (https://git-annex.branchable.com/walkthrough/using_special_remotes/); perhaps @kousu already enabled it.
> Is that expected, given that i did not modify these files?
I don't think so. I don't see that. Did you SIGINT the transfer?
> I don't think so. I don't see that. Did you SIGINT the transfer?
yes, as mentioned in https://github.com/spine-generic/data-multi-subject/issues/20#issuecomment-669335288, i did ctrl+c while it was downloading (at ~13GB). i guess that was a bad move 😬
> I suspect something is wrong, because the annexed files should show up as symlinks, and when resolved should have permissions preventing writing to them until they are unlocked (see https://git-annex.branchable.com/walkthrough/modifying_annexed_files/); this is not the case here.
I configured it to use git-annex smudge because I thought the symlinks would be more confusing than helpful. See .gitattributes. Turns out it can work more like git-lfs than I thought :flushed:
(git-annex is a tool where there are many ways to use it instead of one right way to use it, so we're going to keep stumbling over these ambiguities as we sort out how to match it to our workflows and infrastructure.)
> > @Drulex is that a necessary command? or is it just needed in case there is no default remote set up?
> From my understanding it looks like enableremote only needs to be run once by the person who created the remote (https://git-annex.branchable.com/walkthrough/using_special_remotes/); perhaps @kousu already enabled it.
Yes, I set autoenable=true. So the only thing you should have to do is git clone ...; cd ...; git annex init, or datalad install ...; cd ...; datalad get .
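So, concretely, either of these should work end-to-end (a sketch, assuming the amazon remote has autoenable=true as described):

# plain git-annex
git clone https://github.com/kousu/spine-generic-data-multi-subject.git
cd spine-generic-data-multi-subject
git annex init   # auto-enables the amazon special remote
git annex get .  # download the annexed content

# or the datalad equivalent
datalad install https://github.com/kousu/spine-generic-data-multi-subject.git
cd spine-generic-data-multi-subject
datalad get .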
Sorry sorry sorry I'm still working to get my notes together to document how I did this. I haven't found any migration tutorial and git-lfs's codebase was unhelpful though I spent a while trying to read it. In the end I just did a giant rebase over all the files which took forever, but I think I have something replicable if we have other datasets to migrate into git-annex.
> > I don't think so. I don't see that. Did you SIGINT the transfer?
> yes, as mentioned in #20 (comment), i did ctrl+c while it was downloading (at ~13GB). i guess that was a bad move 😬
It should be fine. Just do git annex get . and it will probably resume?
@jcohenadad sometimes I got a message like this when doing the migration:
git-annex: git status will show derivatives/labels/sub-brnoUhb05/anat/sub-brnoUhb05_T1w_RPI_r_seg-manual.nii.gz to be modified, since content availability has changed and git-annex was unable to update the index. This is only a cosmetic problem affecting git status; git add, git commit, etc won't be affected. To fix the git status display, you can run: git update-index -q --refresh derivatives/labels/sub-brnoUhb05/anat/sub-brnoUhb05_T1w_RPI_r_seg-manual.nii.gz
So I wonder if your files are the right files? For example, what does sha256sum sub-perform01/anat/sub-perform01_T2w.nii.gz say for you? On (non-annexed) master I see:
$ sha256sum sub-perform01/anat/sub-perform01_T2w.nii.gz
aa6dac574bf3883a6a4ada2e1feb30425811380953187994d6320ecfb96d5363 sub-perform01/anat/sub-perform01_T2w.nii.gz
Do you see the same on your annexed copy? (I am not going to check my annexed copy right now because switching branches back to the non-annexed master takes about 10 minutes on my computer)
If you check a few and they all match, you can apply the suggestion in that git-annex warning to everything with:
git status | sed -n 's/modified://p' | xargs git update-index -q --refresh
that should make their "M"s go away. (And then check the sha256s again to make sure everything is the same.)
I don't know what this means. Given that there's a warning about it I guess there's a good reason for it, but it seems like a hopefully-harmless bug in git-annex to me.
> So I wonder if your files are the right files? For example, what does sha256sum sub-perform01/anat/sub-perform01_T2w.nii.gz say for you? On (non-annexed) master I see:
> $ sha256sum sub-perform01/anat/sub-perform01_T2w.nii.gz
> aa6dac574bf3883a6a4ada2e1feb30425811380953187994d6320ecfb96d5363 sub-perform01/anat/sub-perform01_T2w.nii.gz
> Do you see the same on your annexed copy?
yup!
shasum -a 256 sub-perform01/anat/sub-perform01_T2w.nii.gz
aa6dac574bf3883a6a4ada2e1feb30425811380953187994d6320ecfb96d5363 sub-perform01/anat/sub-perform01_T2w.nii.gz
however... 😨
git status | sed -n 's/modified://p' | xargs git update-index -q --refresh
shasum -a 256 sub-vuiisIngenia04/anat/sub-vuiisIngenia04_T1w.nii.gz
e0eebf8e45776647565a1d30653fa27496a24c8cd088543dcaa83bf2ecb48a2c sub-vuiisIngenia04/anat/sub-vuiisIngenia04_T1w.nii.gz
and on https://github.com/spine-generic/data-multi-subject (master):
shasum -a 256 sub-vuiisIngenia04/anat/sub-vuiisIngenia04_T1w.nii.gz
535ddd579b34f34609cdb744278e91f07fe3d78d6cf0b266bb26b2b0fe3bdfd6 sub-vuiisIngenia04/anat/sub-vuiisIngenia04_T1w.nii.gz
however, the perform site still has the same sha256 between your repo and the data-multi-subject repo.
EDIT 2020-08-05 16:24:17: mea culpa: i did not "git annex get" the sub-vuiisIngenia04_T1w.nii.gz file. After i downloaded it, the sha256s are the same 😅 . Sorry for the false alarm.
and yes: the git status is now clean:
$ git status
On branch master-annexed
Your branch is up to date with 'origin/master-annexed'.
nothing to commit, working tree clean
Awesome. It's working.
So update-index fixed it? I never saw anything in the datalad handbook about that, but it never talks about failed/interrupted downloads either. We'll have to keep that in mind in our own handbooks.
> So update-index fixed it?
seems like it! or it was some cosmic rays flowing in the right direction
Something else to keep in mind is that git annex smudge stores the name of the .git/annex object in the .git/objects file on check-in -- it's basically the same as storing a symlink, but with an extra layer of invisible translation between the pointer and the real file (this is the identical strategy git-lfs takes). It also seems to be the recommended setup now?
You can see these real, un-smudged contents with git log -p or git show. For example, in this commit you can see derivatives/labels/sub-cmrra02/anat/sub-cmrra02_T2w_RPI_r_seg-manual.nii.gz == /annex/objects/SHA256E-s33324--2d1c38d7bb95d25059508fe48253c450149148ae6e21661f5aadae4d0353ed7a.nii.gz
$ git log -p master-annexed
commit c3c52276ae7fc8266b761cc190eb8471b28c5171 (kousu/master-annexed, master-annexed)
Author: PaulBautin <pbautin70@gmail.com>
Date: Fri Jul 31 12:23:37 2020 -0400
- remove cmrra02
- remove sapienza04 gmseg
- correct gmseg for sapienza01
diff --git a/derivatives/labels/sub-cmrra02/anat/sub-cmrra02_T2w_RPI_r_seg-manual.json b/derivatives/labels/sub-cmrra02/anat/sub-cmrra02_T2w_RPI_r_seg-manual.json
deleted file mode 100644
index 2e8eb40d..00000000
--- a/derivatives/labels/sub-cmrra02/anat/sub-cmrra02_T2w_RPI_r_seg-manual.json
+++ /dev/null
@@ -1,4 +0,0 @@
-{
- "Author": "Paul Bautin",
- "Date": "2020-07-30 11:56:40"
-}
\ No newline at end of file
diff --git a/derivatives/labels/sub-cmrra02/anat/sub-cmrra02_T2w_RPI_r_seg-manual.nii.gz b/derivatives/labels/sub-cmrra02/anat/sub-cmrra02_T2w_RPI_r_seg-manual.nii.gz
deleted file mode 100644
index 9bb8780e..00000000
--- a/derivatives/labels/sub-cmrra02/anat/sub-cmrra02_T2w_RPI_r_seg-manual.nii.gz
+++ /dev/null
@@ -1 +0,0 @@
-/annex/objects/SHA256E-s33324--2d1c38d7bb95d25059508fe48253c450149148ae6e21661f5aadae4d0353ed7a.nii.gz
The different shasum you saw would have been the shasum of something like "/annex/objects/SHA256E-s33324--2d1c38d7bb95d25059508fe48253c450149148ae6e21661f5aadae4d0353ed7a.nii.gz".
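(You can convince yourself of that by checksumming the pointer text itself -- a sketch; I'm not sure whether the stored pointer ends with a newline, which would change the digest:)

printf '/annex/objects/SHA256E-s33324--2d1c38d7bb95d25059508fe48253c450149148ae6e21661f5aadae4d0353ed7a.nii.gz' | sha256sum
# -> a digest unrelated to the image's own sha256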
This is going to be a perennial inconsistency (and probably git-lfs suffers from it too, if you cancel its checkouts) but I'm glad to see that just restarting the download fixed it.
I made a mistake! I discovered that I had reset all files that got edited to their initial versions.
The fixed repo is at https://github.com/kousu/spine-generic-data-multi-subject/tree/master-reannexed.
Migrating an existing dataset into git-annex
============================================
One way to deploy git-annex is simply to annex every single file. That seems to be what datalad does. Another is to pick files step-by-step, choosing between, say, git add derivatives/labels/sub-brnoUhb01/ and git annex add sub-stanford06/.
I think the former is overkill, because git-annex gets in the way; particularly, git diff stops working on the .json and .md files (there is a workaround but it is difficult to configure).
git-annex has a largefiles option which lets you automatically annex, for example, everything over 10MB, 50MB, whatever you want; that's a solution, but it will also lead to questions about weird inconsistencies like "why is my BIDS sidecar file missing for this image but not for that one?"
In our use case, we know what files need to be annexed: the .nii.gz's; everything else can go in plain-git and benefit from its simpler system (and will save us some money by reducing the number of HTTP GETs Amazon charges us for).
To verify this hypothesis, I ran:
(find . -name .git -prune -o -type f -exec ls -l {} \;) | sort -k 5 -n | tee /tmp/filesizes.txt
and it showed that all the largest files are .nii.gz's, and the smallest are .bvec, .bval, and .json, plus other scattered files:
Indeed, counting by extension underscores the point:
$ find . -type f -iname "*.bvec" -print0 | du --files0-from=- -ch | tail -n 1
1016K total
$ find . -type f -iname "*.bval" -print0 | du --files0-from=- -ch | tail -n 1
1016K total
$ find . -type f -iname "*.json" -print0 | du --files0-from=- -ch | tail -n 1
7.1M total
$ find . -type f -iname "*.nii.gz" -print0 | du --files0-from=- -ch | tail -n 1
11G total
The .nii.gz files account for > 99.9% of the dataset.
git-annex offers many storage backends, including, as datalad points out, a whole sub-tree of options via rclone. There are even more options than this, because, for example, datalad has been writing more. This flexibility is one of the nice features of git-annex; it's also a bit of a curse, because there's a complicated series of tradeoffs to evaluate for each dataset.
Our goal with this dataset is to have
In the academic world, https://osf.io, https://openneuro.org, and https://zenodo.org do hosting, and https://academictorrents.com does data hosting. OSF has a git-annex plugin but is blocked by Iran's and China's country-level blocklists, openneuro doesn't (yet) have a git-annex plugin, and I haven't seen one for Zenodo either; AcademicTorrents is a very interesting option which would be partially supported in git-annex, but there's a lot of friction between the data models [1], and anyway they don't have infinite space; they only host your data if you ask nicely and get approved. There is potential storage at Unité Neuroimagerie Fonctionelle (via @bpinsard's team) but it's not generally available yet.
We could maybe put up our own public storage on polymtl.ca's infra but that would have relatively slow turnaround time.
We can't use Github because Github doesn't support git-annex.
We could maybe have used gitlab.com which does support git-annex, though they've deprecated it in favour of git-lfs.
So in the end, we looked around and priced out some commercial options: https://docs.google.com/spreadsheets/d/1MN2SONdHrgTtTKOBrCC9bgJAZNe6Bd1rLSVozEQ8ng4/edit#gid=0 and decided to go with Amazon, which is what OpenNeuro is using underneath anyway.
Amazon has regional datacentres which can be chosen using a code (e.g. ap-south-1 for Mumbai or us-west-1 for Silicon Valley). The full list of codes is here: https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region. We decided to host our primary copies in Canada, so that's ca-central-1.
I should remind us here that while Amazon has a CDN -- CloudFront -- we don't have it set up. So while our content should be globally accessible it will generally be slower to get to researchers in farther countries. We will probably have to revisit this, buy CloudFront too, and reconfigure git-annex to download from the CDN (while still uploading to ca-central-1). We could also choose instead to combine a different company's CDN like CloudFlare, BunnyCDN, KeyCDN, Akamai, StackPath, Azure, Fastly, OVH, LimeLight.... And while I'm listing competitors, by the way, both DigitalOcean's and Vultr's Amazon-compatible storage have CDNs built in.
To get git-lfs's feature that files are annexed by file extension, I used its method of using .gitattributes: all git add'ed .nii and .nii.gz files pass through a commit "clean" filter which "cleans" them by replacing them with a pointer to the annex. This is something git-annex seems to fully support just fine, though the docs encourage using it by size, not extension. But its "anything" mode does what I want:
# .gitattributes
* filter=annex annex.largefiles=nothing
*.nii filter=annex annex.largefiles=anything
*.nii.gz filter=annex annex.largefiles=anything
This makes git automatically sort out which files are annexed or not, and as a side-effect it means that the files in the repo are always the real files, never dangling symlinks, and there's no need for users to understand or invoke git annex unlock.
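(For the curious: git annex init wires the filter up in .git/config; on my machine it looks something like this:)

[filter "annex"]
	smudge = git-annex smudge -- %f
	clean = git-annex smudge --clean -- %f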
The command for this turned out to be:
$ export AWS_ACCESS_KEY_ID="..." AWS_SECRET_ACCESS_KEY="..."
$ git annex initremote amazon type=S3 bucket="$BUCKET" host="s3.ca-central-1.amazonaws.com" encryption=none public=yes publicurl="https://$BUCKET.s3.ca-central-1.amazonaws.com" datacenter="ca-central-1" signature=v4
The naming scheme is that https://github.com/$ORGANIZATION/$DATASET.git uses BUCKET="$DATASET---$ORGANIZATION---neuropoly".
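Concretely, for this dataset that works out to:

# https://github.com/spine-generic/data-multi-subject.git
# -> BUCKET="data-multi-subject---spine-generic---neuropoly"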
Caveats:
- $BUCKET must be a DNS-compatible name; so: ASCII only.
- $BUCKET must not contain dots; I don't know if this applies to all of S3, but the Canadian datacenter at least only has a cert for *.s3.ca-central-1.amazonaws.com, which means it only supports one level of subdomain.

It's not actually necessary to configure Amazon before doing the migration. It is just another mirror as far as git-annex is concerned, and mirrors can be added at any time.
Probably the simplest migration would be to remake the repository from scratch:
# migrate, destroying history
git checkout master
rm -rf .git
git init
git annex init
cat > .gitattributes <<EOF
* filter=annex annex.largefiles=nothing
*.nii annex.largefiles=anything
*.nii.gz annex.largefiles=anything
EOF
git add .gitattributes
git commit -m "Configure git-annex"
git add .
git commit -m "Import dataset."
But we wanted to preserve the history. And if we just make a new commit to do the import -- not doing it by a complete rewrite of the commits -- then there's no point because the large files will still be in .git/objects.
I tried to read the source for git lfs migrate to see how they did it but I got lost. I was hoping there would be some quick trick that didn't involve checking out each and every commit, but I didn't find it. Maybe if I were a git grand-wizard I could understand their codebase enough to figure it out. Maybe something based on git fast-import? I don't really know.
Instead, I just used a rebase. A giant, tedious, rebase.
I know from gitattributes(5) that clean/smudge filters like git-annex are run on git add and git checkout respectively, so I need to run git add at each commit. If a repo starts out clean (i.e. no uncommitted files) then during a rebase, each step is also clean, so git add . will correctly add all the needed files. I also know from git-rebase(1) that git reset HEAD^ . will temporarily undo a commit's changes -- effectively it's a 1-step revert -- without editing the history or changing what commit you're on (NOTE: git reset HEAD^ without the . changes the commit too! The . is critical!). But there's a catch: deleted files need special handling. And you can automate running a command at each commit with git rebase -x.
So, putting this all together:
Make a script:
cat > /tmp/migrate-git.sh <<'EOF'
#!/bin/sh
# (the quoted 'EOF' above keeps the shell from expanding awk's $2 below)
git reset HEAD^ . &&
# confirm we want to delete the deleted files
( LANG=C git status | awk '
/Changes to be committed:/ { UNSTAGED=0 }
/Changes not staged for commit:/ { UNSTAGED=1 }
/Untracked files:/ { UNSTAGED=0 }
UNSTAGED==1 && /deleted:/ { print $2 }
' | xargs -r git rm) && # -r: skip git rm when this step deleted nothing
git add . && # readd the added/changed files
git commit --amend --no-edit
EOF
chmod +x /tmp/migrate-git.sh
# migrate, preserving history
git checkout master
git checkout -b master-reannexed
git annex init
cat > .gitattributes <<EOF
* filter=annex annex.largefiles=nothing
*.nii annex.largefiles=anything
*.nii.gz annex.largefiles=anything
EOF
git add .gitattributes
git commit -m "Configure git-annex"
git rebase -X theirs -i --root -x '/tmp/migrate-git.sh && ( git status | sed -n s/modified://p | grep -v "exec git" | xargs git update-index -q --refresh)'
You must edit the rebase-todo before running it: move "Configure git-annex" to the start; otherwise git add won't trigger annexing. And remember to keep its -x command with it! I actually made it the second commit, because the first commit was a typical initial text-files commit and I felt that was good to leave as the start.
-X theirs means that changes to the .nii.gz files, which become conflicts as the rebase runs and git-annex replaces them with pointers, are resolved in favour of the newer version. Without this, the rebase will prompt you to manually fix each conflict one by one with git checkout --theirs [FILE ...].
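(If you skip -X theirs and land at that prompt, the manual resolution is roughly this for each conflicted file; the path here is hypothetical:)

git checkout --theirs -- sub-example01/anat/sub-example01_T1w.nii.gz
git add sub-example01/anat/sub-example01_T1w.nii.gz
git rebase --continue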
The additional git update-index -q --refresh command deals with a git-annex glitch that I don't understand. It just applies git-annex's recommended solution:
git-annex: git status will show $FILE to be modified, since content availability has changed and git-annex was unable to update the index. This is only a cosmetic problem affecting git status; git add, git commit, etc won't be affected. To fix the git status display, you can run: git update-index -q --refresh $FILE
Running this took a couple hours, and it wasn't glitch-free either:
I wish this didn't feel so fragile. But see below for verification.
Caveats:
- The merge commits are dropped. That could maybe be avoided with --preserve-merges, but I fully expect there to be even more git dragons down that road.
- Because git-annex data is tracked semi-independently of the main branch, I think it's okay to do the git-annex parts out of order like this. It's much faster, anyway, because copy rechecks if it needs to upload every single file every time.

To verify the migration was correct, compare the git log and contents before and after:
git checkout master
git log -n 1
time git log --stat --summary > /tmp/plain.log
(time find . -name ".git" -prune -o -type f -exec sha256sum {} \;) > /tmp/plain.sha256sum.txt
Here's the state of master:
Here's the state of master-reannexed: reannexed.log, reannexed.sha256sum.txt
With those we can compare the changes made by the migration:
diff -u /tmp/{plain,reannexed}.log > /tmp/log.patch
diff -u /tmp/{plain,reannexed}.sha256sum.txt > /tmp/sha256sum.patch
The changes to the commit log are: commit IDs changed, the merge commits are missing, each .nii.gz instead of saying 0 -> $n bytes now says 0 -> 1 lines, and there's the new .gitattributes commit. (There are also some spurious diffs due to the alignment shifting slightly as numbers changed.) Caveat: I did not read all 6000+ lines in the diff. I might have missed something.
The changes to the checksum show only one in the whole dataset:
+af7cb4acb11625d380ccb6e5d47d8ec9fac007112ceed708bf093acfea95e050 ./.gitattributes
so I'm confident the data is correct. The only thing that might be weird is file permissions, but I don't think we're using those here.
TODO: a more thorough check would line up commits from before and after and run the same comparison, to make sure each step of the history is correct too. But I'm out of time.
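Roughly, that check could look like this (an untested sketch: it assumes the commits pair up one-to-one, so master needs --no-merges and the new "Configure git-annex" commit has to be filtered out, and the .nii.gz stat lines would still differ bytes-vs-lines as noted above):

git log --no-merges --format=%H master > /tmp/plain.commits
git log --format=%H --grep='Configure git-annex' --invert-grep master-reannexed > /tmp/reannexed.commits
paste /tmp/plain.commits /tmp/reannexed.commits | while read plain reannexed; do
    # compare each pair's diffstat; the empty --format= suppresses the commit headers
    diff <(git show --stat --format= "$plain") <(git show --stat --format= "$reannexed")
done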
At this point, the files are all still local, though they are "annexed" -- if I were to make a git-clone of the local copy it would not come with the complete content. To upload the results:
# set Amazon credentials
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
git annex initremote amazon type=S3 bucket="data-multi-subject---spine-generic---neuropoly" host="s3.ca-central-1.amazonaws.com" port=443 datacenter="ca-central-1" signature="v4" autoenable="true" encryption="none" public="yes" publicurl="https://data-multi-subject---spine-generic---neuropoly.s3.ca-central-1.amazonaws.com"
git annex copy --to=amazon
git annex dead here # *unmark* the local copy
Verify the upload by looking at:
# see where git-annex thinks things are stored
git annex whereis > /tmp/whereis.txt
# use awscli to see the published results
aws s3 ls --summarize --recursive --human-readable s3://data-multi-subject---spine-generic---neuropoly > /tmp/bucket.txt
Here are those results; e.g.:
whereis derivatives/labels/sub-balgrist02/anat/sub-balgrist02_acq-T1w_MTS_seg-manual.nii.gz (1 copy)
5a5447a8-a9b8-49bc-8276-01a62632b502 -- [amazon]
amazon: https://data-multi-subject---spine-generic---neuropoly.s3.ca-central-1.amazonaws.com/SHA256E-s51875--03d807afc750dbebf54d37833f7be38a18e6deee6315d2ce4d019ae5786de25b.nii.gz
ok
whereis derivatives/labels/sub-barcelona05/anat/sub-barcelona05_T2w_csfseg-manual.nii.gz (1 copy)
5a5447a8-a9b8-49bc-8276-01a62632b502 -- [amazon]
amazon: https://data-multi-subject---spine-generic---neuropoly.s3.ca-central-1.amazonaws.com/SHA256E-s54551--557dbf64dba078b130bcc4a290f49261e2a925ce9c6993689463670075b0a055.nii.gz
ok
whereis derivatives/labels/sub-beijingGE03/anat/sub-beijingGE03_T1w_RPI_r_seg-manual.nii.gz (1 copy)
5a5447a8-a9b8-49bc-8276-01a62632b502 -- [amazon]
Here's the total size on S3:
Total Objects: 1967
Total Size: 10.0 GiB
Caveats:
- I could have used git annex sync --content instead of copy --to=amazon; with more than one remote involved this would be quicker, except that it has the nasty side-effect of wanting to connect back to GitHub to upload the git-annex branch, and on many systems that means asking for a password; so to use it reliably, you would need to temporarily remove the origin remote: git remote remove origin.
- Old versions of annexed files stick around. There's git annex unused to help find them (a sketch of running it follows below), however I'm pretty sure that only considers the latest version of each branch as "used" data, so what it reports will contain, in addition to the genuinely unused data, also older versions that we do want to preserve. I don't know what a good strategy is; do we grant everyone on the team their own pull-request bucket, and then mandate that part of merging pull requests is that whoever has permissions to the main repo goes in and runs git annex sync --content?
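For the record, that unused-data check would go something like this (a sketch; dropping is only safe after manually reviewing the listed keys):

git annex unused --from=amazon   # list objects on amazon that no current branch tip references
# then, only for keys verified to be genuinely unused:
# git annex dropunused --from=amazon 1-10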
@jcohenadad once you are satisfied that the migration is correct, you need to:
git remote add kousu https://github.com/kousu/spine-generic-data-multi-subject.git
git fetch kousu
git checkout git-annex
git reset --hard kousu/git-annex
git checkout master
git reset --hard kousu/master-reannexed
Then to publish, unprotect master in the project settings, and do
git push -f origin master
git push -f origin git-annex
Sometimes the git-annex filter will glitch. It will give you a warning in these cases, and a suggested solution. It does prevent git from doing most of its operations correctly, however, despite git-annex calling it "cosmetic". To fix it:
git status | sed -n 's/modified://p' | xargs git update-index -q --refresh
This doesn't touch file contents, so it should be safe to run anytime.
To fully delete an S3 bucket, use aws (pip install awscli). Give it your credentials and run:
aws s3 rm s3://$BUCKET --recursive
aws s3api delete-bucket --bucket $BUCKET --region ca-central-1
But this is a complete wipe. If you just need to reset the bucket so you can rerun git annex initremote, it is enough to do:
aws s3 rm s3://data-multi-subject---spine-generic---neuropoly/annex-uuid
git-annex writes to a lot of places: the git-annex branch, .git/annex (including several sqlite databases?), .git/config, and .git/info/attributes. To get rid of it without getting rid of the complete repo, you can use:
git annex deinit
> But we wanted to preserve the history.
Not necessarily. If it is easier to scrap the history, let's scrap it.
The redone migration is on the master-reannexed branch. Don't merge this until I'm done.
[ placeholder for a short doc explaining how to do pull requests in the git-annex regime. i'll migrate this to the wiki, too ]
Because switching to the non-annexed copy is Very Slow, I wanted to figure this out without actually switching to my local master and merging there. So instead of git pull I started with fetch:
$ git fetch origin
$ git --no-pager log --summary -n 1 origin/master # take a look at our desired goal
Note: I know that #21 was done as a squash merge, so there's only one commit I need to worry about here, which makes this whole process faster for me. I could redo a rebase to integrate multiple commits if I really had to, though.
I can integrate this commit with a cherry-pick:
$ git cherry-pick origin/master
This gave that glitch mentioned above so I had to:
$ # fix the "cosmetic" git-annex glitch
$ git status | sed -n s/modified://p | xargs git update-index -q --refresh
$ git status
On branch master-reannexed
nothing to commit, working tree clean
Then I wanted to see the state of the files after cherry-picking a non-annexed branch into an annexed branch. I expected all the files to be unannexed, but it turned out that only the new files were:
$ # scan the changed images files, in the top commit
$ # (there's a small hack here I couldn't figure out how to get around: this only works because every folder starts with an "s")
$ git log --summary -n 1 | grep -oP ' s.*?.nii.gz' | xargs git log -n 1 -p --oneline --
Comparing this list to the original commit summary I can tell that all the renamed files are in fact properly annexed but all the new files aren't. I'm not sure I understand why this is but I'm going to roll with it.
We can fix the missing files with the same script as before though.
$ git reset HEAD^ .
$ # confirm we want to delete the deleted files
$ ( LANG=C git status | awk '
> /Changes to be committed:/ { UNSTAGED=0 }
> /Changes not staged for commit:/ { UNSTAGED=1 }
> /Untracked files:/ { UNSTAGED=0 }
> UNSTAGED==1 && /deleted:/ { print $2 }
> ' | xargs git rm)
$ git add .
$ git commit --amend --no-edit
To make sure this was right, view the changes to the image files again:
$ git log --summary -n 1 | grep -oP ' s.*?.nii.gz' | xargs git log -n 1 -p --oneline --
So now the migrated commit retains the renames, and the image files are annex pointers instead of their content directly.
:+1:
$ export AWS_ACCESS_KEY_ID="...."
$ export AWS_SECRET_ACCESS_KEY="..."
$
$ git annex copy --to=amazon
The files for which the "ok" is on a second line are the ones that were new and needed uploading.
It takes a pretty long time to run git annex copy like this on a dataset with this many files! I think git annex sync --content or git annex copy --to=amazon subfolder/ would be faster, since they can focus on only what needs uploading; the former should be automatic, and the latter needs you to know what files/folders need uploading.
I tried to use git-annex sync to see if it would be faster. It was, but at the cost of running into a bunch of errors:
$ git annex sync --content
The root cause of the errors is that git-annex was meant to be used by a single person to distribute files amongst places they own; it doesn't really fit into the pull request model the rest of git uses.
I discovered, instead, that you can say git annex sync --content amazon, and that is both fast and correct. So that's what I'm going to recommend in the future.
$ git annex sync kousu
(git annex sync $remote is shorthand for syncing two branches, the current one and git-annex: branch="$(git branch --show-current)"; git pull $remote; git push $remote; git checkout git-annex; git pull $remote; git push $remote; git checkout $branch)
@jcohenadad please merge this to master now:
git remote add kousu https://github.com/kousu/spine-generic-data-multi-subject.git
git annex init
git annex sync kousu
git annex enableremote amazon
git annex sync --content amazon
# force-overwrite `master` with my version
git branch -f master kousu/master-reannexed
git push -f origin master
@kousu i would like to discuss the strategy for moving to git-annex, can we talk?
This is done now. /close?
> > I suspect something is wrong, because the annexed files should show up as symlinks, and when resolved should have permissions preventing writing to them until they are unlocked (see https://git-annex.branchable.com/walkthrough/modifying_annexed_files/); this is not the case here.
> I configured it to use git-annex smudge because I thought the symlinks would be more confusing than helpful. See .gitattributes. Turns out it can work more like git-lfs than I thought
I was wrong about the reason for this!! The real reason is that I set this up using git-annex v8, which is what Arch has, and what brew has, and what conda has, but not what Neurodebian has (v7), nor Debian (v6), nor oldstable Debian (v5).
This change in format was only announced in a tiny one-liner, and not at all explained on https://git-annex.branchable.com/news/, and as far as I can tell there's nothing on the wiki/forum talking about it, except for this head-scratcher: https://git-annex.branchable.com/forum/which_version_should_I_install__63__/.
I like the new format, it's better than the symlink mess. But I wish it didn't have this surprise gotcha in it.
This change means datalad needs to rewrite their docs.
I worry I've made a terrible mistake by using v8 and not v7.
Possible to re-do it with v7? But what will happen 10y from now, when people will be using git-annex v15?
Well, I read that git-annex silently does data migrations, but only in the forward direction of course, the way phone apps all work.
So what happens if someone on Debian with v7 publishes a dataset that is then cloned by a collaborator on a Mac with v8? The v8 collaborator is going to silently upgrade the entire thing to v8, and if they send a PR it's going to upgrade the published format, breaking it for the Debian user.
(git-annex was never meant for collaboration!! it's really designed for a single human controlling all the mirrors. datalad is stretching it beyond its design.)
It should be fine to consume a v7 dataset with v8, so all the published datalad and portal.conp.ca and openneuro.org datasets are still fine, but you can't be an editor on them.
We'll just need to make sure everyone is using the same version. I think we should recommend conda. I doubt you can even get v7 installed on a Mac anymore, at least not without learning the Haskell environment enough to compile it from source.
Despite its claims to the contrary, this sort of rapid incompatibility makes me wonder how future-proofed it really is. But it's also a fairly mature project so maybe the format is approaching an asymptote. :chart_with_upwards_trend:
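(If we do standardize on conda, pinning everyone to the same version should just be:)

conda install -c conda-forge git-annex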
Strategy to be described here
bucket name: neuropoly/spine-generic/data-multi-subject --> if "/" is not allowed, replace with "."