Closed kousu closed 3 years ago
Also, if either of these does help, can we set them globally? Does git annex config let us manipulate the configs of users' clones? If it works like git config then the answer is no, because git config is always local to each copy of each repo, but I suspect git annex config is special.
If not, then we'll have to document around this.
I imagine setting annex.thin everywhere is one of the many little fix-its that datalad does without telling you to work around git-annex being flaky.
Useful note, buried in an out of the way place in the docs: https://git-annex.branchable.com/git-annex-adjust/
To use less space, annex.thin can be set to true before running this command; this makes a hard link to the content be made instead of a copy. (When supported by the file system.) While this can save considerable disk space, any modification made to a file will cause the old version of the file to be lost from the local repository. So, enable annex.thin with care.
but I'm pretty sure this is really old advice; https://git-annex.branchable.com/design/adjusted_branches/ explains git-annex-adjust was designed for "Using a v6 repo with locked files on a crippled filesystem not supporting symlinks."
5 months ago, tagged "needs thought": https://git-annex.branchable.com/todo/annex.thin_without_hardlinks/
Would it be possible to make it work w/o hard links? Note that direct mode does avoid two copies of files.
(that should say 'avoided'; direct mode was removed 7 years ago)
https://git-annex.branchable.com/tips/local_caching_of_annexed_files/ keeps getting linked from all these other threads, because the proposed solution involves setting up inter-repo hardlinks (similar to git clone -l, aka how GitHub does forks: they're not really forks at all). In particular, it (and here I think that means the author, joeyh) explains:
But, using a git repository lets annex.hardlink be used to make hard links between the cache and repositories using it.
annex.hardlink can be set to true, and then git annex get will simply hardlink the files into place.
So: annex.hardlink is only for intra-system data-sharing. It doesn't help with our usual use case of having multiple people working on the same data on separate systems, systems that probably don't have a lot of space to spare.
My intuition, and possibly that second comment, tells me you should be setting annex.hardlink on the consumer repo, but the example in the first link shows setting it on the source repo, and so does this script from 8 days ago. That doesn't make any sense: how are the consumer repos picking up that information from the source repo? git clone doesn't let you access the per-repo .git/config file.
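A minimal sketch of that last point, assuming only plain git (no git-annex needed). The repository names `source` and `consumer` are made up for illustration, and annex.hardlink is used here purely as an arbitrary config key:

```shell
# Set a per-repo config value on a "source" repo, clone it,
# and check whether the clone inherited the setting.
tmp=$(mktemp -d)
git init -q "$tmp/source"
git -C "$tmp/source" config annex.hardlink true

# Cloning an empty repo only prints a warning, which is silenced here.
git clone -q "$tmp/source" "$tmp/consumer" 2>/dev/null

source_value=$(git -C "$tmp/source" config --local --get annex.hardlink)
# git config --get exits non-zero when the key is absent.
consumer_value=$(git -C "$tmp/consumer" config --local --get annex.hardlink || echo "unset")

echo "source: $source_value, consumer: $consumer_value"
```

The clone gets a fresh .git/config of its own, so a setting made on the source repo is not visible from the consumer side through plain git.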
I experimented on a realistic dataset during #30.
nguenther@data:~$ cd datasets/uk-biobank/
nguenther@data:~/datasets/uk-biobank$ git init
Initialized empty Git repository in /home/nguenther/datasets/uk-biobank/.git/
nguenther@data:~/datasets/uk-biobank$ cat > .gitattributes <<EOF
> *.nii filter=annex annex.largefiles=anything
> *.nii.gz filter=annex annex.largefiles=anything
> EOF
nguenther@data:~/datasets/uk-biobank$ git add README.txt && git commit -m "Initial commit"
[master (root-commit) 7ce586f] Initial commit
1 file changed, 1 insertion(+)
create mode 100755 README.txt
nguenther@data:~/datasets/uk-biobank$ git annex init
init (scanning for unlocked files...)
ok
(recording state in git...)
nguenther@data:~/datasets/uk-biobank$ git add .gitattributes && git commit -m "Configure git-annex"
[master db6053e] Configure git-annex
1 file changed, 2 insertions(+)
create mode 100644 .gitattributes
nguenther@data:~/datasets/uk-biobank$ time git add .
real 2m28.694s
user 1m3.502s
sys 0m21.155s
nguenther@data:~/datasets/uk-biobank$ time git commit -m "Migrate dataset from smb://duke/temp/uk_biobank_BIDS to git-annex.
>
> This dataset was curated by alex.foias@polymtl.ca out of data from the UK Biobank.
> "
(recording state in git...)
[master 95e0f07] Migrate dataset from smb://duke/temp/uk_biobank_BIDS to git-annex.
1190 files changed, 45857 insertions(+)
create mode 100755 dataset_description.json
[...]
create mode 100755 sub-1047359/anat/sub-1047359_T2w.nii.gz
real 0m2.360s
user 0m0.239s
sys 0m0.349s
nguenther@data:~/datasets/uk-biobank$ du -hs .
20G .
nguenther@data:~/datasets/uk-biobank$ du -hs .git/annex/
9.6G .git/annex/
nguenther@data:~/datasets/uk-biobank$ git init
Initialized empty Git repository in /home/nguenther/datasets/uk-biobank/.git/
nguenther@data:~/datasets/uk-biobank$ cat > .gitattributes <<EOF
> *.nii filter=annex annex.largefiles=anything
> *.nii.gz filter=annex annex.largefiles=anything
> EOF
nguenther@data:~/datasets/uk-biobank$ git add README.txt && git commit -m "Initial commit"
[master (root-commit) df31bbe] Initial commit
1 file changed, 1 insertion(+)
create mode 100755 README.txt
nguenther@data:~/datasets/uk-biobank$ git annex init
init (scanning for unlocked files...)
ok
(recording state in git...)
nguenther@data:~/datasets/uk-biobank$ git add .gitattributes && git commit -m "Configure git-annex"
[master 9a3087c] Configure git-annex
1 file changed, 2 insertions(+)
create mode 100644 .gitattributes
nguenther@data:~/datasets/uk-biobank$ git config annex.thin true
nguenther@data:~/datasets/uk-biobank$ time git add .
real 3m13.583s
user 1m16.148s
sys 0m14.944s
nguenther@data:~/datasets/uk-biobank$ du -hs .
9.6G .
nguenther@data:~/datasets/uk-biobank$ du -hs .git/annex/
9.6G .git/annex/
annex.thin has saved space by using hardlinks instead of copying (this is a bit hard to read; the first number is the inode, the "2" means "2 links", namely the two filenames listed there):
nguenther@data:~/datasets/uk-biobank$ ls -li {sub-1047359/*/*,.git/annex/objects/*/*/*/*} | sort | grep 2360806
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23 2018 .git/annex/objects/w0/V1/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23 2018 sub-1047359/anat/sub-1047359_T1w.nii.gz
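For anyone unfamiliar with hard links, here is a small sketch of what that listing means, using throwaway files rather than the dataset. The file names `original` and `alias` are hypothetical; the link count is read from the second column of `ls -l`:

```shell
# Two names, one inode: a hard link shares the same data on disk.
tmp=$(mktemp -d)
printf 'payload\n' > "$tmp/original"
ln "$tmp/original" "$tmp/alias"

# Second column of `ls -l` is the number of names pointing at the inode.
links=$(ls -l "$tmp/original" | awk '{print $2}')
echo "link count: $links"

# Writing through one name changes what the other name reads back.
printf 'edit\n' >> "$tmp/alias"
lines=$(wc -l < "$tmp/original" | tr -d ' ')
echo "lines visible through the other name: $lines"
```

This is why the space saving comes with a catch: there is only one copy, so an edit made through either name is an edit to both.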
But as explained in https://git-annex.branchable.com/bugs/annex.hardlink_is_not___34__in_effect__34___in_thin_mode_/#comment-1e59307b5219485f034f29121e8378e6, what happens if we mess with the files?
Consider:
nguenther@data:~/datasets/uk-biobank$ ls -li {sub-1047359/*/*,.git/annex/objects/*/*/*/*} | sort | grep 2360806
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23 2018 .git/annex/objects/w0/V1/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23 2018 sub-1047359/anat/sub-1047359_T1w.nii.gz
nguenther@data:~/datasets/uk-biobank$ sha256sum sub-1047359/anat/sub-1047359_T1w.nii.gz
eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf sub-1047359/anat/sub-1047359_T1w.nii.gz
If we edit it directly, the file is changed but the name it has in .git/annex/objects has the wrong hash in it:
nguenther@data:~/datasets/uk-biobank$ echo "ohnoes" >> sub-1047359/anat/sub-1047359_T1w.nii.gz
nguenther@data:~/datasets/uk-biobank$ sha256sum sub-1047359/anat/sub-1047359_T1w.nii.gz
9d8217a88f8e8e76a1b79107bac9c8abc32a0a6312afc2920d9becf481cea3c3 sub-1047359/anat/sub-1047359_T1w.nii.gz
nguenther@data:~/datasets/uk-biobank$ ls -li {sub-1047359/*/*,.git/annex/objects/*/*/*/*} | sort | grep 2360806
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23 2018 .git/annex/objects/w0/V1/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23 2018 sub-1047359/anat/sub-1047359_T1w.nii.gz
But git detects it as modified correctly:
nguenther@data:~/datasets/uk-biobank$ git status
Refresh index: 100% (1192/1192), done.
On branch master
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: dataset_description.json
new file: participants.json
new file: participants.tsv
new file: sub-1000032/anat/sub-1000032_T1w.json
new file: sub-1000032/anat/sub-1000032_T1w.nii.gz
[...]
new file: sub-1047359/anat/sub-1047359_T2w.json
new file: sub-1047359/anat/sub-1047359_T2w.nii.gz
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: sub-1047359/anat/sub-1047359_T1w.nii.gz
And a simple git add sees git-annex detect and fix the inconsistency, renaming the .git/annex/objects file to have "9d8217a88..." in its name. This would seem to contradict this warning.
nguenther@data:~/datasets/uk-biobank$ git add sub-1047359/anat/sub-1047359_T1w.nii.gz
nguenther@data:~/datasets/uk-biobank$ ls -li {sub-1047359/*/*,.git/annex/objects/*/*/*/*} | sort | grep 2360806
2360806 -rwxr-xr-x 2 nguenther nguenther 17619359 Dec 10 05:29 .git/annex/objects/gw/1q/SHA256E-s17619359--9d8217a88f8e8e76a1b79107bac9c8abc32a0a6312afc2920d9becf481cea3c3.nii.gz/SHA256E-s17619359--9d8217a88f8e8e76a1b79107bac9c8abc32a0a6312afc2920d9becf481cea3c3.nii.gz
2360806 -rwxr-xr-x 2 nguenther nguenther 17619359 Dec 10 05:29 sub-1047359/anat/sub-1047359_T1w.nii.gz
nguenther@data:~/datasets/uk-biobank$ sha256sum sub-1047359/anat/sub-1047359_T1w.nii.gz
9d8217a88f8e8e76a1b79107bac9c8abc32a0a6312afc2920d9becf481cea3c3 sub-1047359/anat/sub-1047359_T1w.nii.gz
Reset:
nguenther@data:~/datasets/uk-biobank$ cd ../..
nguenther@data:~$ time rsync -av /mnt/duke/temp/uk_biobank_BIDS/ datasets/uk-biobank
sending incremental file list
./
sub-1047359/anat/sub-1047359_T1w.nii.gz
sent 17,672,368 bytes received 748 bytes 7,069,246.40 bytes/sec
total size is 10,209,520,821 speedup is 577.69
real 0m2.034s
user 0m0.036s
sys 0m0.157s
nguenther@data:~$ cd datasets/uk-biobank/
nguenther@data:~/datasets/uk-biobank$ sudo rm -rf .git
[sudo] password for nguenther:
nguenther@data:~/datasets/uk-biobank$ du -hs .
9.6G .
nguenther@data:~/datasets/uk-biobank$ git init
Initialized empty Git repository in /home/nguenther/datasets/uk-biobank/.git/
nguenther@data:~/datasets/uk-biobank$ cat > .gitattributes <<EOF
> *.nii filter=annex annex.largefiles=anything
> *.nii.gz filter=annex annex.largefiles=anything
> EOF
nguenther@data:~/datasets/uk-biobank$ git add README.txt && git commit -m "Initial commit"
[master (root-commit) 69d0741] Initial commit
1 file changed, 1 insertion(+)
create mode 100755 README.txt
nguenther@data:~/datasets/uk-biobank$ git annex init
init (scanning for unlocked files...)
ok
(recording state in git...)
nguenther@data:~/datasets/uk-biobank$ git add .gitattributes && git commit -m "Configure git-annex"
[master 8337921] Configure git-annex
1 file changed, 2 insertions(+)
create mode 100644 .gitattributes
nguenther@data:~/datasets/uk-biobank$ git config annex.hardlink true
nguenther@data:~/datasets/uk-biobank$ time git add .
real 2m18.590s
user 1m1.055s
sys 0m21.077s
nguenther@data:~/datasets/uk-biobank$ du -hs .
20G .
nguenther@data:~/datasets/uk-biobank$ du -hs .git/annex/objects/
9.6G .git/annex/objects/
Right. Well. There's my answer then. annex.hardlink does nothing.
Did it ever? Maybe it did something in v7 but not in v8?
I thought of a way annex.thin can corrupt data. http://web.archive.org/web/20201210105827/https://git-annex.branchable.com/bugs/annex.thin_can_cause_corrupt___40__not_just_missing__41___data/ doesn't explain this (because it dismisses it as "just missing data") (and maybe it was written with v7 in mind??). Apparently git-annex is smart enough to know how to rename the internal .git/annex/objects file path when its content changes. But here's what can go wrong:
git add annexed-file
git commit -m "v1"
edit annexed-file
git add annexed-file
git commit -m "v2"
edit annexed-file
git add annexed-file
git commit -m "v3"
git annex sync --content
edit annexed-file
git add annexed-file
git commit -m "v4"
git checkout HEAD~2 # annexed-file is irrecoverable here
The trouble is that when v2 was made, it overwrote the v1 file in place, and then the v3 file overwrote the v2 file. Neither v1 nor v2 exists anymore; their hashes exist, and were committed to the repo, but that's it. v3 is recoverable, but it will have to be redownloaded from whatever remotes got a copy of it.
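The failure mode can be sketched without git-annex at all. This is only a simulation under stated assumptions: a made-up `objects` directory stands in for .git/annex/objects, the object is hard-linked to the working file and renamed after each edit (mirroring the rename behaviour observed above), and `sha256sum` from GNU coreutils is assumed to be available:

```shell
# Simulation only: plain files and hard links, not git-annex itself.
tmp=$(mktemp -d)
mkdir "$tmp/objects"
work="$tmp/annexed-file"
: > "$work"

keys=""
prev=""
for version in v1 v2 v3; do
    printf '%s\n' "$version" >> "$work"          # edit the file in place
    key=$(sha256sum "$work" | cut -d' ' -f1)     # the key a "commit" would record
    if [ -n "$prev" ]; then
        mv "$tmp/objects/$prev" "$tmp/objects/$key"   # object renamed to the new key
    else
        ln "$work" "$tmp/objects/$key"                # first add: hard link into the store
    fi
    keys="$keys $key"
    prev=$key
done

# Count how many recorded keys no longer have an object behind them.
missing=0
for key in $keys; do
    [ -e "$tmp/objects/$key" ] || missing=$((missing + 1))
done
echo "recorded keys: 3, objects missing: $missing"
```

In this toy model two of the three recorded keys end up with no content behind them, which is the same shape as v1 and v2 being lost while only the latest version survives locally.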
I don't think this is a big deal for the way we use this stuff. Just something to watch out for.
I think I understand these now, and my recommendation is:
git config --global annex.thin true
git config --global annex.hardlink false
I've added this to the docs.
In git-annex v8, the default config checks out annexed files as full files, git-lfs style, instead of as symlinks like in v7 and earlier. But if .git/annex also stores a copy of all the data then users are doubling their storage for nothing. (Note: this does not double the storage on a server; servers only keep bare git repos, without a checked out copy.) There are two options, 'annex.hardlink' and 'annex.thin', but I can't tell what they do. It sounds like they should avoid the duplication, but if so why are there two of them? Are they mutually exclusive? What happens on Windows?
I found this thread where some of the people from datalad seem to be equally confused: https://git-annex.branchable.com/bugs/annex.hardlink_is_not___34__in_effect__34___in_thin_mode_/.
In git-annex v7 there was the "adjusted" branch which I think was meant to accomplish the same goal?