neuropoly / data-management

Repo that deals with datalad aspects for internal use

Investigate annex.thin and annex.hardlink #23

Closed: kousu closed this issue 3 years ago

kousu commented 3 years ago

In git-annex v8, the default config checks out annexed files as full copies of their content, git-lfs style, instead of as symlinks like in v7 and earlier. But if .git/annex also stores a copy of all the data, then users are doubling their storage for nothing. (Note: this does not double the storage on a server; servers only keep bare git repos, without a checked-out copy.)

There are two options, annex.hardlink and annex.thin, but I can't tell what they do. It sounds like they should avoid the duplication, but if so, why are there two of them? Are they mutually exclusive? What happens on Windows?

I found this thread where some of the people from datalad seem to be equally confused: https://git-annex.branchable.com/bugs/annex.hardlink_is_not___34__in_effect__34___in_thin_mode_/.

In git-annex v7 there was the "adjusted" branch which I think was meant to accomplish the same goal?

kousu commented 3 years ago

Also, if either of these does help, can we set it globally? Does git annex config let us manipulate the configs of users' clones? If it works like git config then the answer is no, because git config is always local to each copy of each repo, but I suspect git annex config is special.

If not, then we'll have to document around this.
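
The plain git config half of that question can be checked without git-annex at all. This is only a sketch: the directory names src and clone are placeholders, and annex.thin is set here as an ordinary git config key. It shows that a value in .git/config does not travel through git clone; it says nothing about what git annex config can or cannot store.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Source repo with annex.thin set in its local .git/config
git init -q src
git -C src config annex.thin true
git -C src -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "Initial commit"

# A clone gets the history, not the source's .git/config
git clone -q src clone

src_value=$(git -C src config --local --get annex.thin)
# "git config --get" exits non-zero for a missing key, hence the "|| true"
clone_value=$(git -C clone config --local --get annex.thin || true)

echo "src:   annex.thin=${src_value}"
echo "clone: annex.thin=${clone_value:-<unset>}"
```

So anything that has to reach every user's clone needs either documentation or a mechanism that lives inside the repository's history.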

I imagine setting annex.thin everywhere is one of the many little fix-its that datalad does without telling you to work around git-annex being flaky.

kousu commented 3 years ago

Useful note, buried in an out of the way place in the docs: https://git-annex.branchable.com/git-annex-adjust/

To use less space, annex.thin can be set to true before running this command; this makes a hard link to the content be made instead of a copy. (When supported by the file system.) While this can save considerable disk space, any modification made to a file will cause the old version of the file to be lost from the local repository. So, enable annex.thin with care.

But I'm pretty sure this is really old advice; https://git-annex.branchable.com/design/adjusted_branches/ explains that git-annex-adjust was designed for "Using a v6 repo with locked files on a crippled filesystem not supporting symlinks."

kousu commented 3 years ago

5 months ago, tagged "needs thought": https://git-annex.branchable.com/todo/annex.thin_without_hardlinks/

Would it be possible to make it work w/o hard links? Note that direct mode does avoid two copies of files.

(that should say 'avoided'; direct mode was removed in git-annex v7)

kousu commented 3 years ago

https://git-annex.branchable.com/tips/local_caching_of_annexed_files/ keeps getting linked from all these other threads, because the proposed solution involves setting up inter-repo hardlinks (similar to git clone -l, aka how GitHub does forks: they're not really forks at all). In particular, the page (which I think means its author, joeyh) explains:

But, using a git repository lets annex.hardlink be used to make hard links between the cache and repositories using it.

https://git-annex.branchable.com/forum/Sharing_annex_with_local_clones/#comment-6ddacf0478e4f764394ebc45dc191b1a says

annex.hardlink can be set to true, and then git annex get will simply hardlink the files into place.

So: annex.hardlink is only for intra-system data-sharing. It doesn't help with our usual use case of having multiple people working on the same data on separate systems, systems that probably don't have a lot of space to spare.

My intuition, and possibly that second comment, tells me you should be setting annex.hardlink on the consumer repo, but the example in the first link shows setting it on the source repo, and so does this script from 8 days ago. Which doesn't make any sense: how are the consumer repos picking up that information from the source repo? git clone doesn't let you access the per-repo .git/config file.

kousu commented 3 years ago

I experimented on a realistic dataset during #30.

plain

nguenther@data:~$ cd datasets/uk-biobank/
nguenther@data:~/datasets/uk-biobank$ git init
Initialized empty Git repository in /home/nguenther/datasets/uk-biobank/.git/
nguenther@data:~/datasets/uk-biobank$ cat > .gitattributes <<EOF
> *.nii     filter=annex annex.largefiles=anything
> *.nii.gz  filter=annex annex.largefiles=anything
> EOF
nguenther@data:~/datasets/uk-biobank$ git add README.txt && git commit -m "Initial commit"
[master (root-commit) 7ce586f] Initial commit
 1 file changed, 1 insertion(+)
 create mode 100755 README.txt
nguenther@data:~/datasets/uk-biobank$ git annex init
init  (scanning for unlocked files...)
ok
(recording state in git...)
nguenther@data:~/datasets/uk-biobank$ git add .gitattributes && git commit -m "Configure git-annex"
[master db6053e] Configure git-annex
 1 file changed, 2 insertions(+)
 create mode 100644 .gitattributes
nguenther@data:~/datasets/uk-biobank$ time git add .

real    2m28.694s
user    1m3.502s
sys     0m21.155s
nguenther@data:~/datasets/uk-biobank$ time git commit -m "Migrate dataset from smb://duke/temp/uk_biobank_BIDS to git-annex.
> 
> This dataset was curated by alex.foias@polymtl.ca out of data from the UK Biobank.
> "
(recording state in git...)
[master 95e0f07] Migrate dataset from smb://duke/temp/uk_biobank_BIDS to git-annex.
 1190 files changed, 45857 insertions(+)
 create mode 100755 dataset_description.json
 [...]
 create mode 100755 sub-1047359/anat/sub-1047359_T2w.nii.gz

real    0m2.360s
user    0m0.239s
sys     0m0.349s
nguenther@data:~/datasets/uk-biobank$ du -hs .
20G     .
nguenther@data:~/datasets/uk-biobank$ du -hs .git/annex/
9.6G    .git/annex/

annex.thin

nguenther@data:~/datasets/uk-biobank$ git init
Initialized empty Git repository in /home/nguenther/datasets/uk-biobank/.git/
nguenther@data:~/datasets/uk-biobank$ cat > .gitattributes <<EOF
> *.nii     filter=annex annex.largefiles=anything
> *.nii.gz  filter=annex annex.largefiles=anything
> EOF
nguenther@data:~/datasets/uk-biobank$ git add README.txt && git commit -m "Initial commit"
[master (root-commit) df31bbe] Initial commit
 1 file changed, 1 insertion(+)
 create mode 100755 README.txt
nguenther@data:~/datasets/uk-biobank$ git annex init
init  (scanning for unlocked files...)
ok
(recording state in git...)
nguenther@data:~/datasets/uk-biobank$ git add .gitattributes && git commit -m "Configure git-annex"
[master 9a3087c] Configure git-annex
 1 file changed, 2 insertions(+)
 create mode 100644 .gitattributes
nguenther@data:~/datasets/uk-biobank$ git config annex.thin true
nguenther@data:~/datasets/uk-biobank$ time git add .

real    3m13.583s
user    1m16.148s
sys     0m14.944s
nguenther@data:~/datasets/uk-biobank$ du -hs .
9.6G    .
nguenther@data:~/datasets/uk-biobank$ du -hs .git/annex/
9.6G    .git/annex/

annex.thin has saved space by using hardlinks instead of copies. The listing is a bit hard to read: the first column is the inode number, and the "2" after the permissions is the link count, namely the two filenames listed there:

nguenther@data:~/datasets/uk-biobank$ ls -li {sub-1047359/*/*,.git/annex/objects/*/*/*/*} | sort | grep 2360806
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23  2018 .git/annex/objects/w0/V1/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23  2018 sub-1047359/anat/sub-1047359_T1w.nii.gz
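
The same check can be reproduced in isolation. This is a minimal sketch using only coreutils, with GNU stat assumed for the -c option; the store/ and tree/ directories and the file names are made up, standing in for .git/annex/objects and the working tree.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"
mkdir -p store tree

# 1 MiB file standing in for an annexed object
dd if=/dev/zero of=store/object bs=1024 count=1024 2>/dev/null

ln store/object tree/thin.nii.gz   # hard link: one inode, two names
cp store/object tree/copy.nii.gz   # copy: a second inode, data stored twice

inode_of() { stat -c %i "$1"; }    # inode number (GNU stat)
links_of() { stat -c %h "$1"; }    # hard link count (GNU stat)

if [ "$(inode_of store/object)" = "$(inode_of tree/thin.nii.gz)" ]; then
    same_inode=yes
else
    same_inode=no
fi
thin_links=$(links_of tree/thin.nii.gz)
copy_links=$(links_of tree/copy.nii.gz)

echo "thin shares inode with object: $same_inode (links: $thin_links)"
echo "copy link count: $copy_links"
# Within one invocation du counts each inode once, so the hard link adds nothing
du -sk store tree
```

This is why du -hs . drops from 20G to 9.6G in the annex.thin run: the working-tree names and the .git/annex/objects names point at the same inodes.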

But, as raised in https://git-annex.branchable.com/bugs/annex.hardlink_is_not___34__in_effect__34___in_thin_mode_/#comment-1e59307b5219485f034f29121e8378e6, what happens if we mess with the files?

Consider:

nguenther@data:~/datasets/uk-biobank$ ls -li {sub-1047359/*/*,.git/annex/objects/*/*/*/*} | sort | grep 2360806
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23  2018 .git/annex/objects/w0/V1/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23  2018 sub-1047359/anat/sub-1047359_T1w.nii.gz

nguenther@data:~/datasets/uk-biobank$ sha256sum sub-1047359/anat/sub-1047359_T1w.nii.gz
eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf  sub-1047359/anat/sub-1047359_T1w.nii.gz

If we edit it directly, the file is changed but the name it has in .git/annex/objects has the wrong hash in it:

nguenther@data:~/datasets/uk-biobank$ echo "ohnoes" >> sub-1047359/anat/sub-1047359_T1w.nii.gz

nguenther@data:~/datasets/uk-biobank$ sha256sum sub-1047359/anat/sub-1047359_T1w.nii.gz
9d8217a88f8e8e76a1b79107bac9c8abc32a0a6312afc2920d9becf481cea3c3  sub-1047359/anat/sub-1047359_T1w.nii.gz

nguenther@data:~/datasets/uk-biobank$ ls -li {sub-1047359/*/*,.git/annex/objects/*/*/*/*} | sort | grep 2360806
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23  2018 .git/annex/objects/w0/V1/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz/SHA256E-s17619352--eaf7cfe8f31880d2a93adc798a0fb606a0406a6fcf0b6c1dead6019e18f4f2cf.nii.gz
2360806 -rwxr-xr-x 2 nguenther nguenther 17619352 May 23  2018 sub-1047359/anat/sub-1047359_T1w.nii.gz

But git correctly detects it as modified:

nguenther@data:~/datasets/uk-biobank$ git status
Refresh index: 100% (1192/1192), done.
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    new file:   dataset_description.json
    new file:   participants.json
    new file:   participants.tsv
    new file:   sub-1000032/anat/sub-1000032_T1w.json
    new file:   sub-1000032/anat/sub-1000032_T1w.nii.gz
        [...]
    new file:   sub-1047359/anat/sub-1047359_T2w.json
    new file:   sub-1047359/anat/sub-1047359_T2w.nii.gz

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   sub-1047359/anat/sub-1047359_T1w.nii.gz

And a simple git add makes git-annex detect and fix the inconsistency, renaming the .git/annex/objects file to have "9d8217a88..." in its name. This would seem to contradict this warning.

nguenther@data:~/datasets/uk-biobank$ git add sub-1047359/anat/sub-1047359_T1w.nii.gz

nguenther@data:~/datasets/uk-biobank$ ls -li {sub-1047359/*/*,.git/annex/objects/*/*/*/*} | sort | grep 2360806
2360806 -rwxr-xr-x 2 nguenther nguenther 17619359 Dec 10 05:29 .git/annex/objects/gw/1q/SHA256E-s17619359--9d8217a88f8e8e76a1b79107bac9c8abc32a0a6312afc2920d9becf481cea3c3.nii.gz/SHA256E-s17619359--9d8217a88f8e8e76a1b79107bac9c8abc32a0a6312afc2920d9becf481cea3c3.nii.gz
2360806 -rwxr-xr-x 2 nguenther nguenther 17619359 Dec 10 05:29 sub-1047359/anat/sub-1047359_T1w.nii.gz

nguenther@data:~/datasets/uk-biobank$ sha256sum sub-1047359/anat/sub-1047359_T1w.nii.gz
9d8217a88f8e8e76a1b79107bac9c8abc32a0a6312afc2920d9becf481cea3c3  sub-1047359/anat/sub-1047359_T1w.nii.gz

annex.hardlink

Reset:

nguenther@data:~/datasets/uk-biobank$ cd ../..
nguenther@data:~$ time rsync -av /mnt/duke/temp/uk_biobank_BIDS/ datasets/uk-biobank
sending incremental file list
./
sub-1047359/anat/sub-1047359_T1w.nii.gz

sent 17,672,368 bytes  received 748 bytes  7,069,246.40 bytes/sec
total size is 10,209,520,821  speedup is 577.69

real    0m2.034s
user    0m0.036s
sys     0m0.157s
nguenther@data:~$ cd datasets/uk-biobank/
nguenther@data:~/datasets/uk-biobank$ sudo rm -rf .git
[sudo] password for nguenther: 
nguenther@data:~/datasets/uk-biobank$ du -hs .
9.6G    .
nguenther@data:~/datasets/uk-biobank$ git init
Initialized empty Git repository in /home/nguenther/datasets/uk-biobank/.git/
nguenther@data:~/datasets/uk-biobank$ cat > .gitattributes <<EOF
> *.nii     filter=annex annex.largefiles=anything
> *.nii.gz  filter=annex annex.largefiles=anything
> EOF
nguenther@data:~/datasets/uk-biobank$ git add README.txt && git commit -m "Initial commit"
[master (root-commit) 69d0741] Initial commit
 1 file changed, 1 insertion(+)
 create mode 100755 README.txt
nguenther@data:~/datasets/uk-biobank$ git annex init
init  (scanning for unlocked files...)
ok
(recording state in git...)
nguenther@data:~/datasets/uk-biobank$ git add .gitattributes && git commit -m "Configure git-annex"
[master 8337921] Configure git-annex
 1 file changed, 2 insertions(+)
 create mode 100644 .gitattributes
nguenther@data:~/datasets/uk-biobank$ git config annex.hardlink true
nguenther@data:~/datasets/uk-biobank$ time git add .

real    2m18.590s
user    1m1.055s
sys     0m21.077s

nguenther@data:~/datasets/uk-biobank$ du -hs .
20G     .
nguenther@data:~/datasets/uk-biobank$ du -hs .git/annex/objects/
9.6G    .git/annex/objects/

Right. Well. There's my answer then. annex.hardlink does nothing.

Did it ever? Maybe it did something in v7 but not in v8?

kousu commented 3 years ago

I thought of a way annex.thin can corrupt data. http://web.archive.org/web/20201210105827/https://git-annex.branchable.com/bugs/annex.thin_can_cause_corrupt___40__not_just_missing__41___data/ doesn't explain this, because it dismisses it as "just missing data" (and maybe it was written with v7 in mind?). Apparently git-annex is smart enough to rename the internal .git/annex/objects file path when its content changes. But here's what can go wrong:

git add annexed-file
git commit -m "v1"

edit annexed-file
git add annexed-file
git commit -m "v2"

edit annexed-file
git add annexed-file
git commit -m "v3"
git annex sync --content

edit annexed-file
git add annexed-file
git commit -m "v4"

git checkout HEAD~2  # back to v2; annexed-file is irrecoverable here

The trouble is that when v2 was made, it overwrote the v1 file in place, and then the v3 file overwrote the v2 file. Neither v1 nor v2 exists anymore; their hashes exist, and were committed to the repo, but that's it. v3 is recoverable only because git annex sync --content ran before v4 overwrote it locally, so it will have to be redownloaded from whatever remotes got a copy of it.
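
The overwrite-in-place behaviour can be modelled without git-annex. In this rough sketch, thin_add is a hypothetical stand-in for what git add appeared to do in the annex.thin experiment (same inode, renamed to the new hash); it is not git-annex's actual implementation. GNU coreutils and a find that supports -samefile are assumed.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"
mkdir objects

# Hypothetical model of "git add" under annex.thin: the object store holds a
# hard link to the working file, named after the file's current hash.
thin_add() {
    key=$(sha256sum "$1" | cut -d' ' -f1)
    # Any older name for this inode now mismatches its content: drop it
    find objects -samefile "$1" -exec rm -f {} +
    ln "$1" "objects/$key"
    echo "$key"
}

# ">" truncates and rewrites the same inode, i.e. an in-place edit
echo "version 1" > annexed-file
v1=$(thin_add annexed-file)
echo "version 2" > annexed-file
v2=$(thin_add annexed-file)
echo "version 3" > annexed-file
v3=$(thin_add annexed-file)

remaining=$(ls objects | wc -l)
echo "objects kept locally: $remaining"
for key in "$v1" "$v2" "$v3"; do
    if [ -e "objects/$key" ]; then echo "present: $key"; else echo "lost:    $key"; fi
done
```

Only the newest key survives locally; the older hashes are recorded but have no content behind them, which is the situation the checkout runs into. An editor that saves by writing a new file and renaming it over the old one would break the hard link instead, so real behaviour depends on how the file is modified.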

I don't think this is a big deal for the way we use this stuff. Just something to watch out for.

kousu commented 3 years ago

I think I understand these now, and my recommendation is:

git config --global annex.thin true
git config --global annex.hardlink false

I've added this to the docs.
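
To try the recommendation without touching a real ~/.gitconfig, something like the following should work. It is a sketch that points HOME at a throwaway directory, which is purely an illustration device, and only confirms that the two values read back as written.

```shell
#!/bin/sh
set -eu
# Throwaway HOME so the real ~/.gitconfig is left alone
HOME=$(mktemp -d)
export HOME
unset XDG_CONFIG_HOME

git config --global annex.thin true
git config --global annex.hardlink false

thin=$(git config --global --get annex.thin)
hardlink=$(git config --global --get annex.hardlink)
echo "annex.thin=$thin annex.hardlink=$hardlink"
```

If I read the git-annex man page correctly, a repository that already has annexed files may also need git annex fix after annex.thin is changed; that step is not covered here.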