neuropoly / data-management

Repo that deals with datalad aspects for internal use

`git-annex` leaves cruft remotes around #71

Open kousu opened 3 years ago

kousu commented 3 years ago

As far as I can tell, git-annex has three components: a partial-download system, a set of plugins for handling different kinds of URLs, and a content-tracking system built on top of the two. I think the content-tracking system is a source of a lot of grief for us (e.g. #67). Another source of grief is that datasets implicitly record their own location (user, hostname and path) whenever they are installed anywhere, even if only temporarily. If such a copy is ever synced back, that record infects the root dataset without ever going through a pull request.

For example: https://github.com/spine-generic/data-multi-subject/pull/77#issuecomment-818980337

$ git annex whereis sub-ucdavis06/
whereis sub-ucdavis06/anat/sub-ucdavis06_T1w.nii.gz (1 copy) 
    56bbd6c5-a147-4940-bf73-212f50841743 -- alex@NeuroPoly-MacBook-Pro.local:~/data/data-multi-subject
ok
whereis sub-ucdavis06/anat/sub-ucdavis06_T2star.nii.gz (1 copy) 
    56bbd6c5-a147-4940-bf73-212f50841743 -- alex@NeuroPoly-MacBook-Pro.local:~/data/data-multi-subject
ok
whereis sub-ucdavis06/anat/sub-ucdavis06_T2w.nii.gz (1 copy) 
    56bbd6c5-a147-4940-bf73-212f50841743 -- alex@NeuroPoly-MacBook-Pro.local:~/data/data-multi-subject
ok
whereis sub-ucdavis06/anat/sub-ucdavis06_acq-MToff_MTS.nii.gz (1 copy) 
    56bbd6c5-a147-4940-bf73-212f50841743 -- alex@NeuroPoly-MacBook-Pro.local:~/data/data-multi-subject
ok
whereis sub-ucdavis06/anat/sub-ucdavis06_acq-MTon_MTS.nii.gz (1 copy) 
    56bbd6c5-a147-4940-bf73-212f50841743 -- alex@NeuroPoly-MacBook-Pro.local:~/data/data-multi-subject
ok
whereis sub-ucdavis06/dwi/sub-ucdavis06_dwi.nii.gz (1 copy) 
    56bbd6c5-a147-4940-bf73-212f50841743 -- alex@NeuroPoly-MacBook-Pro.local:~/data/data-multi-subject
ok

I will never be able to connect to Alex's MacBook Pro, so this is a useless piece of information. Keeping it around makes merges harder to handle and makes the data harder to parse through.

You can see this in other published datasets too. For example, anything on openneuro:

[kousu@requiem spine-generic]$ git clone https://github.com/openneurodatasets/ds003017/
Cloning into 'ds003017'...
remote: Enumerating objects: 4706, done.
remote: Counting objects: 100% (4706/4706), done.
remote: Compressing objects: 100% (3054/3054), done.
remote: Total 4706 (delta 618), reused 4675 (delta 587), pack-reused 0
Receiving objects: 100% (4706/4706), 2.18 MiB | 2.56 MiB/s, done.
Resolving deltas: 100% (618/618), done.
[kousu@requiem spine-generic]$ cd ds003017/
[kousu@requiem ds003017]$ git annex init
init  (scanning for unlocked files...)

  Remote origin not usable by git-annex; setting annex-ignore

  https://github.com/openneurodatasets/ds003017//config download failed: Not Found
(Auto enabling special remote s3-PUBLIC...)
ok
(recording state in git...)
[kousu@requiem ds003017]$ git annex whereis 
whereis code/README.md (2 copies) 
    4b5d5381-6f81-4aac-85b3-1f1e73bf4f34 -- root@openneuro-prod-dataset-worker-2:/datasets/ds003017
    6ae62754-b62b-4285-bc0c-19e1e4445dbd -- [s3-PUBLIC]

  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds003017/code/README.md?versionId=W8MkVFf.JCepGgDkTYWbJ.dF_RiWFfGl
  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds003017/code/README.md?versionId=WqxKw5itAG80nC0ppGPn5k82CI3dE_s2
ok
whereis code/add_intendedfor.py (3 copies) 
    4b5d5381-6f81-4aac-85b3-1f1e73bf4f34 -- root@openneuro-prod-dataset-worker-2:/datasets/ds003017
    6ae62754-b62b-4285-bc0c-19e1e4445dbd -- [s3-PUBLIC]
    e38871b9-5bdf-4ab8-917b-7240302b9267 -- s3-PRIVATE

  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds003017/code/add_intendedfor.py?versionId=APHOfZPZvBDFNnwJaD3sRznOZBOrLbfn
ok
whereis code/add_intendedfor_all.sh (3 copies) 
    4b5d5381-6f81-4aac-85b3-1f1e73bf4f34 -- root@openneuro-prod-dataset-worker-2:/datasets/ds003017
    6ae62754-b62b-4285-bc0c-19e1e4445dbd -- [s3-PUBLIC]
    e38871b9-5bdf-4ab8-917b-7240302b9267 -- s3-PRIVATE

  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds003017/code/add_intendedfor_all.sh?versionId=Ks1Q3md9Tiop3G0I9vLZ6UwyNhx3cgeX
ok
whereis sub-sid000005/anat/sub-sid000005_acq-MPRAGE_T1w.nii.gz (3 copies) 
    4b5d5381-6f81-4aac-85b3-1f1e73bf4f34 -- root@openneuro-prod-dataset-worker-2:/datasets/ds003017
    6ae62754-b62b-4285-bc0c-19e1e4445dbd -- [s3-PUBLIC]
    e38871b9-5bdf-4ab8-917b-7240302b9267 -- s3-PRIVATE

  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds003017/sub-sid000005/anat/sub-sid000005_acq-MPRAGE_T1w.nii.gz?versionId=fGDE6f7cgi7OQEL0zKNyQYEgz2Gi1h2A
[...]

That openneuro-prod-dataset-worker-2 is an ephemeral build bot; no one except OpenNeuro will ever be able to access it, so it should not be published, yet it is.

I've been recommending

git annex dead here

to work around this. The only copies that shouldn't be marked dead are the ones on data.neuro.polymtl.ca and the ones on Amazon, and those get set automatically. Every working copy should be marked dead, IMO.
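
For anyone who wants to make this a habit, here is a minimal sketch of the workaround as a shell function. The function name `mark_working_copy_dead` and its optional directory argument are made up for illustration; the only git-annex command it runs is the `git annex dead here` from above, which tells git-annex to treat the current repository as irretrievably lost.

```shell
# Hypothetical helper (not part of git-annex or datalad): mark the git-annex
# repository in the given directory (default: current directory) as dead.
mark_working_copy_dead() {
    repo_dir="${1:-.}"
    # `git -C DIR` runs the subcommand as if started in DIR;
    # "here" is git-annex's name for the current repository.
    git -C "$repo_dir" annex dead here
}
```

Run it once in each fresh working copy, e.g. `mark_working_copy_dead ~/data/data-multi-subject`. Per the git-annex docs, `git annex semitrust here` should undo it if a copy was marked by mistake.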

git-annex was designed mainly as a personal Dropbox-like system, to corral many disks and cloud accounts into one big meta filesystem, whereas we're using it like we use the rest of git, with collaboration and forking, and these two models don't mesh well.

kousu commented 2 years ago

@mguaypaq pointed out today that these cruft remotes aren't just messy, they're also a re-identification risk: because git-annex records each remote's user@hostname along with a timestamp, it might be possible to retrace who touched each subject and work out who they had scanned.
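
To get a rough idea of how much of this is exposed in a given dataset, something like the following sketch could help. It only filters text in the format of the `git annex whereis` output pasted above; the function name `list_host_remotes` is made up, and the pattern assumes remote lines look like `<uuid> -- user@host:path`, so remotes that have been given a custom description will be missed.

```shell
# Hypothetical helper: read `git annex whereis` output on stdin and print the
# unique "user@host:path" descriptions. Special remotes such as [s3-PUBLIC]
# contain no "@" and are therefore skipped.
list_host_remotes() {
    # Remote lines are indented and start with a 36-character UUID.
    sed -n -E 's/^[[:space:]]+[0-9a-f-]{36} -- ([^[:space:]@]+@[^[:space:]:]+:.*)$/\1/p' | sort -u
}
```

Usage: `git annex whereis | list_host_remotes`. This does not look at the timestamps, which as far as I know are only stored on the git-annex branch.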