neuropoly / data-management

Repo that deals with datalad aspects for internal use

Don't recommend `git annex sync` #64

Open kousu opened 3 years ago

kousu commented 3 years ago

git annex sync has a UI inconsistent with the rest of git. It is omnivorous by default: it syncs bidirectionally and with all remotes. In the process it creates a plethora of synced/* branches as a workaround for a weakness in git, and it touches all branches, not just the working branch. This causes performance problems (e.g. https://github.com/neuropoly/data-management/issues/26) as well as confusion and bugs when trying to work with a pull-request workflow.

The basic problem, IMO, is that the git-annex metadata is kept in a separate branch shared by all branches, instead of being kept in, say, a hidden .annex subfolder as part of each branch, which means the usual git algorithm can't handle it.

Some alternatives:

kousu commented 3 years ago

Maybe sync is okay, but only if we confine it to the git-annex branch. There's an option to do this:

git config --global annex.synconlyannex true

Then uploading would become:

git push  # sync .git/objects
git annex sync --content # sync .git/annex/objects + git-annex branch to track them

kousu commented 3 years ago

Another possible alternative, I think, is remote.<remote>.annex-speculate-present. This should remove the need for daily updates to the git-annex branch ( https://github.com/neuropoly/data-management/issues/67#issuecomment-819051308 ).
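
For illustration, here is a minimal sketch of turning that setting on for one remote. It runs in a throwaway repository; the remote name `origin` and its URL are placeholders I made up, not our real server, and I have not verified that this setting alone covers our workflow.

```shell
#!/bin/sh
# Sketch only: set annex-speculate-present on a remote in a scratch repo.
set -eu

repo=$(mktemp -d)   # scratch repository, not a real dataset
cd "$repo"
git init -q .

# "origin" and this URL are placeholders (assumptions for the example).
git remote add origin git@example.invalid:datasets/my-new-repo

# Per-repository setting: git-annex will try this remote for file content
# even when its location tracking does not list the remote as having it.
git config remote.origin.annex-speculate-present true

# Read the value back to confirm what git-annex will see.
speculate=$(git config --get remote.origin.annex-speculate-present)
echo "remote.origin.annex-speculate-present=$speculate"
```

The setting lives in .git/config, so every clone would need to set it again.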

kousu commented 3 years ago

Again: https://github.com/spine-generic/data-multi-subject/issues/93

kousu commented 3 years ago

And again https://github.com/neuropoly/data-management/issues/96#issuecomment-868734099

kousu commented 3 years ago

I discovered a reason we have to recommend it: you need to run it once to initialize each remote's annex-uuid:

[kousu@requiem data-single-subject]$ git annex copy --to=praxis-gin
(scanning for unlocked files...)
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly': 
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly': 
git-annex: cannot determine uuid for praxis-gin (perhaps you need to run "git annex sync"?)

I'm not sure, but I would guess this is only an issue with ordinary ssh remotes; I know that with Amazon S3 remotes, the annex uuid is created as a specially named file in the S3 bucket.

Apparently I missed removing it entirely in #97, which is lucky, because https://github.com/neuropoly/data-management/blob/master/internal-server.md#new-repository currently recommends

$ git remote add origin git@data.neuro.polymtl.ca:datasets/my-new-repo
$ git annex sync --content origin

which means the annex uuid will get created safely.
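
A rough way to check whether a remote already has a uuid recorded locally, assuming git-annex caches it under remote.<name>.annex-uuid in .git/config (which is what I see in my clones). The helper name `has_annex_uuid`, the remote name and the URL are made up for the example, and it runs in a scratch repository:

```shell
#!/bin/sh
# Sketch only: detect whether a remote has a cached annex uuid.
set -eu

repo=$(mktemp -d)   # scratch repository, not a real dataset
cd "$repo"
git init -q .
git remote add origin git@example.invalid:datasets/my-new-repo  # placeholder URL

# Hypothetical helper: succeeds if remote.<name>.annex-uuid is set locally.
has_annex_uuid() {
    git config --get "remote.$1.annex-uuid" >/dev/null 2>&1
}

if has_annex_uuid origin; then
    status=known
else
    # This is the state in which `git annex copy --to=origin` gives up
    # with "cannot determine uuid".
    status=unknown
fi
echo "origin annex uuid: $status"
```

In a fresh clone like this one the check reports the uuid as unknown, which is the situation where at least one sync against that remote is still needed.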

I don't know what to do about this. git-annex is impossible.

kousu commented 2 years ago

An explicit git push of git-annex:git-annex doesn't work, because the remote git-annex (sometimes) makes its own merge commits, leading to

! [rejected] git-annex -> git-annex (non-fast-forward)

It's okay if you're working alone, but as soon as you're collaborating you need something to handle merging the branches.

So we must use some form of git annex sync. Just, probably not its default form.
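
One non-default form could look like the sketch below: push the working branch with plain git, then let git annex sync handle only the git-annex branch and the annexed content for a single named remote. The `run` and `upload` helpers, the `REMOTE` and `DRY_RUN` variables, and pushing `HEAD` are my own choices for the example, not something we have agreed on. It prints the commands by default instead of running them.

```shell
#!/bin/sh
# Sketch only: a restricted upload that avoids the default `git annex sync`.
set -eu

REMOTE=${REMOTE:-origin}   # assumed remote name
DRY_RUN=${DRY_RUN:-1}      # 1 = print the commands, 0 = actually run them

# Hypothetical helper: echo a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

upload() {
    # Ordinary git handles the working branch only.
    run git push "$REMOTE" HEAD
    # --only-annex limits sync to the git-annex branch and annexed content;
    # naming the remote keeps it from talking to every other remote.
    run git annex sync --only-annex --content "$REMOTE"
}

plan=$(upload)
printf '%s\n' "$plan"
```

Because sync still merges the git-annex branch on both sides, this should avoid the non-fast-forward rejection above while leaving the other branches alone.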