spine-generic / data-multi-subject

Multi-subject data for the Spine Generic project
Creative Commons Attribution 4.0 International
21 stars 15 forks source link

Add instruction for working from a fork #129

Closed mguaypaq closed 1 year ago

mguaypaq commented 1 year ago

Fixes #128.

After a long bout of debugging, we figured out that, when working from a clone of a fork of this repository, it's important to have up-to-date versions of the git-annex branch from the fork and from this repository before running git annex get (which will automatically reconcile information from all available git-annex branches). This can be done by configuring this repository as a remote called upstream for the clone, and running git fetch upstream before git annex get.

This pull request adds instructions to the readme to this effect.

Questions for the reviewers:

  1. Are the instructions clear enough?
  2. Is this too much of a niche case to include in the readme?
  3. Should it also be copied to data-single-subject?

@renelabounek, I can't tag you as a reviewer, but if you have any thoughts I'd love to hear them.

renelabounek commented 1 year ago

@mguaypaq: Answers below

  1. Instructions are clear to me
  2. I think it is fine as forks>=10. Anyone of the forking pepole, can need to update database and certainly will forget about this for the first time command try. Git annex is a hassle...
  3. There is lower number of forks, but still >0
kousu commented 1 year ago
2. I think it is fine as forks>=10. Anyone of the forking pepole, can need to update database and certainly will forget about this for the first time command try. Git annex is a hassle...

Agreed. It really is.

I wonder if datalad handles this but I am pretty sure they just don't. Their docs on publishing are here https://handbook.datalad.org/en/latest/basics/basics-thirdparty.html#chapter-thirdparty and they don't mention forks at all

Screenshot 2022-08-17 at 14-16-41 site https __handbook datalad org_ fork at DuckDuckGo

git-annex, and hence datalad, are all really built around the idea of multiple repos owned by a single person, or a single repo shared by multiple people. In all the time I've worked with it I've never seen the workflow function well as soon as you try to go full-distributed version control, which is strange because that's git's primary selling point.

kousu commented 1 year ago

I wonder if this would be better fixed by adding a GitHub Workflow that syncs the git-annex branch from upstream regularly. Their API lets you find out who the upstream is, I used it over in:

https://github.com/neuropoly/gitea/blob/37cf5c9e9b9f18b5d56df5fa9db52d932659e3b3/.github/workflows/sync-upstream.yml#L112-L113

If we can get all forks to run that automatically we should avoid the problem. Right?

mguaypaq commented 1 year ago

I like the idea of using automation to avoid this issue, but I think the regular instructions are still useful. Some thoughts (where I use upstream to refer to this repository, origin to refer to a fork, and local to refer to a clone of the fork):

kousu commented 1 year ago

I like the idea of using automation to avoid this issue, but I think the regular instructions are still useful. Some thoughts (where I use upstream to refer to this repository, origin to refer to a fork, and local to refer to a clone of the fork):

* `origin` may not be on github, actually, but just on some convenient server (on some internal network, for example). Though in that case a cron job could be used instead of a github action.

I was definitely thinking this would be a GitHub specific workaround.

* Modifications in `local` that are pushed to `origin` would mean that `origin/git-annex` would diverge from `upstream/git-annex`, with potential merge conflicts.

I think this is always a risk no matter where you're running git-annex. It somehow takes care to make sure that branch is always mergeable, every time it says like this

(merging upstream/git-annex into git-annex...)

Maybe it's doing something special to avoid merge conflicts that git by itself would run into, but if so we use git annex sync --only-annex instead of git merge upstream git-annex:git-annex.

* I'm not sure I want to end up debugging github action problems on someone else's repository.

This is the one that gives me the most pause. I don't want to do that either. If it's in our repos, we'll get notifications about problems, but in someone else's we won't see it until they come to us scratching their heads.

kousu commented 1 year ago

Sorry it took me so long to review this!

mguaypaq commented 1 year ago

Not at all, thanks for the review, the suggestions, and the thorough testing!

I like your earlier suggestion of:

git config remote.upstream.annex-readonly true
git sync --only-annex --no-content

because it should reliably solve the problem (that git annex can't find file contents) without interfering with the master branch.

With your last suggestion, I think the git pull upstream master possibly does too much: if there are separate changes in upstream:master, and also in origin:master or the local master, then it could merge unwanted changes into the local master, or have merge conflicts. (Here I'm imagining that someone is working from a fork because they want to make non-upstream changes to the dataset.)

kousu commented 1 year ago

Not at all, thanks for the review, the suggestions, and the thorough testing!

I like your earlier suggestion of:

git config remote.upstream.annex-readonly true
git sync --only-annex --no-content

because it should reliably solve the problem (that git annex can't find file contents) without interfering with the master branch.

With your last suggestion, I think the git pull upstream master possibly does too much: if there are separate changes in upstream:master, and also in origin:master or the local master, then it could merge unwanted changes into the local master, or have merge conflicts. (Here I'm imagining that someone is working from a fork because they want to make non-upstream changes to the dataset.)

Isn't that what clicking the Sync button on github and then running git pull does anyway? If there's merge conflicts they will be the same at git pull or git pull origin master. And if they need bomp.nii.gz, say to modify it, add better segmentations, etc, then they need to have both master and git-annex synced.

In fact I realized this while testing.

git config remote.upstream.annex-readonly true
git annex sync --only-annex --no-content

isn't enough. It has to be

git config remote.upstream.annex-readonly true
git annex sync --only-annex --no-content
git pull

Are you thinking about the case where they're if they're just submitting new data in a PR? In that case, they don't need to be synced up.


Maybe we're overcomplicating this. The designed UI is

git annex sync

to do everything in one, but I had reasons for moving away from that:

  1. it does a full download before any uploads (which isn't a problem if you're fully synced up, but is if you're just trying to make a small edit),
  2. it tries to sync with all remotes whether or not you have permission to them (but maybe we can deal with that with git config remote.<name>.annex-readonly?).
  3. It tries to sync all branches, apparently (but maybe we can deal with that with git config --global annex.synconlyannex true?)
mguaypaq commented 1 year ago

There's an unrelated failure in the Dataset Validator, when trying to download the dataset. A single file fails, with:

get sub-mniPilot1/anat/sub-mniPilot1_T1w.nii.gz (from amazon...) 
  download failed: HandshakeFailed (Error_Misc "Network.Socket.recvBuf: resource vanished (Connection reset by peer)")

  Unable to access these remotes: amazon

  Try making some of these repositories available:
    5a5447a8-a9b8-49bc-8276-01a62632b502 -- [amazon]
    5cdba4fc-8d50-4e89-bb0c-a3a4f9449666 -- julien@julien-macbook.local:~/code/spine-generic/data-multi-subject
    9e4d13f3-30e1-4a29-8b86-670879928606 -- alex@NeuroPoly-MacBook-Pro.local:~/data/data-multi-subject
    e405e14e-33b2-4a35-b7a7-3eeec054f0d4 -- sebeda@joplin.neuro.polymtl.ca:/mnt/nvme/sebeda/data-multi-subject

  (Note that these git remotes have annex-ignore set: origin)
failed

Hopefully it's just a random failure; I've re-started the job.

kousu commented 1 year ago

Hopefully it's just a random failure; I've re-started the job.

That's the right call. This is #107!