psychoinformatics-de / knowledge-base

Sources for the psyinf knowledge base
https://knowledge-base.psychoinformatics.de

studyforrest/visualrois dataset has broken special remotes #20

Closed bpoldrack closed 1 year ago

bpoldrack commented 1 year ago

Origin: https://github.com/psychoinformatics-de/studyforrest-data-visualrois/issues/6

TODO (not necessarily to be performed in this order)

bpoldrack commented 1 year ago

For a user, in this particular dataset, the errors manifest as follows:

A datalad clone produces a variety of errors, some of them internal git-annex errors. The command also takes roughly 5 minutes (on my system) until it finally finishes with a "could not connect to server" message:

❱ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-visualrois.git
[INFO   ] scanning for annexed files (this may take some time)                  
[INFO   ] Unable to parse git config from origin                                
[INFO   ] Remote origin does not have git-annex installed; setting annex-ignore
|   This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin 
[INFO   ] error: remote mddatasrc already exists. 
[INFO   ] git [Param "remote",Param "add",Param "mddatasrc",Param "http://psydata.ovgu.de/studyforrest/freesurfer/.git"] failed 
[INFO   ] RIA store unavailable. -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to establish a new session 1 times.  -caused by- HTTPConnectionPool(host='studyforrest.ds.inm7.de', port=80): Max retries exceeded with url: /ria-layout-version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd55228fbd0>: Failed to establish a new connection: [Errno -2] Name or service not known')) 

[WARNING] Failed to fetch type=git special remote mddatasrc: CommandError(CommandError: 'git -c diff.ignoreSubmodules=none fetch --verbose --progress mddatasrc' failed with exitcode 128 under /tmp/studyforrest-data-visualrois [err: 'fatal: unable to access 'http://psydata.ovgu.de/studyforrest/visualrois/.git/': Failed to connect to psydata.ovgu.de port 80 after 131126 ms: Couldn't connect to server']) 
^CERROR: 
Interrupted by user while doing magic: KeyboardInterrupt()
datalad clone   8.48s user 2.60s system 4% cpu 4:13.21 total

Although the clone succeeds and there is a worktree, datalad and git-annex operations appear to stall. Only git-annex's debug output makes it evident that the broken special remote is the cause:

❱ git annex -d -v get --from origin sub-01/rois/lEBA_2_mask.nii.gz
[2023-04-13 13:30:10.531453004] (Utility.Process) process [138807] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","show-ref","git-annex"]
[2023-04-13 13:30:10.534125464] (Utility.Process) process [138807] done ExitSuccess
[2023-04-13 13:30:10.534778431] (Utility.Process) process [138808] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","show-ref","--hash","refs/heads/git-annex"]
[2023-04-13 13:30:10.537371471] (Utility.Process) process [138808] done ExitSuccess
[2023-04-13 13:30:10.538040037] (Utility.Process) process [138809] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","log","refs/heads/git-annex..8c48b9f3f5944e3a8bf2ef0e64a79683464d5d01","--pretty=%H","-n1"]
[2023-04-13 13:30:10.541055684] (Utility.Process) process [138809] done ExitSuccess
[2023-04-13 13:30:10.541644348] (Utility.Process) process [138810] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","log","refs/heads/git-annex..9dd5b1b7d608b599d5c22056f7e232b9d34de7a1","--pretty=%H","-n1"]
[2023-04-13 13:30:10.544671337] (Utility.Process) process [138810] done ExitSuccess
[2023-04-13 13:30:10.545880754] (Utility.Process) process [138811] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch"]
[2023-04-13 13:30:10.586119458] (Utility.Url) Request {
  host                 = "psydata.ovgu.de"
  port                 = 80
  secure               = False
  requestHeaders       = [("Accept-Encoding","identity"),("User-Agent","git-annex/10.20221003")]
  path                 = "/studyforrest/visualrois/.git/config"
  queryString          = ""
  method               = "GET"
  proxy                = Nothing
  rawBody              = False
  redirectCount        = 10
  responseTimeout      = ResponseTimeoutDefault
  requestVersion       = HTTP/1.1
  proxySecureMode      = ProxySecureWithConnect
}

This affects every source of the dataset.
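
To spot the offending host quickly in such debug output, the `host = "..."` lines of the Request dumps can be filtered out. A minimal sketch, assuming the dump format shown in the excerpt above; `annex_debug_hosts` is a hypothetical helper name, not part of git-annex:

```shell
# Hypothetical helper: print each distinct host named in git-annex
# debug output (the `host = "..."` lines of its Request dumps).
# Hosts are printed as soon as they are seen, so the filter is still
# useful while the git-annex command itself is stalling.
annex_debug_hosts() {
    awk -F'"' '/^[ \t]*host[ \t]*=/ && !seen[$2]++ { print $2; fflush() }'
}

# Usage (not run here; 2>&1 so the debug output is captured either way):
#   git annex -d -v get --from origin sub-01/rois/lEBA_2_mask.nii.gz 2>&1 | annex_debug_hosts
```

A host that shows up here while the command hangs is a good candidate for the broken special remote.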

The errors users see relate to several special remotes and how they are set up in this dataset.

1.) The most important issue is the original git-type special remote mddatasrc, which points to the no longer existing http://psydata.ovgu.de/studyforrest/visualrois/.git. This needs to be changed to https://datapub.fz-juelich.de/studyforrest/studyforrest/visualrois/.git. This special remote has the UUID 9536f86d-eb34-42ed-8ffc-fafd63a2b87e.

2.) For some reason there's a second git-type special remote registered under the same name, mddatasrc, but pointing to http://psydata.ovgu.de/studyforrest/freesurfer/.git. This appears to have been a mistake to begin with. At the very least it should be changed to autoenable=false to avoid spamming users with misleading and irrelevant errors. I'd also suggest declaring it dead. This special remote has the UUID db2e8480-0894-4e67-93b3-28d0d64d629b.

3.) The ORA special remote pointing to INM-7 is not publicly available but autoenabled. Technically, that's fine: the clone will simply report that it could not be enabled. The message is at INFO level, indicating it's nothing to worry about. The message itself, however, reports an error, which is misleading, especially when other errors occur. I think it's worth considering not autoenabling it.
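
To see how each of these special remotes is currently registered, the stored configuration can be read from remote.log on the git-annex branch. A minimal sketch, assuming the usual layout of one line per remote (the UUID followed by space-separated key=value pairs) and that no value contains spaces; `special_remote_config` is a hypothetical helper name:

```shell
# Hypothetical helper: read remote.log content on stdin and print the
# key=value settings recorded for the given special remote UUID, one
# per line. If the log holds several lines for the same UUID, all of
# them are printed.
special_remote_config() {
    awk -v uuid="$1" '$1 == uuid { for (i = 2; i <= NF; i++) print $i }'
}

# Usage (not run here; needs a clone with a local git-annex branch):
#   git show git-annex:remote.log | special_remote_config 9536f86d-eb34-42ed-8ffc-fafd63a2b87e
```

Running it for both mddatasrc UUIDs should make the two differing location= values visible side by side.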

Fixing 1.) is a bit non-obvious because of the git-type special remote and its interaction with an actual git remote. Right after clone, one cannot simply use git annex enableremote mddatasrc location=NEWURL, because git-annex would try to enable the git remote called mddatasrc that was added to .git/config during autoenabling of the git-type special remote. So, git-annex resolves the name to the wrong thing. If one uses the UUID instead, however, git annex enableremote will try to git remote add the respective git remote, which already exists. Hence, this also fails. Apparently, enableremote's internal flow does not account for being used for reconfiguration rather than plain enabling, even though it is the way to reconfigure. Therefore, what is required is:

git remote remove mddatasrc
git annex enableremote 9536f86d-eb34-42ed-8ffc-fafd63a2b87e location=https://datapub.fz-juelich.de/studyforrest/studyforrest/visualrois/.git

The second call will then reintroduce the git remote with the corrected URL locally. In addition, I am not sure whether the "right" git remote would be there for everyone on every system at this point, because of the second mddatasrc. Removing any mddatasrc git remote from .git/config should work in any case, though.

WRT 2.) I'd suggest:

git annex dead db2e8480-0894-4e67-93b3-28d0d64d629b
git remote remove mddatasrc
git annex enableremote db2e8480-0894-4e67-93b3-28d0d64d629b autoenable=false
git remote remove mddatasrc

Considering the fix for 1.) above, this is more convenient to do first. The git remote remove calls are there for the same reasons as in 1.).

WRT 3.): This should be a matter of git annex enableremote inm7-storage autoenable=false, but I feel this needs judgement by others.

So, overall:

clone from wherever

git annex dead db2e8480-0894-4e67-93b3-28d0d64d629b
git remote remove mddatasrc
git annex enableremote db2e8480-0894-4e67-93b3-28d0d64d629b autoenable=false
git remote remove mddatasrc
git annex enableremote 9536f86d-eb34-42ed-8ffc-fafd63a2b87e location=https://datapub.fz-juelich.de/studyforrest/studyforrest/visualrois/.git

Running an additional git annex fsck -f mddatasrc --fast may be good, but if the content available from the new location is supposed to be identical, it is not strictly necessary.

Finally, push, of course.
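
Since it is unclear whether a git remote named mddatasrc exists on every system at each step, the sequence above could be wrapped so that a missing remote is not treated as an error. A minimal sketch, not a tested fix: the function names and the DRY_RUN switch are assumptions made for illustration, while the UUIDs, URL and git/git-annex commands are the ones from this comment:

```shell
# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Remove the local git remote if present; a missing remote is not an error.
drop_git_remote() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'git remote remove %s (if present)\n' "$1"
    elif git remote | grep -qxF "$1"; then
        git remote remove "$1"
    fi
}

fix_visualrois_remotes() {
    stale=db2e8480-0894-4e67-93b3-28d0d64d629b    # points at the freesurfer URL
    wanted=9536f86d-eb34-42ed-8ffc-fafd63a2b87e   # the actual visualrois source
    new_url=https://datapub.fz-juelich.de/studyforrest/studyforrest/visualrois/.git

    # Same order as the sequence above; stop at the first failing step.
    run git annex dead "$stale" &&
    drop_git_remote mddatasrc &&
    run git annex enableremote "$stale" autoenable=false &&
    drop_git_remote mddatasrc &&
    run git annex enableremote "$wanted" location="$new_url"
}

# Usage, inside the fresh clone (not run here):
#   DRY_RUN=1 fix_visualrois_remotes   # review the plan first
#   fix_visualrois_remotes             # then apply it
```

The dry run only prints the five steps, so the plan can be checked before anything touches the clone.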

One last point: The particular interaction between a git remote and a git-type special remote makes it a bit strange to use enableremote for a configuration change. This may be worth pointing out to Joey.

adswa commented 1 year ago

Here's a coordination issue for a collaborative fix: https://github.com/psychoinformatics-de/studyforrest-data/issues/62