psychoinformatics-de / studyforrest-data

DataLad superdataset of all studyforrest.org project dataset components
https://studyforrest.org

Fix mddatasrc for all studyforrest datasets #62

Open adswa opened 1 year ago

adswa commented 1 year ago

At the moment, the Studyforrest datasets hosted here on GitHub are all broken. The reason for this is a faulty special remote mddatasrc pointing to psydata.ovgu.de, which used to redirect to datapub.fz-juelich.de (where the data was migrated to), but was taken down recently. The first user issue that brought this problem to light is https://github.com/psychoinformatics-de/studyforrest-data-visualrois/issues/6.

Although I've only probed a handful of repositories/subdatasets in this repo, I believe they all have a now-broken mddatasrc special remote registered. I suggest we make a coordinated effort to fix this with as many people as possible. @bpoldrack outlined a fix for this issue in https://github.com/psychoinformatics-de/studyforrest-data-visualrois/issues/6. Here's my translation into a general procedure that anyone can follow:

  1. Take a repo from the list below, tick it off so that others don't duplicate efforts, and clone it from GitHub.
  2. Check if you see errors about mddatasrc during cloning. Even if you don't, try to retrieve data to make sure it all works. If everything works, move to the next dataset; if not, move to 3.
  3. As a first sanity check, investigate remote.log and make sure there is only one mddatasrc special remote (git cat-file -p git-annex:remote.log is the command to do it). If there are two, leave a note, and move to the next dataset for now.
  4. Make a note of the UUID of the mddatasrc special remote in remote.log
  5. Go to https://datapub.fz-juelich.de/studyforrest/studyforrest/ and find the folder that corresponds to the dataset you're handling. The names aren't always identical, but should be easily inferable. If unsure, compare directory contents and filenames. If you can't find a corresponding directory, ask for help in the chat. Make a note of the URL (e.g., https://datapub.fz-juelich.de/studyforrest/studyforrest/aligned) and append /.git
  6. In the cloned dataset, remove the git remote mddatasrc using git remote remove mddatasrc
  7. Fix the special remote mddatasrc using its UUID as an identifier, and the URL you constructed from datapub.fz-juelich.de (see example below) to fix the location information:
    git annex enableremote 7dd5970d-cee5-404e-a3be-6430ec03657f location=https://datapub.fz-juelich.de/studyforrest/studyforrest/aligned/.git
  8. Retrieve a file using datalad get to confirm that this fix worked, and retrieval from mddatasrc is possible again
  9. The fix caused an update in the git-annex branch. datalad push the changes back to GitHub. There is no need (or possibility) to do a pull request. Make sure that the git-annex branch gets successfully pushed. If you run into permission errors, seek help in the chat.
  10. After pushing, re-clone the repo, and retry data retrieval. If things don't work, add a comment to this issue and seek help.
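
Steps 3, 4, 6 and 7 above can be sketched as a small shell helper. This is only a sketch under assumptions, not a tested tool: the function names remote_uuids and fix_special_remote are made up for illustration, it has to be run inside the cloned dataset, and the datapub URL still has to be worked out by hand as described in step 5.

```shell
# Print the UUID of every special remote named NAME.
# Reads the contents of remote.log on stdin (one remote per line, UUID first).
remote_uuids() {
    awk -v name="$1" '{ for (i = 2; i <= NF; i++) if ($i == ("name=" name)) print $1 }'
}

# Look up the UUID, drop the git remote, and record the new location.
# Usage: fix_special_remote mddatasrc https://datapub.fz-juelich.de/studyforrest/studyforrest/aligned/.git
fix_special_remote() {
    name=$1
    new_location=$2
    uuids=$(git cat-file -p git-annex:remote.log | remote_uuids "$name")
    count=$(printf '%s\n' "$uuids" | grep -c . || true)
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one '$name' special remote, found $count; leave a note and move on" >&2
        return 1
    fi
    # The git remote may be missing if autoenabling failed during cloning, so errors are ignored here.
    git remote remove "$name" 2>/dev/null || true
    git annex enableremote "$uuids" "location=$new_location"
}
```

Once the helper returns successfully, steps 8 to 10 (datalad get on one file, datalad push, re-clone) still need to be done by hand.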

List of repositories:

adswa commented 1 year ago

Here's a log of me doing it for studyforrest-data-aligned:

(handbook) adina@muninn in /tmp
❱ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-aligned.git
[INFO   ] Unable to parse git config from origin                                               
[INFO   ] Remote origin does not have git-annex installed; setting annex-ignore                
[INFO   ] This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin 
[INFO   ] RIA store unavailable. -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to establish a new session 1 times.  -caused by- HTTPConnectionPool(host='studyforrest.ds.inm7.de', port=80): Max retries exceeded with url: /ria-layout-version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe08f3af410>: Failed to establish a new connection: [Errno -2] Name or service not known')) 
^CERROR: 
Interrupted by user while doing magic: KeyboardInterrupt()
(handbook) adina@muninn in /tmp
❱ cd studyforrest-data-aligned                                                             3 !
(handbook) adina@muninn in /tmp/studyforrest-data-aligned on git:master
❱ ls
code          LICENSE    src     sub-02  sub-04  sub-06  sub-10  sub-15  sub-17  sub-19
datacite.yml  README.md  sub-01  sub-03  sub-05  sub-09  sub-14  sub-16  sub-18  sub-20
(handbook) adina@muninn in /tmp/studyforrest-data-aligned on git:master
❱ git cat-file -p git-annex:remote.log
77730816-fef8-459d-9c1c-3bb46a20fe0e archive-id=c8ec2919-493b-4af5-9271-cbe9ebd08c43 autoenable=true encryption=none externaltype=ora name=inm7-storage push-url=ria+ssh://bulk1.htc.inm7.de/ds/studyforrest/srv type=external url=ria+http://studyforrest.ds.inm7.de timestamp=1620023916.544174369s
7dd5970d-cee5-404e-a3be-6430ec03657f autoenable=true location=http://psydata.ovgu.de/studyforrest/aligned/.git name=mddatasrc type=git timestamp=1453280984.013246s
(handbook) adina@muninn in /tmp/studyforrest-data-aligned on git:master
❱ git remote remove mddatasrc
(handbook) adina@muninn in /tmp/studyforrest-data-aligned on git:master
❱ git annex enableremote 7dd5970d-cee5-404e-a3be-6430ec03657f location=https://datapub.fz-juelich.de/studyforrest/studyforrest/aligned/.git

enableremote 7dd5970d-cee5-404e-a3be-6430ec03657f ok
(recording state in git...)
(handbook) adina@muninn in /tmp/studyforrest-data-aligned on git:master
❱ datalad get sub-01/in_bold3Tp2/sub-01_task-avmovie_run-1_bold_mcparams.txt
get(ok): sub-01/in_bold3Tp2/sub-01_task-avmovie_run-1_bold_mcparams.txt (file) [from mddatasrc...]
(handbook) adina@muninn in /tmp/studyforrest-data-aligned on git:master
❱ datalad push 
publish(ok): . (dataset) [refs/heads/git-annex->origin:refs/heads/git-annex 304f2250..3a9c6331]
action summary:                                                                                
  publish (notneeded: 1, ok: 1)
(handbook) adina@muninn in /tmp/studyforrest-data-aligned on git:master
❱ cd ..
(handbook) adina@muninn in /tmp
❱ datalad clone git@github.com:psychoinformatics-de/studyforrest-data-aligned.git t
[INFO   ] Unable to parse git config from origin                                               
[INFO   ] Remote origin does not have git-annex installed; setting annex-ignore                
[INFO   ] This could be a problem with the git-annex installation on the remote. Please make sure that git-annex-shell is available in PATH when you ssh into the remote. Once you have fixed the git-annex installation, run: git annex enableremote origin 
[INFO   ] RIA store unavailable. -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to establish a new session 1 times.  -caused by- HTTPConnectionPool(host='studyforrest.ds.inm7.de', port=80): Max retries exceeded with url: /ria-layout-version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f60bec93810>: Failed to establish a new connection: [Errno -2] Name or service not known')) 
install(ok): /tmp/t (dataset)
(handbook) adina@muninn in /tmp
❱ cd t
(handbook) adina@muninn in /tmp/t on git:master
❱ datalad get sub-01/in_bold3Tp2/sub-01_task-avmovie_run-1_bold_mcparams.txt
get(ok): sub-01/in_bold3Tp2/sub-01_task-avmovie_run-1_bold_mcparams.txt (file) [from mddatasrc...]

christian-monch commented 1 year ago

Thanks @adswa for the great instructions!

I have changed the mddatasrc location in https://github.com/psychoinformatics-de/studyforrest-data-phase2, and datalad get works now.

I get the same [INFO]-message about an unavailable RIA store, which I assume is OK, right?

adswa commented 1 year ago

Yes, this message is unrelated to the special remote 👍

jsheunis commented 1 year ago

This is what I get for https://github.com/psychoinformatics-de/studyforrest-data-aggregate.

Two git-annex remotes, nothing about mddatasrc, access errors:

❱ datalad clone https://github.com/psychoinformatics-de/studyforrest-data-aggregate.git
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore
[INFO   ] https://github.com/psychoinformatics-de/studyforrest-data-aggregate.git/config download failed: Not Found
[INFO   ] RIA store unavailable. -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to establish a new session 1 times.  -caused by- HTTPConnectionPool(host='studyforrest.ds.inm7.de', port=80): Max retries exceeded with url: /ria-layout-version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f97dccd2ad0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
[WARNING] Failed to fetch type=git special remote psydata: CommandError(CommandError: 'git -c diff.ignoreSubmodules=none fetch --verbose --progress psydata' failed with exitcode 128 under /Users/jsheunis/Documents/psyinf/Data/studyforrest-data-aggregate [err: 'fatal: unable to access 'http://psydata.ovgu.de/studyforrest/aggregate/.git/': Failed to connect to psydata.ovgu.de port 80: Operation timed out'])
install(ok): /Users/jsheunis/Documents/psyinf/Data/studyforrest-data-aggregate (dataset)

❱ git cat-file -p git-annex:remote.log
11d89be1-d3e3-4803-8ba2-c168411b4e80 autoenable=true location=http://psydata.ovgu.de/studyforrest/aggregate/.git name=psydata type=git timestamp=1511528268.553343287s
b1cffbff-ef07-4f22-a736-53d92eeb2c7a archive-id=7fcd8812-d0fe-11e7-8db2-a0369f7c647e autoenable=true encryption=none externaltype=ora name=inm7-storage push-url=ria+ssh://bulk1.htc.inm7.de/ds/studyforrest/srv type=external url=ria+http://studyforrest.ds.inm7.de timestamp=1620022089.160920561s

❱ datalad get sub-01/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
get(error): sub-01/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz (file) [Remote psydata not usable by git-annex; setting annex-ignore
http://psydata.ovgu.de/studyforrest/aggregate/.git/config download failed: ConnectionFailure Network.Socket.connect: <socket: 19>: timeout (Operation timed out)]

adswa commented 1 year ago

Thx! In this case, the remote is not called mddatasrc but psydata. Can you replace mddatasrc with psydata in all commands? I believe this should do the trick. Thank you so much! :)
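
Since the broken remote does not carry the same name in every dataset, one option is to search remote.log for the dead host instead of for a name. A minimal sketch, assuming every affected entry has a location= field pointing at psydata.ovgu.de; the function name stale_remotes is made up for illustration:

```shell
# Print "UUID NAME" for every special remote whose location still points at psydata.ovgu.de.
# Reads the contents of remote.log on stdin.
stale_remotes() {
    awk '
        {
            name = ""; stale = 0
            for (i = 2; i <= NF; i++) {
                if (index($i, "name=") == 1) name = substr($i, 6)
                if (index($i, "location=") == 1 && index($i, "//psydata.ovgu.de/") > 0) stale = 1
            }
            if (stale) print $1, name
        }'
}

# Inside a cloned dataset:
#   git cat-file -p git-annex:remote.log | stale_remotes
```

The printed name is the one to use in git remote remove, and the UUID is the one to pass to git annex enableremote.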

adswa commented 1 year ago

https://github.com/psychoinformatics-de/studyforrest-data-templatetransforms (fixed, but some get "impossible" and "errors" remain. See datalad get sub-01)

I'm investigating :+1:

jsheunis commented 1 year ago

Thx! In this case, the remote is not called mddatasrc but psydata. Can you replace mddatasrc with psydata in all commands? I believe this should do the trick. Thank you so much! :)

I enabled the remote with the correct location, but I'm still getting errors when retrieving file content:

> git remote remove psydata

> git annex enableremote 11d89be1-d3e3-4803-8ba2-c168411b4e80 location=https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/.git
enableremote 11d89be1-d3e3-4803-8ba2-c168411b4e80 ok
(recording state in git...)

> datalad get sub-01/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
get(error): sub-01/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz (file) [download failed: Not Found
failed to download content
download failed: Not Found
failed to download content
download failed: Not Found
failed to download content]

with debug:

[DEBUG  ] received JSON result from annex: {'command': 'get', 'error-messages': ['  download failed: Not Found', '  failed to download content', '  download failed: Not Found', '  failed to download content', '  download failed: Not Found', '  failed to download content'], 'file': 'sub-16/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz', 'input': ['sub-16/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz'], 'key': 'MD5E-s40787--8ade422c9c0d49788f3b6ad793f81b9c.nii.gz', 'note': 'from psydata...\nUnable to access these remotes: psydata\n(Note that these git remotes have annex-ignore set: origin)', 'success': False, 'wanted': [{'description': 'mih@meiner:~/forrest/collection/aggregate', 'here': False, 'uuid': '11d60f59-d220-4b86-9b84-7fdcfe6937c7'}, {'description': '', 'here': False, 'uuid': '272356f8-65a0-4ba5-a217-1c7ebb97903d'}, {'description': 'mih@medusa:/home/data/psyinf/forrest_gump/collection/aggregate', 'here': False, 'uuid': '2d364a44-eb57-4a88-9c75-a1a22fbabfeb'}, {'description': 'inm7-storage', 'here': False, 'uuid': 'b1cffbff-ef07-4f22-a736-53d92eeb2c7a'}, {'description': 'git@82709b2ed170:/data/repos/studyforrest/aggregate-fmri-timeseries.git', 'here': False, 'uuid': 'f2cd7b91-6ce6-490f-b2dd-21bce9b90b6b'}]}

adswa commented 1 year ago

Thx, I will investigate and report back what I found! :+1:

adswa commented 1 year ago

Edit: This was fixed and pushed. Done!

First observation about https://github.com/psychoinformatics-de/studyforrest-data-aggregate: Some files on datapub.fz-juelich.de seem access-restricted. I don't know what to do here, so I'll tag @aqw and @mih for potential insights

It affects the following files:

https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/atlases/shen/fconn_atlas_150_1mm.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/atlases/shen/fconn_atlas_150_2mm.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-01/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-02/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-03/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-04/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-05/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-06/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-09/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-10/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-14/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-15/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-16/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-17/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-18/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-19/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
https://datapub.fz-juelich.de/studyforrest/studyforrest/aggregate/sub-20/atlases/bold3Tp2/shen_fconn_atlas_150.nii.gz
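
One way to tell restricted files apart from missing ones is to look at the HTTP status the server returns for each URL. The following is a rough sketch, assuming curl is available and that the server answers HEAD requests with 401/403 for restricted files and 404 for missing ones (I have not verified how datapub behaves here). The function names and the file name urls.txt are placeholders.

```shell
# Map an HTTP status code to a rough verdict.
classify_status() {
    case "$1" in
        2??) echo "ok" ;;
        401|403) echo "restricted" ;;
        404) echo "missing" ;;
        *) echo "other" ;;
    esac
}

# Send a HEAD request and print "VERDICT STATUS URL" for one URL.
probe_url() {
    status=$(curl --silent --head --output /dev/null --write-out '%{http_code}' "$1")
    echo "$(classify_status "$status") $status $1"
}

# Example: probe every URL listed one per line in urls.txt (placeholder file name).
#   while read -r url; do probe_url "$url"; done < urls.txt
```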

Edit: I only now realized that these are all of the annexed files in the dataset; the rest is in Git.

adswa commented 1 year ago

EDIT: The problem is that the dataset at https://datapub.fz-juelich.de/studyforrest/studyforrest/templatetransforms/.git/ is an old version, with its last commit from 2016. The dataset on GitHub has more recent commits. They seem to originate from juseless, but only the commits were published, not the corresponding file content. If we push this dataset from data1:/data/project/studyforrest/superds/derivative/image_space_transformations to datapub, this should get fixed. I don't have permissions to do this.

For https://github.com/psychoinformatics-de/studyforrest-data-templatetransforms I also need some help, so I'm tagging @mih and @bpoldrack:

There are files that can't be retrieved, e.g.:

sub-01/bold3Tp2/in_t1w/brain.nii.gz
sub-01/bold3Tp2/in_t1w/xfm_6dof.mat
sub-01/t1w/in_bold3Tp2/brain.nii.gz
sub-01/t1w/in_bold3Tp2/xfm_6dof.mat

This is the availability information registered for those files (shown for one file; it is the same for all of them). The important bit is that the enabled mddatasrc special remote isn't listed.

❱ git annex whereis sub-01/bold3Tp2/in_t1w/xfm_6dof.mat
whereis sub-01/bold3Tp2/in_t1w/xfm_6dof.mat (4 copies) 
    43613943-720c-4018-a7a5-40c6fb9ad603 -- inm7-storage
    529fccea-fdf5-4266-99a4-769e2638f82f -- mih@medusa:/home/data/psyinf/forrest_gump/collection/tnt
    a6358f69-bae7-4035-a9b8-7751eb3d9144 -- git@82709b2ed170:/data/repos/studyforrest/imagespace-transformations.git
    f2ec3af6-e466-4951-b1dc-4991ade8f171 -- mih@data1:/data/project/studyforrest/superds/derivative/image_space_transformations
ok

However, the files are available at mddatasrc, for example https://datapub.fz-juelich.de/studyforrest/studyforrest/templatetransforms/sub-01/bold3Tp2/in_t1w.
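
To get the full list of affected files instead of checking them one by one with whereis, git annex find can be combined with matching options. A minimal sketch; the function name files_not_recorded_at is made up, and note that this only consults the location log in the git-annex branch and does not contact the remote:

```shell
# List annexed files under PATH (default: the current directory) that the
# location log does not record as present in REMOTE.
files_not_recorded_at() {
    git annex find --not --in="$1" "${2:-.}"
}

# Example, inside the dataset:
#   files_not_recorded_at mddatasrc sub-01
```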

I already ran git annex fsck --from mddatasrc, which reported success but did not update availability. My question is: how can I tell git-annex that mddatasrc is a suitable location for the files in question, too? Is it a job for addurls?

A side question is whether those files are left unregistered on purpose, e.g., because of data privacy.

adswa commented 1 year ago

As for https://github.com/psychoinformatics-de/studyforrest-data-phase2-denoised, we don't have this data; all sources are with OpenNeuro as far as I can see.

Edit: The dataset here on GitHub is outdated. The problem is that the data was updated upstream, and the content of the now unavailable files was moved to *_decomposition.json in commit de145f67a3da26f1d39187403340d7380d928cf2 (tag 1.3.0). ~I will get the dataset in sync with the one from OpenNeuro~ After a quick discussion in the chat, we decided to add a fork of the OpenNeuro dataset instead of syncing.

adswa commented 1 year ago

A quick overview of a TODO for @mih:

bpoldrack commented 1 year ago

@adswa

I already did an git annex fsck --from mddatasrc which reported success, but did not update availability.

That is strange. It should update availability if there was a change and it would be the way to go. When you say "reported success", do you mean a zero exit of the command or that it reported to find those files via special remote?

adswa commented 1 year ago

See my edit in that post and my most recent comment to @mih with a fix, @bpoldrack: the files in question differ in version between GitHub and datapub. Datapub is outdated: it does not know the updated annex keys that GitHub knows about (but doesn't carry). So while a file on GitHub points to annex/objects/zW/..., datapub does not have this key in its object tree yet (this version of the file was never pushed to it; it only lives on data1). I figured this out when I tried to run git annex setpresentkey to manually register mddatasrc as a location for the key.
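
For reference, a sketch of what such a manual registration could look like, guarded by a presence check so that nothing gets recorded for keys the remote doesn't hold (which is exactly the situation here). The function name register_if_present and its argument order are made up for illustration; lookupkey, checkpresentkey and setpresentkey are git-annex plumbing commands.

```shell
# Record REMOTE (identified by UUID) as a location of FILE's key,
# but only if the remote actually holds that key.
# Usage: register_if_present FILE REMOTE UUID
register_if_present() {
    file=$1
    remote=$2
    uuid=$3
    key=$(git annex lookupkey "$file") || return 1
    if git annex checkpresentkey "$key" "$remote" >/dev/null; then
        git annex setpresentkey "$key" "$uuid" 1
    else
        echo "$remote does not hold $key; nothing recorded" >&2
        return 1
    fi
}
```

With the outdated datapub copy, the presence check would fail for the newer keys, so the real fix remains pushing the current file content to datapub.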

adswa commented 1 year ago

Another TODO for @mih:

I lack the permissions to do so, and this dataset is superfluous as I have forked the openneuro dataset as discussed in the chat as a maintained alternative to https://github.com/psychoinformatics-de/studyforrest-data-phase2-denoised_openneuro

TODO for me: