templateflow / datalad-osf

Utility scripts to facilitate use of Datalad with OSF
Apache License 2.0

Usage of pipeline ... and error when trying recursive update #1

Closed sappelhoff closed 4 years ago

sappelhoff commented 5 years ago

Hi, I am trying to turn an OSF directory into a git-annex repository, and datalad-osf seems to be great for this.

I am not entirely sure whether this works the way I think it does; basically, I expect to:

  1. Make a new directory and call datalad create
  2. Call update_recursive with my OSF key and the directory I want to be git-annexed
  3. Upload the new directory (e.g., to GitHub) and get the repository URL
  4. Be able to pass this URL to datalad install and get the data (which is stored on OSF, but indexed in my git-annex repository)
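
In shell terms, roughly (a sketch of what I assume should work; the GitHub URL is a placeholder):

```bash
mkdir mystuff && cd mystuff
datalad create
# step 2: register the OSF files (key and subset as in my example below)
python -c "import datalad_osf; datalad_osf.update_recursive('cj2dr', 'eeg_matchingpennies')"
# step 3: push the lightweight repo to GitHub; then anyone should be able to do:
datalad install https://github.com/<user>/mystuff
cd mystuff && datalad get .
```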

Can someone tell me whether I am just completely misunderstanding / misusing the pipeline? @yarikoptic @rciric

Is there a simpler way to achieve what I want?


Apart from the question above, this is also a bug report. Here are the steps to reproduce:

  1. mkdir mystuff
  2. cd mystuff
  3. datalad create

  4. Into mystuff, put the following Python file, try.py:

```python
import datalad_osf

key = 'cj2dr'
subset = 'eeg_matchingpennies'

datalad_osf.update_recursive(key, subset)
```

  5. From mystuff, run python try.py

... after a considerable amount of time, this provides me with the following error message:

```
Traceback (most recent call last):
  File "try.py", line 6, in <module>
    datalad_osf.update_recursive(key, subset)
  File "/home/stefanappelhoff/Desktop/datalad-osf/datalad_osf/utils.py", line 186, in update_recursive
    addurls_from_csv(csv)
  File "/home/stefanappelhoff/Desktop/datalad-osf/datalad_osf/utils.py", line 65, in addurls_from_csv
    ifexists='overwrite')
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 492, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 480, in return_func
    results = list(results)
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 429, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/interface/utils.py", line 522, in _process_results
    for res in results:
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/plugin/addurls.py", line 719, in __call__
    missing_value)
  File "/home/stefanappelhoff/miniconda3/lib/python3.7/site-packages/datalad/plugin/addurls.py", line 407, in extract
    metacols = (c for c in sorted(rows[0].keys()) if c != urlcol)
IndexError: list index out of range
```

yarikoptic commented 5 years ago

Myself, I haven't used datalad-osf yet. From the traceback it sounds like some empty row? If you run with datalad --dbg, you could get into pdb and troubleshoot the details, or just check out that CSV?
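
E.g., something along these lines (a sketch; the script being the try.py from above):

```bash
# --dbg drops into pdb when a datalad CLI command crashes
datalad --dbg <command>
# for a plain Python script, pdb itself works too
python -m pdb try.py
```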

sappelhoff commented 5 years ago

or just check out that CSV?

there is no CSV in my dataset :thinking:

BUT having talked to @jasmainak a bit it seems like my premise is wrong.

I thought I could create a git annex repo that would look JUST LIKE my real dataset, but instead of the real data, it would contain symbolic links pointing to the OSF data.

And then I would be able to host that git annex repo (very small) on GitHub, allow people to pull it with datalad, and use datalad.api.get() to download the data from OSF.

According to Mainak I would need my own git server to do something like that.

Apparently datalad_osf is just a Python API to download OSF files (the true files) and add them to a local git annex repo (with the full files).

yarikoptic commented 5 years ago

According to Mainak I would need my own git server to do something like that.

I don't think so. git-annex will just contain urls pointing to OSF

Apparently datalad_osf is just a Python API to download OSF files (the true files) and add them to a local git annex repo (with the full files).

yeap, and then you can publish that repository to github, along with git-annex branch (datalad publish does that) so anyone who clones it should be able to get actual files using git annex get or datalad get from OSF. So the premise is right as far as I see ;)
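
For concreteness, that flow would look roughly like this (sibling name and URLs are placeholders):

```bash
# publisher: push master plus the git-annex branch to GitHub
# (assuming a 'github' sibling was set up, e.g. with create-sibling-github)
datalad publish --to github
# consumer: clone, then fetch content straight from OSF
datalad install https://github.com/<user>/<repo>
cd <repo>
datalad get some/file
```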

yarikoptic commented 5 years ago

I didn't look into anything else, but just FYI: the fetched CSV has only the header.

```
$> cat eeg_matchingpennies 
name,url,location,sha256,path
```

As for datalad crashing instead of just silently exiting or issuing a warning that no records were received, I filed https://github.com/datalad/datalad/issues/3577

sappelhoff commented 5 years ago

I didn't look into anything else, but just FYI: the fetched CSV has only the header.

Mh, yes, this is a bug; the test example from the main README also fails. Perhaps we should wait for @rciric to work this out.

In the meantime, do you have a pointer to docs / tutorials on how to do what I want (see above, my "premise") using just datalad?

kyleam commented 5 years ago

In the meantime, do you have a pointer to docs / tutorials on how to do what I want (see above, my "premise") using just datalad?

I don't believe we have a high-level tutorial on addurls yet. But here's a quick example using a couple of the URLs from the OSF directory that you pointed to. This skips past the more involved task, which IIUC datalad-osf handles, of getting a set of stable URLs and putting them into either a .json or .csv file that addurls() understands.

example:

```bash
#!/bin/sh
set -eu
datalad create someds
cd someds
cat >files.csv <[...]
```

The resulting dataset looks like this:

```
|-- files.csv -> .git/annex/objects/[...]7f913d093561b0b385d076a32d1ea9f1.csv
|-- sub-05 -> .git/annex/objects/[...]f68f6c37ac758d82cd8c7d95dee70bbf
`-- sub-06 -> .git/annex/objects/[...]ecda9020e4f012517f531e5be571e8db
```

The public URLs for these files have been registered with git-annex:

```
> someds $ git annex whereis
whereis sub-05 (2 copies) 
    00000000-0000-0000-0000-000000000001 -- web
    6411a76e-97a7-4c98-80a7-a9832599ddff -- kyle@hylob:~/scratch/dl/addurls-examples/someds [here]

  web: https://osf.io/5br27/download
ok
whereis sub-06 (2 copies) 
    00000000-0000-0000-0000-000000000001 -- web
    6411a76e-97a7-4c98-80a7-a9832599ddff -- kyle@hylob:~/scratch/dl/addurls-examples/someds [here]

  web: https://osf.io/9q8r2/download
ok
```

This means that you can publish the repository without the data, and people who have cloned it will be able to get the files with `{git annex,datalad} get`. (This requires publishing the git-annex branch.) You can verify locally that this works by cloning the repo and then dropping the origin remote, so that the only place annex can get the content from is the web.

```
$ datalad install -s someds clone
$ cd clone
$ git annex dead origin
$ git remote rm origin
$ git annex whereis
whereis sub-05 (1 copy) 
    00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/5br27/download
ok
whereis sub-06 (1 copy) 
    00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/9q8r2/download
ok
$ git annex get sub-05
get sub-05 (from web...) (checksum...) ok
(recording state in git...)
```

sappelhoff commented 5 years ago

Wow, this is really great @kyleam thanks!

sappelhoff commented 5 years ago

This seems to have worked to a large extent! I have made a CSV file with my file paths and URLs, "mp.csv", and made a datalad dataset:

CSV content for convenience

~~~
fpath,url
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_channels.tsv,https://osf.io/wdb42/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_eeg.eeg,https://osf.io/3at5h/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_eeg.vhdr,https://osf.io/3m8et/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_eeg.vmrk,https://osf.io/7gq4s/download
eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_events.tsv,https://osf.io/9q8r2/download
eeg_matchingpennies/sourcedata/sub-05/eeg/sub-05_task-matchingpennies_eeg.xdf,https://osf.io/agj2q/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_channels.tsv,https://osf.io/256sk/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_eeg.eeg,https://osf.io/p52dn/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_eeg.vhdr,https://osf.io/jk649/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_eeg.vmrk,https://osf.io/wdjk9/download
eeg_matchingpennies/sub-06/eeg/sub-06_task-matchingpennies_events.tsv,https://osf.io/5br27/download
eeg_matchingpennies/sourcedata/sub-06/eeg/sub-06_task-matchingpennies_eeg.xdf,https://osf.io/rj3nf/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_channels.tsv,https://osf.io/qvze6/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_eeg.eeg,https://osf.io/z792x/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_eeg.vhdr,https://osf.io/2an4r/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_eeg.vmrk,https://osf.io/u7v2g/download
eeg_matchingpennies/sub-07/eeg/sub-07_task-matchingpennies_events.tsv,https://osf.io/uyhtd/download
eeg_matchingpennies/sourcedata/sub-07/eeg/sub-07_task-matchingpennies_eeg.xdf,https://osf.io/aqesz/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_channels.tsv,https://osf.io/4safg/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_eeg.eeg,https://osf.io/dg9b4/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_eeg.vhdr,https://osf.io/w6kn2/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_eeg.vmrk,https://osf.io/mrkag/download
eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_events.tsv,https://osf.io/u76fs/download
eeg_matchingpennies/sourcedata/sub-08/eeg/sub-08_task-matchingpennies_eeg.xdf,https://osf.io/6t5vg/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_channels.tsv,https://osf.io/nqjfm/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_eeg.eeg,https://osf.io/6m5ez/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_eeg.vhdr,https://osf.io/btv7d/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_eeg.vmrk,https://osf.io/daz4f/download
eeg_matchingpennies/sub-09/eeg/sub-09_task-matchingpennies_events.tsv,https://osf.io/ue7ah/download
eeg_matchingpennies/sourcedata/sub-09/eeg/sub-09_task-matchingpennies_eeg.xdf,https://osf.io/59zde/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_channels.tsv,https://osf.io/5cfmh/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_eeg.eeg,https://osf.io/ya8kr/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_eeg.vhdr,https://osf.io/he3c2/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_eeg.vmrk,https://osf.io/bw6fp/download
eeg_matchingpennies/sub-10/eeg/sub-10_task-matchingpennies_events.tsv,https://osf.io/r5ydt/download
eeg_matchingpennies/sourcedata/sub-10/eeg/sub-10_task-matchingpennies_eeg.xdf,https://osf.io/gfsnv/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv,https://osf.io/6p8vr/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_eeg.eeg,https://osf.io/ywnpg/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_eeg.vhdr,https://osf.io/p7xk2/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_eeg.vmrk,https://osf.io/8u5fm/download
eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_events.tsv,https://osf.io/rjzhy/download
eeg_matchingpennies/sourcedata/sub-11/eeg/sub-11_task-matchingpennies_eeg.xdf,https://osf.io/4m3g5/download
eeg_matchingpennies/.bidsignore,https://osf.io/6thgf/download
eeg_matchingpennies/CHANGES,https://osf.io/ckmbf/download
eeg_matchingpennies/dataset_description.json,https://osf.io/tsy4c/download
eeg_matchingpennies/LICENSE,https://osf.io/mkhd4/download
eeg_matchingpennies/participants.tsv,https://osf.io/6mceu/download
eeg_matchingpennies/participants.json,https://osf.io/ku2dn/download
eeg_matchingpennies/README,https://osf.io/k8hjf/download
eeg_matchingpennies/task-matchingpennies_eeg.json,https://osf.io/qf5d8/download
eeg_matchingpennies/task-matchingpennies_events.json,https://osf.io/3qztv/download
eeg_matchingpennies/stimuli/left_hand.png,https://osf.io/g45de/download
eeg_matchingpennies/stimuli/right_hand.png,https://osf.io/2r9zd/download
~~~

```
datalad create eeg_matchingpennies
datalad addurls mp.csv "{url}" "{fpath}" -d eeg_matchingpennies/
```

(Note: I did not commit the CSV to the repo, because I thought it was not necessary.)

There seems to be a bug, however, with some of the files:

```
cd eeg_matchingpennies
git annex whereis
```

For some files, this prints several links, of which all except one are wrong. E.g.:

```
whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (1 copy) 
    00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/4safg/download
  web: https://osf.io/5cfmh/download
  web: https://osf.io/6p8vr/download
  web: https://osf.io/nqjfm/download
  web: https://osf.io/qvze6/download
ok
```

I checked the CSV file, and it does not seem to be the source of the error. Can either of you reproduce this error, @yarikoptic @kyleam?


Separate question: I continued as @kyleam suggested, making a local clone and removing the origin, to get a publishable git-annex dataset with only the "web" source of the data.

See: https://github.com/sappelhoff/bogus

Apparently something went wrong; can you tell me what I should do?

After cloning and removing the origin, I did (with the clone):

  1. Make a new GitHub repository
  2. In the clone, run git remote add origin https://github.com/sappelhoff/bogus
  3. Run git push origin master

When I realized that this did not look right, I figured that datalad publish might be the way to go, so I tried (on top of the previous steps):

  1. From the root of the clone, run datalad publish . --to origin --force

But all that gave me was a cryptic "git-annex" branch ...

I now want to use datalad install https://github.com/sappelhoff/bogus. Do I first have to merge the git-annex branch into master? Or do I leave both branches untouched?

Is this the right way to go at all?

yarikoptic commented 5 years ago

Just go ahead with datalad install https://github.com/sappelhoff/bogus

The git-annex branch should never be merged into any normal branch. Leave it for git-annex to deal with.
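
I.e., just something like (the file path is one from your CSV, as an example):

```bash
datalad install https://github.com/sappelhoff/bogus
cd bogus
git branch -r   # remotes/origin/git-annex should be listed; leave it alone
datalad get eeg_matchingpennies/sub-05/eeg/sub-05_task-matchingpennies_channels.tsv
```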

kyleam commented 5 years ago

@sappelhoff:

For some files, this prints several links, of which all except one are wrong. E.g.: [...] I checked the CSV file, and it does not seem to be the source of the error. Can either of you reproduce this error, @yarikoptic @kyleam?

Hrm that's odd.

I tried with --fast first, and all of the URLs look OK on my end (i.e., I see only one web entry per file). Here's the one from your example:

```
$ git annex whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv
whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (1 copy) 
    00000000-0000-0000-0000-000000000001 -- web

  web: https://osf.io/6p8vr/download
ok
```

I'm trying now without --fast.

I'm running this with datalad 0.11.6 and git-annex 7.20190730+git2-ga63bf35dc-1~ndall+1 on GNU/Linux. What's your version info?

sappelhoff commented 5 years ago

Thanks Yaroslav, I'll try that later!

@kyleam I am using:

Good to hear that it works with --fast ... I am excited to see what you'll find without it.

However, reading what --fast does, I should perhaps have used that in the first place, because I am purging the local data later on anyhow :-)

kyleam commented 5 years ago

Good to hear that it works with --fast ... I am excited to see what you'll find without it.

Without --fast I see repeats, including the example you point to:

```
$ git annex whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv
whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (2 copies) 
    00000000-0000-0000-0000-000000000001 -- web
    24081d41-a5ee-434b-a58a-4401106dc189 -- foo [here]

  web: https://osf.io/4safg/download
  web: https://osf.io/5cfmh/download
  web: https://osf.io/6p8vr/download
  web: https://osf.io/nqjfm/download
  web: https://osf.io/qvze6/download
ok
```

It seems there has to be something going wrong in the underlying git annex addurl --batch call, but I don't know whether it's on our end (in AnnexRepo, not addurls.py) or git-annex's. Some time next week, I'll try to see whether I can trigger the issue using git-annex directly.
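
Roughly along these lines (a sketch, not yet tested; file names made up):

```bash
# feed "url file" pairs to addurl in batch mode, the same way datalad does
git init addurl-test && cd addurl-test && git annex init
printf '%s\n' \
  'https://osf.io/4safg/download f1' \
  'https://osf.io/5cfmh/download f2' \
  'https://osf.io/6p8vr/download f3' |
  git annex addurl --with-files --batch
```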

kyleam commented 5 years ago

Aah, it should've occurred to me sooner, but that could happen if those files have the same content, and the files indeed point to the same key in all the cases I've checked. So I think things are working as expected.
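
One quick way to verify (the paths are two of the files from your whereis output):

```bash
# identical output means identical content, so all of their
# registered URLs end up attached to the one shared key
git annex lookupkey \
  eeg_matchingpennies/sub-08/eeg/sub-08_task-matchingpennies_channels.tsv \
  eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv
```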

kyleam commented 5 years ago

@sappelhoff:

However, reading what --fast does, I should perhaps have used that in the first place, because I am purging the local data later on anyhow :-)

It's more expensive, but leaving out --fast buys you a content guarantee. With --fast, future downloads will only verify that the file has the expected size.

You can see this difference by looking at the link targets. Without --fast, you get a file that points to the key generated from the file's content:

```
test3 -> .git/annex/objects/wj/6x/SHA256E-s250--dd8[...]7a0/SHA256E-s250--dd8[...]7a0
```

With --fast, the target only encodes the size:

```
test4 -> '.git/annex/objects/81/K7/URL-s250--https&c%%osf.io%5cfmh%download/URL-s250--https&c%%osf.io%5cfmh%download'
```

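And if you start out with --fast but later decide you want checksummed keys, git-annex can re-key a file once its content is present (a sketch):

```bash
git annex get test4                         # download the content
git annex migrate --backend=SHA256E test4   # URL-s250--... becomes SHA256E-s250--...
```
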
sappelhoff commented 5 years ago

that could happen if those files have the same content, and the files indeed point to the same key in all the cases I've checked

Interesting, thanks for the detective work!

leaving out --fast buys you a content guarantee.

Okay, that's something I would like. That also explains why we don't see duplicates with --fast.

sappelhoff commented 5 years ago

I think I found the error that explains why my CSV was never populated ...

It seems like this repo is MRI-centric, and only .nii.gz files are expected to be loaded from OSF:

https://github.com/templateflow/datalad-osf/blob/42a2b934b7a94e9bb65885bf47ce3662db377246/datalad_osf/utils.py#L91-L96

That should be easy to fix!
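
A quick way to confirm that from a checkout of this repo (hypothetical one-liner; the linked lines show the filter):

```bash
# only names ending in .nii.gz survive the filter, so an EEG dataset
# yields a header-only CSV and addurls crashes on the empty row list
grep -n 'nii.gz' datalad_osf/utils.py
```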