Closed: sappelhoff closed this issue 4 years ago
I haven't used datalad-osf myself yet. From the traceback it sounds like some empty row?
If you run with `datalad --dbg` you could get into pdb and troubleshoot details, or just check out that csv?

> or just check out that csv?
there is no csv in my dataset :thinking:
BUT having talked to @jasmainak a bit it seems like my premise is wrong.
I thought I could create a git annex repo that would look JUST LIKE my real dataset, but instead of the real data, it would contain symbolic links pointing to the OSF data.
And then I would be able to host that git annex repo (very low size) on GitHub, allow people to pull it with datalad, and use datalad.api.get() to download the data from OSF.
According to Mainak I would need my own git server to do something like that.
Apparently datalad_osf is just a Python API to download OSF files (the true files) and add them to a local git annex repo (with the full files).
> According to Mainak I would need my own git server to do something like that.
I don't think so. git-annex will just contain urls pointing to OSF
> Apparently datalad_osf is just a Python API to download OSF files (the true files) and add them to a local git annex repo (with the full files).
yeap, and then you can publish that repository to github, along with the git-annex branch (`datalad publish` does that), so anyone who clones it should be able to get the actual files from OSF using `git annex get` or `datalad get`. So the premise is right as far as I see ;)
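The cycle described above might look roughly like this. The dataset name, remote, and CSV file are invented for illustration, and the datalad commands are left as comments because they require datalad and git-annex to be installed; only a plain-git stand-in at the end actually runs:

```shell
# Publisher side (hypothetical; needs datalad + git-annex, hence commented):
#   datalad create mydataset
#   datalad addurls urls.csv '{url}' '{path}' -d mydataset
#   datalad publish --to origin       # pushes master plus the git-annex branch
# Consumer side:
#   datalad install https://github.com/<user>/mydataset
#   datalad get mydataset             # fetches actual file content from OSF
# Runnable stand-in: a plain git repo in place of the datalad dataset.
mkdir -p mydataset
git -C mydataset init -q
git -C mydataset rev-parse --is-inside-work-tree
```

The key point is that the GitHub remote only ever holds the (tiny) git history plus the git-annex branch; the file content stays on OSF.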
didn't look into anything else but just FYI that the fetched csv has only the header.
$> cat eeg_matchingpennies
name,url,location,sha256,path
as for datalad crashing instead of just silently exiting or issuing a warning that no records were received, I filed https://github.com/datalad/datalad/issues/3577
> didn't look into anything else but just FYI that the fetched csv has only the header.
Mh, yes - this is a bug, also the test example from the main README fails, perhaps we should wait for @rciric to work this out.
In the meantime, do you have a pointer to docs / tutorials how to do what I want to (see above, my "premise") using just datalad?
> In the meantime, do you have a pointer to docs / tutorials how to do what I want to (see above, my "premise") using just datalad?
I don't believe we have a high-level tutorial on addurls yet. But here's a quick example using a couple of the URLs from the OSF directory that you pointed to. This skips past the more involved task, which IIUC datalad-osf handles, of getting a set of stable URLs and putting them into either a .json or .csv file that addurls() understands.
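For reference, a minimal sketch in Python of the kind of .csv that addurls consumes: one column per placeholder (`{url}`, `{path}`). The OSF URL below is one that appears later in this thread; the path layout is just an example:

```python
import csv
import io

# Rows mapping a stable OSF download URL to a target file path.
# (URL taken from this thread; the path layout is an assumption.)
rows = [
    {"url": "https://osf.io/6p8vr/download",
     "path": "sub-11/eeg/sub-11_task-matchingpennies_channels.tsv"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["url", "path"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue(), end="")
```

With a file like this, `datalad addurls mp.csv '{url}' '{path}'` can populate the dataset without any custom tooling.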
Wow, this is really great @kyleam thanks!
This seems to have worked to a large extent! I have made a CSV file with my file paths and urls ("mp.csv") and made a datalad dataset:
datalad create eeg_matchingpennies
datalad addurls mp.csv "{url}" "{fpath}" -d eeg_matchingpennies/
(Note: I did not commit the csv to the repo, because I thought it's not necessary)
There seems to be a bug however with some of the files:
cd eeg_matchingpennies
git annex whereis
for some files, this prints several links, all except one of which are wrong, e.g.:
whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (1 copy)
00000000-0000-0000-0000-000000000001 -- web
web: https://osf.io/4safg/download
web: https://osf.io/5cfmh/download
web: https://osf.io/6p8vr/download
web: https://osf.io/nqjfm/download
web: https://osf.io/qvze6/download
ok
I checked the CSV file, and it does not seem to be the source of the error. Can either of you reproduce this error @yarikoptic @kyleam ?
Separate question: I continued as @kyleam suggested to make a local clone and remove the origin, to get a publishable git-annex dataset with only the "web" source of the data.
See: https://github.com/sappelhoff/bogus
apparently something went wrong - can you tell me what I should do?
After cloning and removing the origin, I did (with the clone):
git remote add origin https://github.com/sappelhoff/bogus
git push origin master
When I realized that this does not look right, I figured that `datalad publish` might be the way to go, so I tried (on top of the previous steps):
datalad publish . --to origin --force
But all that gave me was a cryptic "git-annex" branch ...
I now want to use `datalad install https://github.com/sappelhoff/bogus`. Do I first have to merge the `git-annex` branch into `master`? Do I leave both branches untouched? Is this the right way to go at all?
Just go ahead with `datalad install https://github.com/sappelhoff/bogus`. The `git-annex` branch should never be merged into any normal branch; leave it for git-annex to deal with.
@sappelhoff:
> for some files, this prints several links, all except one are wrong, E.g.: [...] I checked the CSV file, and it does not seem to be the source of the error. Can either of you reproduce this error @yarikoptic @kyleam ?
Hrm, that's odd. I tried with `--fast` first, and all of the urls look ok on my end (i.e., I see only one web entry for each file). Here's the one from the example:
$ git annex whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv
whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (1 copy)
00000000-0000-0000-0000-000000000001 -- web
web: https://osf.io/6p8vr/download
ok
I'm trying now without `--fast`.
I'm running this with datalad 0.11.6 and git-annex 7.20190730+git2-ga63bf35dc-1~ndall+1 on GNU/Linux. What's your version info?
Thanks Yaroslav, I'll try that later!
@kyleam I am using datalad installed from my clone of master (via `pip install -e .`). Good to hear that it works with `--fast` ... I am excited about what you'll see without it. However, reading what `--fast` does, I should perhaps have used that in the first place, because I am later on purging the local data anyhow :-)
> good to hear that it works with `--fast` ... I am excited what you'll see without it.
Without `--fast` I see repeats, including the example you point to:
$ git annex whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv
whereis eeg_matchingpennies/sub-11/eeg/sub-11_task-matchingpennies_channels.tsv (2 copies)
00000000-0000-0000-0000-000000000001 -- web
24081d41-a5ee-434b-a58a-4401106dc189 -- foo [here]
web: https://osf.io/4safg/download
web: https://osf.io/5cfmh/download
web: https://osf.io/6p8vr/download
web: https://osf.io/nqjfm/download
web: https://osf.io/qvze6/download
ok
It seems there has to be something going wrong in the underlying `git annex addurl --batch` call, but I don't know whether it's on our end (in AnnexRepo, not addurls.py) or on git-annex's. Some time next week I'll try to see if I can trigger the issue using git-annex directly.
Aah, it should've occurred to me sooner, but that could happen if those keys have the same content, and the files indeed point to the same key for all the cases I've checked. So I think things are working as expected.
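That explanation can be illustrated with a toy model: git-annex stores one key per unique content, and every registered URL is attached to that key, so five files with identical content legitimately collapse into one key carrying five web URLs. A rough sketch (not git-annex's real data structures; the URLs here are made up):

```python
import hashlib
from collections import defaultdict

# Toy model: key = content hash; "whereis" lists every URL known for a key.
url_to_content = {
    # Five files fetched from five (invented) URLs with identical content.
    f"https://osf.io/id{i}/download": b"identical channels.tsv content"
    for i in range(5)
}

key_to_urls = defaultdict(list)
for url, content in url_to_content.items():
    key = hashlib.sha256(content).hexdigest()
    key_to_urls[key].append(url)

# One unique key, five web URLs attached to it, matching the whereis output.
print(len(key_to_urls))                                # 1
print(sum(len(v) for v in key_to_urls.values()))       # 5
```

So the "repeats" are not an addurls bug; they are all valid sources for the same content.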
@sappelhoff:
> However, reading what `--fast` does, I should perhaps have used that in the first place, because I am later on purging the local data anyhow :-)
It's more expensive, but leaving out `--fast` buys you a content guarantee. With `--fast`, future downloads will only verify that the file has the expected size.
You can see this difference by looking at the link targets. Without `--fast`, you get a file that points to the key generated from the file's content:
test3 -> .git/annex/objects/wj/6x/SHA256E-s250--dd8[...]7a0/SHA256E-s250--dd8[...]7a0
With `--fast`, the target only encodes the size:
test4 -> '.git/annex/objects/81/K7/URL-s250--https&c%%osf.io%5cfmh%download/URL-s250--https&c%%osf.io%5cfmh%download'
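The two key styles can be mimicked to show what each one commits to. This is only an approximation of git-annex's key naming (the escaping of ':' as '&c' and '/' as '%' is inferred from the example target above); don't rely on it beyond illustration:

```python
import hashlib

content = b"some file content"
url = "https://osf.io/5cfmh/download"

# Without --fast: a SHA256E-style key pins the exact content.
checksum_key = f"SHA256E-s{len(content)}--{hashlib.sha256(content).hexdigest()}"

# With --fast: a URL-style key records only the size plus the (escaped) URL,
# so later downloads can only be checked against the size.
escaped = url.replace(":", "&c").replace("/", "%")
fast_key = f"URL-s{len(content)}--{escaped}"

print(checksum_key[:12])  # SHA256E-s17-
print(fast_key)           # URL-s17--https&c%%osf.io%5cfmh%download
```

If the content on OSF ever changed silently, only the checksum-style key would catch it.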
> that could happen if those keys have the same content, and the files indeed point to the same key for all the cases I've checked
Interesting, thanks for the detective work!
> `--fast` buys you a content guarantee.
okay, that's something I would like. That also explains why we don't see duplicates with `--fast`.
I think I found the reason why my CSV was never populated ... It seems like this repo is MRI-centric and only `.nii.gz` files were expected to be loaded from OSF. That should be easy to fix!
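If the culprit is a hard-coded extension check, the fix is presumably to make the accepted suffixes configurable. A hypothetical sketch of such a filter (datalad-osf's real code surely differs; the function and parameter names here are invented):

```python
# Hypothetical filter: instead of hard-coding '.nii.gz', accept any suffixes.
def wanted(path, suffixes=(".nii.gz",)):
    """Return True if path ends with one of the accepted suffixes.
    An empty suffixes tuple means: take every file."""
    return not suffixes or path.endswith(tuple(suffixes))

# An MRI-centric default skips EEG files...
print(wanted("sub-11/eeg/sub-11_task-matchingpennies_channels.tsv"))  # False
# ...but a configurable filter keeps them.
print(wanted("sub-11/eeg/sub-11_task-matchingpennies_channels.tsv",
             suffixes=(".tsv", ".vhdr", ".eeg", ".vmrk")))            # True
```

Passing an empty suffix tuple (take everything) would be the least surprising default for a generic OSF-to-annex tool.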
Hi, I am trying to turn an OSF directory into a git annex repository and datalad-osf seems to be great for this.
I am not entirely sure whether this would work as I think it would; basically I expect:

1. `datalad create`
2. `update_recursive` with my OSF key and the directory I want to be git annexed
3. `datalad install` and `get` data (that is stored on OSF, but indexed in my git annex repository)

Can someone tell me whether I am just completely misunderstanding / misusing the pipeline? @yarikoptic @rciric
Is there a simpler way to achieve what I want?
Apart from this, this is also a bug report. Here are the steps to reproduce:

1. `mkdir mystuff`
2. `cd mystuff`
3. `datalad create`
4. into `mystuff`, put the following Python file `try.py`
5. run `python try.py`

... after a considerable amount of time, this provides me with the following error message: