ropensci / unconf18

http://unconf18.ropensci.org/

low-friction private data share & data publication #51

Open cboettig opened 6 years ago

cboettig commented 6 years ago

I'd love to have a robust and simple way to deal with data associated with a project.

For individual data files < 50 Mb, I have bliss. I can commit these files to a private GitHub repo with ease; they are automatically available to Travis for any checks (no encoding private credentials); I can flip a switch and get a DOI for every new release once I make my data public.

Or as another concrete example: my students are all asking me how to get their ~200 MB spatial data files onto the private Travis builds we use in class.

For larger data files, life is not so sweet. The alternatives and their pitfalls, as I see them:

Other scientific repositories with less ideal solutions:

Things that might be strategies but somehow never work well for me in this context:

amoeba commented 6 years ago

Great idea! What about encrypted Dat?

This would mean CI systems download data from peers rather than from fast CDNs (as with S3/GitHub/etc.), which could mean slow builds for some proportion of users.

Pakillo commented 6 years ago

Apart from Dat, another option could be OSF: http://journals.sagepub.com/doi/full/10.1177/2515245918757689

khondula commented 6 years ago

Another option to be aware of: I think Dataverse repositories can accept individual files up to 2.5 GB and datasets up to 10 GB (according to Nature, though the docs say the "file upload limit size varies by installation"). Anyone can submit to the flagship Harvard Dataverse, or institutions can set up their own repositories. There is some discussion of setting dataset- and file-level permissions here.
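
For the R side, a rough sketch of pulling a file from a Dataverse installation with the dataverse package (mentioned below). The DOI and file name are placeholders, and the exact arguments may differ across package versions:

```r
# Rough sketch only: the DOI and file name are placeholders, and exact
# arguments may differ across versions of the dataverse package.
library(dataverse)

# Point at an installation; an API token is only needed for private/draft data
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu",
           "DATAVERSE_KEY"    = "your-api-token")

ds <- get_dataset("doi:10.7910/DVN/XXXXXX")                             # dataset metadata
f  <- get_file("spatial_data.zip", dataset = "doi:10.7910/DVN/XXXXXX")  # raw bytes
writeBin(f, "spatial_data.zip")
```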

sckott commented 6 years ago

I like the idea of dat, though it's not clear what the status of the project is (seems all the main people have left?). Are there any similar projects?

I'd think we should steer away from figshare as they haven't really been supporting their API

One thing that came to mind was http://academictorrents.com/ - though it seems very tenuous, and the login doesn't use https, which is scary. There was also BioTorrents, but even the domain name is gone now. Anyway, in general, perhaps a torrent solution could help in this space, though I imagine many universities would by default block any torrent use from their members' IP addresses.

noamross commented 6 years ago

It seems to me that the way to go with these is as common a front-facing API as possible with back-ends across multiple repositories which would handle DOI-->file navigation. Back-end repositories have their pros and cons in terms of private/public, file size, whether they have institutional archival arrangements, need for local software installation, DOI availability and reservation, etc. It would be a good start to tabulate these so that people can know about them, and then prioritize those to focus back-end development on. I think OSF checks most of the boxes, but people will differ. This could work for even some more stringent/specialized archives like KNB/DataONE.

The front-end might be datastorr-like in functionality and maybe API, but not tied to GitHub. You stash your API key or even your preferred back-end as environment variables, and then `data_release()`, `data_update()`, `dataset(id, ...)`, `set_public()`, `set_private()`, etc.
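
Just to make the shape of that concrete, a minimal sketch: none of these functions exist yet, the back-ends are stubs, and the verb names are simply the ones floated above.

```r
# Minimal sketch: nothing here exists yet; the back-ends are stubs showing
# only the shape of the dispatch.
backends <- list(
  zenodo    = function(path, ...) stop("Zenodo back-end not implemented"),
  osf       = function(path, ...) stop("OSF back-end not implemented"),
  dataverse = function(path, ...) stop("Dataverse back-end not implemented")
)

# The user picks a repository once, via environment variables, e.g.
# Sys.setenv(DATA_BACKEND = "osf", DATA_TOKEN = "...")

data_release <- function(path, ...,
                         backend = Sys.getenv("DATA_BACKEND", "zenodo")) {
  backends[[backend]](path, ...)
}

# data_update(), dataset(id, ...), set_public(), set_private() would dispatch
# the same way, so swapping repositories means changing one variable, not code.
```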

I'm not sure why the fact that they're enterprise-focused is a reason to avoid figshare. If lots of institutions use them, that's good. Amazon is sure enterprise-focused! If I recall correctly, rfigshare development halted for a bit because their API was unstable at one point, but they cover a lot of the bases.

Re: Dataverse, which also seems good, I note that Thomas Leeper is looking for a new maintainer for the dataverse R package.

noamross commented 6 years ago

Another good thing to assess for all the back-ends is their usage of common metadata standards, both for ease of development and for long-term sustainability and compatibility across services.

mpadge commented 6 years ago

So I've been talking to some sociologist friends about this, and they share a major concern that is not unique:

  1. Data sets are (often, and for them, almost always) very expensive to collect.
  2. Many funding agencies and/or journals now require data sets to be openly published.
  3. This means that they in effect get one go at a good paper from their data before it's released for general plundering and pillage, ultimately negatively impacting their research.

A solution we discussed is a means of tracking and thereby always reliably ascribing data provenance, in effect ensuring that people otherwise suffering such effects would automatically be listed as authors of any papers using their data. And so...

A solution

A tangle (see whitepaper here) potentially offers a perfect system for tracking data provenance.

An unconf proposal

Software to construct/modify tangle technology specifically for the sharing of data, ensuring maintenance and attribution of provenance. Sharing/accessing a data set simply becomes a tangle transaction (see the sketch after the list below). Advantages:

  1. Obviates any need for most of the above discussions because data access is P2P
  2. Meta-data on provenance always maintained
  3. Generators of data can always inquire about copies of their data, and/or standards can readily be set through ensuring citation refers to a tangle transaction.
  4. The whole system is a graph, so (not too far down the track) the whole tangle of data and metadata will be able to be searched with GraphQL, offering a big boost to #26.
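
To make the "sharing is a transaction" idea a bit more concrete, here is a toy sketch of the general shape (not the IOTA protocol itself): each share is a record holding a content hash of the data plus references to two earlier transactions, so provenance forms a DAG that can always be walked back to the original depositor. The digest package and all field names here are my own choices.

```r
# Toy illustration only: not the IOTA protocol, just the general shape of a
# tangle-style provenance record. Field names are invented.
library(digest)

new_transaction <- function(data_file, depositor, approves = character(2)) {
  rec <- list(
    data_hash = digest(data_file, algo = "sha256", file = TRUE),  # content hash of the data
    depositor = depositor,
    approves  = approves,     # ids of two earlier transactions (the DAG edges)
    timestamp = Sys.time()
  )
  rec$id <- digest(rec, algo = "sha256")  # the record's own id
  rec
}

# genesis <- new_transaction("survey_wave1.csv", "original_team")
# reuse   <- new_transaction("survey_wave1.csv", "secondary_user",
#                            approves = c(genesis$id, genesis$id))
```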

cboettig commented 6 years ago

All great ideas here. I really like the approach @noamross outlines of identifying some core functionality that could be expressed in a generic front-end package, allowing the user to swap in their preferred 'back-end' storage choice, whether it's a DOI-providing registry like Zenodo, a paid service like S3, or a blockchains-take-over-the-world thing.

karawoo commented 6 years ago

Just catching up on the issues now. To add to @cboettig's list of options, there's also Synapse, which is a project of my employer. Free, allows private sharing and public-ish sharing (downloading public data still requires a Synapse account), supports simple versioning and provenance, DOIs, etc. Because much of our data relates to human health, users must pass a quiz on responsible data governance to become "certified users" before they can upload data.