ropensci / unconf18

http://unconf18.ropensci.org/

low-friction private data share & data publication #51

Open cboettig opened 6 years ago

cboettig commented 6 years ago

I'd love to have a robust and simple way to deal with data associated with a project.

For individual data files < 50 Mb, I have bliss. I can commit these files to a private GitHub repo with ease; they are automatically available to Travis for any checks (no encoding private credentials); I can flip a switch and get a DOI for every new release once I make my data public.

Or as another concrete example: my students are all asking me how to get their ~200 MB spatial data files onto the private Travis builds we use in class.

For larger data files, life is not so sweet. The alternatives and their pitfalls, as I see them:

Other scientific repositories with less ideal solutions:

Things that might be strategies but somehow never work well for me in this context:

amoeba commented 6 years ago

Great idea! What about encrypted Dat?

This would mean CI systems download data from peers rather than from fast CDNs (as with S3/GitHub/etc.), which could mean slow builds for some proportion of users.

Pakillo commented 6 years ago

Apart from Dat, another option could be OSF: http://journals.sagepub.com/doi/full/10.1177/2515245918757689

khondula commented 6 years ago

Another option to be aware of: I think Dataverse repositories can accept individual files up to 2.5 GB and datasets up to 10 GB (according to Nature, though the docs say the "file upload limit size varies by installation"). Anyone can submit to the flagship Harvard Dataverse, or institutions can set up their own repositories. There is some discussion of setting dataset- and file-level permissions here.
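
For the R side, a rough sketch of pulling a file from a Dataverse installation with the dataverse package (mentioned below). The DOI and file name are placeholders, and the exact arguments may differ across package versions:

```r
# Rough sketch only: the DOI and file name are placeholders, and exact
# arguments may differ across versions of the dataverse package.
library(dataverse)

# Point at an installation; an API token is only needed for private/draft data
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu",
           "DATAVERSE_KEY"    = "your-api-token")

ds <- get_dataset("doi:10.7910/DVN/XXXXXX")                             # dataset metadata
f  <- get_file("spatial_data.zip", dataset = "doi:10.7910/DVN/XXXXXX")  # raw bytes
writeBin(f, "spatial_data.zip")
```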

sckott commented 6 years ago

I like the idea of dat, though it's not clear what the status of the project is (seems all the main people have left?). Are there any similar projects?

I'd think we should steer away from figshare as they haven't really been supporting their API

One thing that came to mind was http://academictorrents.com/ - though it seems very tenuous, and the login doesn't use https, which is scary. There was also BioTorrents, but even the domain name is gone now. Anyway, in general, perhaps a torrent solution could help in this space, though I imagine many universities would by default block any torrent use from their members' IP addresses.

noamross commented 6 years ago

It seems to me that the way to go with these is as common a front-facing API as possible with back-ends across multiple repositories which would handle DOI-->file navigation. Back-end repositories have their pros and cons in terms of private/public, file size, whether they have institutional archival arrangements, need for local software installation, DOI availability and reservation, etc. It would be a good start to tabulate these so that people can know about them, and then prioritize those to focus back-end development on. I think OSF checks most of the boxes, but people will differ. This could work for even some more stringent/specialized archives like KNB/DataONE.

The front-end might be datastorr-like in functionality and maybe API, but not tied to GitHub. You stash your API key or even your preferred back-end as environment variables, and then `data_release()`, `data_update()`, `dataset(id, ...)`, `set_public()`, `set_private()`, etc.
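
Just to make the shape of that concrete, a minimal sketch: none of these functions exist yet, the back-ends are stubs, and the verb names are simply the ones floated above.

```r
# Minimal sketch: nothing here exists yet; the back-ends are stubs showing
# only the shape of the dispatch.
backends <- list(
  zenodo    = function(path, ...) stop("Zenodo back-end not implemented"),
  osf       = function(path, ...) stop("OSF back-end not implemented"),
  dataverse = function(path, ...) stop("Dataverse back-end not implemented")
)

# The user picks a repository once, via environment variables, e.g.
# Sys.setenv(DATA_BACKEND = "osf", DATA_TOKEN = "...")

data_release <- function(path, ...,
                         backend = Sys.getenv("DATA_BACKEND", "zenodo")) {
  backends[[backend]](path, ...)
}

# data_update(), dataset(id, ...), set_public(), set_private() would dispatch
# the same way, so swapping repositories means changing one variable, not code.
```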

I'm not sure why the fact that they're enterprise-focused is a reason to avoid figshare. If lots of institutions use them, that's good. Amazon is sure enterprise-focused! If I recall correctly, rfigshare development halted for a bit because their API was unstable at one point, but they cover a lot of the bases.

Re: Dataverse, which also seems good, I note that Thomas Leeper is looking for a new maintainer for the dataverse R package.

noamross commented 6 years ago

Another good thing to assess for all the back-ends is their usage of common metadata standards, both for ease of development and for long-term sustainability and compatibility across services.

mpadge commented 6 years ago

So I've been talking to some sociologist friends about this, and they share a major concern that is not unique:

  1. Data sets are (often, and for them, almost always) very expensive to collect.
  2. Many funding agencies and/or journals now require data sets to be openly published.
  3. This means that they in effect get one go at a good paper from their data before it's released for general plundering and pillage, ultimately negatively impacting their research.

A solution we discussed is a means of tracking and thereby always reliably ascribing data provenance, in effect ensuring that people otherwise suffering such effects would automatically be listed as authors of any papers using their data. And so...

A solution

A tangle (see whitepaper here) potentially offers a perfect system for tracking data provenance.

An unconf proposal

Software to construct/modify tangle technology specifically for the sharing of data, ensuring maintenance and attribution of provenance. Sharing/accessing a data set simply becomes a tangle transaction (see the sketch after the list below). Advantages:

  1. Obviates any need for most of the above discussions because data access is P2P
  2. Meta-data on provenance always maintained
  3. Generators of data can always inquire about copies of their data, and/or standards can readily be set through ensuring citation refers to a tangle transaction.
  4. The whole system is a graph, so (not too far down the track) the whole tangle of data and metadata will be able to be searched with GraphQL, offering a big boost to #26.
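
To make the "sharing is a transaction" idea a bit more concrete, here is a toy sketch of the general shape (not the IOTA protocol itself): each share is a record holding a content hash of the data plus references to two earlier transactions, so provenance forms a DAG that can always be walked back to the original depositor. The digest package and all field names here are my own choices.

```r
# Toy illustration only: not the IOTA protocol, just the general shape of a
# tangle-style provenance record. Field names are invented.
library(digest)

new_transaction <- function(data_file, depositor, approves = character(2)) {
  rec <- list(
    data_hash = digest(data_file, algo = "sha256", file = TRUE),  # content hash of the data
    depositor = depositor,
    approves  = approves,     # ids of two earlier transactions (the DAG edges)
    timestamp = Sys.time()
  )
  rec$id <- digest(rec, algo = "sha256")  # the record's own id
  rec
}

# genesis <- new_transaction("survey_wave1.csv", "original_team")
# reuse   <- new_transaction("survey_wave1.csv", "secondary_user",
#                            approves = c(genesis$id, genesis$id))
```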

cboettig commented 6 years ago

All great ideas here. I really like the approach @noamross outlines of identifying some core functionality that could be expressed in a generic front-end package, allowing the user to swap in their preferred 'back-end' storage choice, whether it's a DOI-providing registry like Zenodo, a paid service like S3, or a blockchains-take-over-the-world thing.

karawoo commented 6 years ago

Just catching up on the issues now. To add to @cboettig's list of options, there's also Synapse, which is a project of my employer. Free, allows private sharing and public-ish sharing (downloading public data still requires a Synapse account), supports simple versioning and provenance, DOIs, etc. Because much of our data relates to human health, users must pass a quiz on responsible data governance to become "certified users" before they can upload data.