traitecoevo / data_versioning

An approach for practical and simple data versioning in R

Figure: illustrates semantic versioning #10

Closed dfalster closed 7 years ago

dfalster commented 7 years ago

This figure should do two things. First, explain what the version numbers mean:

Second, show a series of releases, e.g.

wcornwell commented 7 years ago

I like the top one as a figure; bottom one is too git-y for a general audience?

wcornwell commented 7 years ago

not sure if this is #10 or #11

image

wcornwell commented 7 years ago

updated version:

image

dfalster commented 7 years ago

wcornwell commented 7 years ago

add some text in "A lightweight, cheap and scalable workflow for versioned data" section

wcornwell commented 7 years ago

image

dfalster commented 7 years ago

Nice!

wcornwell commented 7 years ago

still need to add DOIs:

some discussion of versioned dois here: https://blogs.openaire.eu/?p=2010

wcornwell commented 7 years ago

figshare disagrees: https://support.figshare.com/support/solutions/articles/6000079064-can-i-edit-or-delete-my-research-after-it-has-been-made-public-

hmmmm

wcornwell commented 7 years ago

from the zenodo people:

Why don’t the DOIs have a version number suffix like “.v1”?

Including semantic information such as the version number in a DOI is bad practice, because this information may change over time, while DOIs must remain persistent and should not change.

Moreover, Zenodo DOI versioning is linear, which means that the Zenodo version number may in fact not be the real version number of the resource. Take for instance software, where it is common practice to have dot versions and make new releases in a non-linear order (e.g. first v1.0, then v1.1, then v2.0, then v1.2).

The versioning suffix is also not a functionality of the DOI system, i.e. adding .v2 to a DOI will not resolve to version 2 of a resource for any DOI from any provider. Different providers also use different patterns, e.g. .v2, .2, /2.

Most importantly, version suffixes are not machine readable. A discovery system that understands DOIs will not know that .v1 and .v2 of a DOI are in fact two versions of the same resource.

A better solution to this problem is to semantically link two DOIs in the metadata of a DOI. This ensures that discovery systems have a machine readable way to discover that two DOIs are versions of the same resource.

See also “Cool DOIs”, a blog post by Martin Fenner, DataCite Technical Director: https://doi.org/10.5438/55E5-T5C0
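To make the "linear vs. real version number" point concrete, here is a minimal Python sketch. The release order is the example from the Zenodo text quoted above; the helper `parse_version` and the variable names are made up for illustration and are not part of Zenodo, datastorr, or any library.

```python
def parse_version(tag):
    """Turn a tag such as 'v1.2' into a tuple of ints, e.g. (1, 2)."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


# Order in which the releases were made (example from the quoted Zenodo FAQ).
release_order = ["v1.0", "v1.1", "v2.0", "v1.2"]

# A linear scheme simply numbers deposits in upload order: 1, 2, 3, 4.
linear_index = {tag: i for i, tag in enumerate(release_order, start=1)}

# Ordering by the semantic version itself gives a different sequence.
semantic_order = sorted(release_order, key=parse_version)

for tag in release_order:
    print(f"{tag}: linear number {linear_index[tag]}, "
          f"semantic rank {semantic_order.index(tag) + 1}")
```

In this sketch v1.2 gets the highest linear number even though v2.0 is the highest semantic version, which is why a sequential suffix cannot stand in for the real version number.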

wcornwell commented 7 years ago

image

cboettig commented 7 years ago

Also check out http://blog.zenodo.org/2017/05/30/doi-versioning-launched/ , and more details at: http://help.zenodo.org/#versioning (which now I see you've already quoted above)

Recall that if you turn on the Zenodo integration switch (much like a Travis switch), Zenodo will automatically assign a (now versioned) DOI to each GitHub release. I think this works super nicely with the datastorr package, since the user doesn't need to do anything on Zenodo or GitHub: adding a tag on datastorr would give you the DOI automatically.

Zenodo has had this behavior for a while, but the versioning now means that (a) the DOIs are numbered sequentially, and (b) most importantly, if you go to one DOI you get notified about the DOIs of newer and older versions. You also get a DOI that always points to the most recent version.
It's this grouping, and the seamless automation of getting new DOIs, that is really key; the semantic debate is more of a distraction one way or another (e.g. it's perfectly fine to have semantic-versioned GitHub tags/releases that are archived on Zenodo with its sequential but not semantic DOIs).

dfalster commented 7 years ago

Interesting debate! Thanks for sharing links and thoughts @cboettig @wcornwell. The post on Cool DOIs was super useful. I was taken by the argument that DOIs should not contain semantic information.

@wcornwell: in your graphic perhaps change the top line to "

Some points I mentioned to Will last week

One problem with using Zenodo and datastorr is that it doesn't allow build artefacts to be archived, only the "code" part of the GH repo. This means it doesn't always work for datastorr. For example, in the BAAD the final compiled dataset is not under version control, but is instead built (like a binary) from its parts and uploaded to the GH release. I could change this so that the built artefact is under version control as well. Or we recommend that anyone using that type of workflow use another provider.

I imagine @richfitz would be against putting the built object under version control, as it's wasteful and counter to common practice. But I can see some argument for it, as you would then be able to see the changes to the final dataset in the commit history, as well as to the raw "code". Thoughts?

cboettig commented 7 years ago

@dfalster great point on archiving of built objects, see https://github.com/zenodo/zenodo/issues/1235