phiweger / zoo

A portable data structure for rapid prototyping in (viral) bioinformatics (under development).

data versioning and decay #63

Open phiweger opened 7 years ago

phiweger commented 7 years ago
  1. many of the ideas behind zoo have been around for some time in non-biology fields, it seems (see the link collection below)
  2. when versioning data, one quickly accumulates a lot of (older versions of) data, with the history outgrowing the real dataset rather quickly - introduce data decay, say with a half-life (t_50) of {1, 5, 10} years for historical records (see the sketch after this list)
  3. the spec we use for data cells is apparently called ndjson, short for "newline-delimited JSON"
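As a minimal sketch of what such decay could look like, assuming an exponential half-life model over ndjson history records - nothing like this exists in zoo yet, and the `modified` field name is made up for illustration:

```python
import json
from datetime import datetime, timezone

def decay_weight(age_years, t50_years):
    """Exponential decay: a record loses half its weight every t50 years."""
    return 0.5 ** (age_years / t50_years)

def prune_history(ndjson_path, t50_years=5, cutoff=0.1):
    """Keep only historical records whose decay weight is still above a cutoff.

    Assumes one JSON object per line (ndjson) with an offset-aware ISO-8601
    'modified' timestamp - a hypothetical field name, just for illustration.
    """
    now = datetime.now(timezone.utc)
    kept = []
    with open(ndjson_path) as f:
        for line in f:
            record = json.loads(line)
            modified = datetime.fromisoformat(record['modified'])
            age_years = (now - modified).days / 365.25
            if decay_weight(age_years, t50_years) >= cutoff:
                kept.append(record)
    return kept
```

With t_50 = 5 years and a cutoff of 0.1, a historical record would be dropped after roughly 17 years; the {1, 5, 10} year half-lives above just tune how aggressively the history shrinks.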

links:

http://stackoverflow.com/questions/4185105/ways-to-implement-data-versioning-in-mongodb

https://github.com/thiloplanz/v7files/wiki/Vermongo

https://github.com/leeper/data-versioning

[...] note that the Open Knowledge Foundation seems to favor JSON as a storage format.

https://specs.frictionlessdata.io/data-package/

https://github.com/frictionlessdata/datapackage-py

http://frictionlessdata.io/tools/

https://github.com/frictionlessdata/dpm-js

https://github.com/frictionlessdata

http://guides.dataverse.org/en/latest/developers/unf/index.html

http://frictionlessdata.io/case-studies/dataship/

phiweger commented 7 years ago
  1. store only diffs in the history, see e.g. the links below (and the sketch after them)

https://github.com/paulfitz/daff

http://paulfitz.github.io/daff/

pip install daff
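For illustration, a minimal stdlib-only sketch of the "store only diffs" idea (not using daff itself; daff would additionally give proper tabular diff rendering and patching). The record fields and the `_history` key are made up for this example:

```python
def diff_record(old, new):
    """Fields that changed or were removed between two versions of a record."""
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    removed = [k for k in old if k not in new]
    return {'changed': changed, 'removed': removed}

# Hypothetical metadata for one sequence, two versions apart:
v1 = {'id': 'seq1', 'host': 'human', 'country': 'DE'}
v2 = {'id': 'seq1', 'host': 'human', 'country': 'FR', 'year': 2017}

# The document keeps its current state plus a compact diff history,
# instead of a full copy of every older version.
doc = dict(v2)
doc['_history'] = [diff_record(v1, v2)]
print(doc['_history'])
# [{'changed': {'country': 'FR', 'year': 2017}, 'removed': []}]
```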

phiweger commented 7 years ago

ad 2. with money, this kind of decay is known from ("natural") money that carries a decay fee - the analogy in data would be usage, or "use it or lose it": if data is not used (= changed/downloaded), we observe atrophy of "natural data"

a data cell/ package/ ... could then become a token on some public ledger - with this scheme, data would decay but could replicate (through people creating forks), just like all living things decay but pass their information on through offspring "forks"

phiweger commented 7 years ago

ad 3. a single observation is a line in a file. if this observation is changed, it becomes a new observation and should be appended to the file instead of changed in place. however, with this (simple) model the file grows large quickly, so we need something like decay (2.) to take care of that (a sketch of the append-only model below)
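A minimal sketch of that append-only model, assuming each observation is one ndjson line carrying an `_id` and a timestamp (both field names are hypothetical); when reading the file back, the last line per id wins:

```python
import json
from datetime import datetime, timezone

def append_observation(path, record):
    """Never change a line in place: a changed observation is appended as a new line."""
    record['_modified'] = datetime.now(timezone.utc).isoformat()
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')

def latest_observations(path):
    """Replay the file; for each _id the last appended line wins."""
    latest = {}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            latest[record['_id']] = record
    return latest
```

The pruning sketch from the first comment (decay by half-life) would then periodically compact exactly this file.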

phiweger commented 7 years ago

ctb's thoughts on the subject: "How I learned to stop worrying and love the coming archivability crisis in scientific software"

My conclusion is that, on a decadal time scale, we cannot rely on software to run repeatably.

// see the other articles in the mini-series

acknowledge that exact repeatability has a half life of utility, and that this is OK.

I've only just started thinking about this in detail, but it is at least plausible to argue that we don't really care about our ability to exactly re-run a decade old computational analysis. What we do care about is our ability to figure out what was run and what the important decisions were -- something that Yolanda Gil refers to as "inspectability." But exact repeatability has a short shelf-life.

// concepts mentioned: reproduce != repeat, data implies software, remixing, half life of software (and data, see data implies software), inspectability, see here

down the rabbit hole: 10 aspects of highly efficient scientific data