monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Evaluate existing ongoing efforts for packaging/containerizing data #496

Open cmungall opened 7 years ago

cmungall commented 7 years ago

Not sure this belongs in dipper tracker, but for want of a better place.

There are some emerging efforts that aim to treat data as code. Common themes are:

Some of these are quite trivial, but you could say github is a trivial web interface on top of git, and developers obviously love it.

Of course this overlaps with https://www.w3.org/TR/hcls-dataset/ but AFAICT that hasn't taken off, and there is no tooling associated with it.

It's also similar to my https://biodatasets.github.io/mybiocaddie/about/ project.

We should take a look at these with two perspectives

If the answer is yes, which improvements/changes would we want?

Frictionless

This seems fairly lightweight and open: http://frictionlessdata.io/

It doesn't provide any storage; it's just a set of standards for how you mark up and bundle a csv, plus some simple tools to help create or consume bundles.
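To make "mark up and bundle a csv" concrete, here is a minimal sketch of what such a bundle could look like: a CSV plus a `datapackage.json` descriptor that names the file and its columns. It uses only the Python standard library rather than any Frictionless tooling. The file name, package name, sample row, and the choice to type every column as `string` are all assumptions for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def build_descriptor(csv_path: Path, package_name: str) -> dict:
    """Build a minimal Data Package style descriptor for a single CSV file."""
    with csv_path.open(newline="") as handle:
        header = next(csv.reader(handle))  # first row holds the column names
    return {
        "name": package_name,
        "resources": [
            {
                "name": csv_path.stem,
                "path": csv_path.name,  # relative to the descriptor's directory
                "format": "csv",
                # Assumption: every column is declared as a string. A real
                # descriptor would normally carry more specific types.
                "schema": {
                    "fields": [{"name": col, "type": "string"} for col in header]
                },
            }
        ],
    }


def write_package(directory: Path, csv_name: str, package_name: str) -> Path:
    """Write datapackage.json next to the CSV and return its path."""
    descriptor = build_descriptor(directory / csv_name, package_name)
    out = directory / "datapackage.json"
    out.write_text(json.dumps(descriptor, indent=2))
    return out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        # Made-up sample data, purely illustrative.
        (folder / "interactions.csv").write_text(
            "source,interaction,target\nT. rex,eats,Triceratops\n"
        )
        print(write_package(folder, "interactions.csv", "example-interactions").read_text())
```

The directory holding the CSV and the descriptor is the whole "package"; there is no server or registry involved, which is why this feels closer to git than to github.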

It all seems well thought out, but they have zero examples on the site, which is frustrating. I want to be able to search for data packages. Of course this is harder because they don't centralize, which is arguably good. More like git for data than github for data.

Seems to be managed by a non-profit.

csv and json seem to be privileged. But it seems RDF would also work.

Quilt

This is similar and seems to give you a bit of extra abstraction, but seems quite tied to python at the moment:

https://quiltdata.com/

It's kind of more like an npm for data

I made this package of some globi interactions: https://quiltdata.com/package/cmungall/dinosaur_biotic_interactions

Seems much more centralized.

data.world

This is really slick, but requires logging in -- hmmm, looks like they are trying to make a giant silo they can monetize?

https://data.world

osfclient

"A scholarly commons to connect the entire research cycle"

Seems a bit broader in scope, like github, protocols.io, everything rolled into one.

pachyderm

This has an emphasis on containers and pipelines: http://pachyderm.io/

less relevant from a dipper perspective but worth paying attention to

lwinfree commented 7 years ago

Have you looked at Dat yet? https://datproject.org/ I know the people/non-profit behind it.

cmungall commented 7 years ago

Good addition!

chrisgorgo commented 7 years ago

https://git-lfs.github.com/

http://datalad.org/

https://git-annex.branchable.com/

cmungall commented 7 years ago

Thanks!

I forgot to add github itself. We're very happy using github for data VCS and packaging for smaller artefacts like ontologies, but for files over the 100M limit we found git-lfs wanting.
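As a rough illustration of where that limit bites, the sketch below walks a checkout and lists the files that exceed a size threshold, i.e. the ones that would need git-lfs or some other mechanism. The 100 MiB default is an assumption matching the "100M" figure above, and `oversized_files` is a hypothetical helper, not part of dipper or git.

```python
import os
import sys
from pathlib import Path

# Assumption: the limit is 100 MiB, matching the "100M" figure in the thread.
LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(root: Path, limit: int = LIMIT_BYTES) -> list[tuple[Path, int]]:
    """Return (relative path, size in bytes) for files larger than `limit`."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own object store; only the working tree is of interest.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = Path(dirpath) / name
            if not path.is_file():  # ignores broken symlinks
                continue
            size = path.stat().st_size
            if size > limit:
                hits.append((path.relative_to(root), size))
    # Largest files first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    checkout = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for rel_path, size in oversized_files(checkout):
        print(f"{size / 1024 / 1024:8.1f} MiB  {rel_path}")
```

Run it against a checkout directory; anything it prints is a candidate for git-lfs, git-annex, or one of the packaging tools listed above.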

Hadn't heard of datalad. Lots going on in this area.