monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Consider releasing triples as BDBags or BagIt-RO or SmartBags #551

Status: Open. cmungall opened this issue 6 years ago

cmungall commented 6 years ago

Currently we release all dipper-generated Turtle (ttl) files into a single folder, with a convention of 2-3 files per source (data, metadata, sample). The metadata follows W3C HCLS conventions.

Consider releasing using a standard bundle format, e.g. BDBag or BagIt-RO.

Could be at the level of datasets as well as aggregate.

Standards like these are likely to be adopted in the context of projects like the NIH Data Commons, but there may be some advantages for us too. Checksums are good practice. An explicit manifest is better than curling the whole directory and working out the file list by string matching. Bundles may also help with larger datasets where we want to take advantage of commons cloud infrastructure, e.g. to make the data available on multiple clouds.
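To make the checksum and manifest point concrete, here is a minimal sketch of writing and checking a BagIt-style `manifest-sha256.txt` with only the Python standard library. It assumes the ttl files have already been copied into a `data/` payload directory under a bag root; the helper names (`sha256_of`, `write_manifest`, `verify_manifest`) are hypothetical and not part of dipper. A complete bag also needs a `bagit.txt` declaration and usually `bag-info.txt`, which a library such as bdbag would normally generate.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large ttl dumps need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_root: Path) -> Path:
    """Write manifest-sha256.txt listing every file under bag_root/data."""
    payload_dir = bag_root / "data"
    lines = []
    for path in sorted(p for p in payload_dir.rglob("*") if p.is_file()):
        # Paths in the manifest are relative to the bag root, e.g. data/foo.ttl
        rel = path.relative_to(bag_root).as_posix()
        lines.append(f"{sha256_of(path)}  {rel}")
    manifest = bag_root / "manifest-sha256.txt"
    manifest.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return manifest


def verify_manifest(bag_root: Path) -> List[str]:
    """Return the payload paths that are missing or whose checksum differs."""
    failures = []
    manifest = bag_root / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel = line.split(maxsplit=1)
        target = bag_root / rel
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel)
    return failures
```

A consumer can then read the manifest to discover exactly which files belong to a release and call `verify_manifest` after download, instead of listing the directory and guessing from filenames.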

Consider also including explicit provenance that documents how we generated our triples from what source. E.g. https://github.com/ResearchObject/bagit-ro/blob/master/example1/metadata/provenance/results.prov.jsonld
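As a rough illustration of what such a provenance record might contain, the sketch below builds a small PROV-O description as JSON-LD: one source entity, one ingest activity, and one output entity derived from the source. The function name, the `#ingest` identifier, and the example file and URL are all hypothetical; only the `prov:` terms come from the W3C PROV-O vocabulary, and the linked bagit-ro example may structure its record differently.

```python
import json


def provenance_record(output_file: str, source_url: str, activity_label: str) -> dict:
    """Describe how one output file was generated from one source (PROV-O, JSON-LD)."""
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        },
        "@graph": [
            # The upstream file that was ingested
            {"@id": source_url, "@type": "prov:Entity"},
            # The ingest run that consumed the source
            {
                "@id": "#ingest",
                "@type": "prov:Activity",
                "rdfs:label": activity_label,
                "prov:used": {"@id": source_url},
            },
            # The generated triples file inside the bag
            {
                "@id": output_file,
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": {"@id": "#ingest"},
                "prov:wasDerivedFrom": {"@id": source_url},
            },
        ],
    }


if __name__ == "__main__":
    record = provenance_record(
        "data/example.ttl",                # hypothetical output path in the bag
        "https://example.org/source.tsv",  # hypothetical upstream source
        "example ingest run",
    )
    print(json.dumps(record, indent=2))
```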

I'm also exploring the idea of doing OBO ontology releases the same way.

@kltm @dougli1sqrd something to consider for GO?

cmungall commented 6 years ago

Update: we should use @stevencox/Helium's smartBag here:

https://github.com/NCATS-Tangerine/smartBag

In fact we should have a nimble way of consuming smartBags in dipper too. I will file a separate ticket for this.