monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Add a generic way of consuming SmartBags #579

Open cmungall opened 6 years ago

cmungall commented 6 years ago

BagIt provides a convenient way to bundle a set of files with minimal metadata, plus a means of controlling which files get downloaded. If all files were released this way, it would avoid the need to hardcode the manifest in dipper modules.
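As a rough illustration of why a bag could replace a hardcoded file list: a BagIt payload manifest (for example `manifest-sha256.txt`) is a plain text file with one `checksum path` pair per line, so a generic ingest step could discover and verify files from it. The sketch below assumes a SHA-256 manifest and an in-memory payload; `parse_manifest` and `verify_payload` are hypothetical helper names, not dipper or BagIt library functions.

```python
import hashlib


def parse_manifest(text):
    """Parse BagIt manifest text into a {relative_path: checksum} dict."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<checksum><whitespace><path>"; the path may contain spaces.
        checksum, path = line.split(None, 1)
        entries[path] = checksum.lower()
    return entries


def verify_payload(entries, payload):
    """Return the paths whose bytes are missing or do not match the manifest.

    `payload` is a {relative_path: bytes} dict standing in for files on disk
    (an assumption made to keep the sketch self-contained).
    """
    bad = []
    for path, expected in entries.items():
        data = payload.get(path)
        if data is None or hashlib.sha256(data).hexdigest() != expected:
            bad.append(path)
    return bad
```

With something like this, a dipper module would only need the bag location; the list of files to ingest would come from the manifest itself.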

SmartBag extends this with richer metadata and a way to provide a data-dictionary for any TSVs in the release, using JSON-LD. This would avoid the need for hardcoding info about column headers or ordering.
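To make the data-dictionary idea concrete, here is a minimal sketch of reading a TSV by header name through a JSON-LD-style dictionary rather than by column position. The dictionary layout, the `ex:` vocabulary, the column names, and `rows_by_property` are all assumptions made up for illustration; they are not SmartBag's actual schema.

```python
import csv
import io
import json

# Hypothetical data dictionary: maps TSV column headers to property IRIs.
# This layout is an assumption for illustration, not SmartBag's real format.
DATA_DICTIONARY = json.loads("""
{
  "@context": {"ex": "http://example.org/vocab/"},
  "columns": {
    "gene_id": "ex:gene",
    "phenotype_id": "ex:phenotype"
  }
}
""")


def rows_by_property(tsv_text, dictionary):
    """Yield each TSV row as {property: value}, looked up by header name."""
    columns = dictionary["columns"]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Columns missing from the dictionary are skipped rather than guessed.
        yield {columns[name]: value
               for name, value in row.items() if name in columns}
```

Because the lookup is by header, a release that reorders columns or adds a new one would yield the same mapped rows without any change to the ingest code.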

In fact, it may even be feasible to extend SmartBag to include a SPARQL query that will transform the RDF-ified TSV into a properly nested RDF structure, following the OBAN model or otherwise.

TomConlin commented 6 years ago

Without a discipline around datatypes and language tags, which I have yet to see in either our OWL or our RDF, I do not think SPARQL is going to be our new friend.

By the spec "foo" does not match "foo"@en

nor does

"foo" match with "foo"^^<http://www.w3.org/2001/XMLSchema#string>

nor

"foo"@en match "foo"^^<http://www.w3.org/2001/XMLSchema#string>

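Two literals (for example one in the triple store and one in our SPARQL query) are equal if and only if all of the following hold: the strings of the two lexical forms compare equal, character by character; either both or neither have language tags, and the language tags, if any, are equal; either both or neither have datatype URIs, and the two datatype URIs, if any, are equal, character by character.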

So without knowing in advance what you are going to find w.r.t. datatype and language tag, you cannot be sure a value is absent without exhausting every combination. And when you do get results back from one choice, you cannot be sure that other matches were not filtered out by happenstance.
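A toy model of the problem, as a sketch only: literals are compared on lexical form, language tag, and datatype together, so a lookup has to enumerate every form it is willing to accept. `Literal`, `candidate_literals`, and `find` are hypothetical names, and the list-based "store" stands in for a real triple store. Note that RDF 1.1 treats a plain literal as an `xsd:string`, so a store that follows it may collapse the plain and `xsd:string` cases; the model below keeps them distinct, as in the examples above.

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """Toy RDF literal; the generated __eq__ compares all three fields."""
    lexical: str
    lang: Optional[str] = None      # assumed to be lower-cased already
    datatype: Optional[str] = None


def candidate_literals(lexical, langs, datatypes):
    """Every literal form a query would have to try for one lexical value."""
    forms = [Literal(lexical)]
    forms += [Literal(lexical, lang=tag) for tag in langs]
    forms += [Literal(lexical, datatype=dt) for dt in datatypes]
    return forms


def find(store, lexical, langs=(), datatypes=()):
    """Return the stored literals matched by any candidate form."""
    wanted = set(candidate_literals(lexical, langs, datatypes))
    return [lit for lit in store if lit in wanted]
```

For a store holding only `"foo"@en` and `"foo"^^xsd:string`, `find(store, "foo")` returns nothing, and `find(store, "foo", langs=["en"])` returns one of the two without any hint that the other exists.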

And we are in the lucky position of owning the datastores we are querying. For queries against remote stores, we can't know... yikes

By blind luck and being lazy I avoided this previously just by never using either.