monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

extend provenance/dataset descriptions #22

Closed nlwashington closed 4 years ago

nlwashington commented 9 years ago

Dipper currently creates a very basic description for each data source, which produces something like:

<:biogrid3.2.119> dct:isVersionOf <:biogrid> ; pav:version "3.2.119" .

<:biogrid> a dctypes:Dataset ;
    dct:identifier "biogrid" ;
    dct:issued "2014-12-23" ;
    dct:title "The BioGrid" ;
    dcat:accessURL <http://thebiogrid.org/downloads/archives/Latest%20Release/BIOGRID-ALL-LATEST.mitab.zip>,
        <http://thebiogrid.org/downloads/archives/Latest%20Release/BIOGRID-ALL-LATEST.tab.zip> ;
    foaf:page <http://thebiogrid.org/> .

We need to document and implement requirements for our system, as in this example: http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html#appendix_1

BioGrid is a good example of a resource that also aggregates information from other sources (for example, in addition to curating their own data, they also pull in interaction data from flybase). For full tracking of that provenance chain, it would be good to mock up what the dataset description should look like here. @mbrush after mocking that up, turn the ticket over to @kshefchek or @bryanlaraway to implement the relevant methods in Dataset.py, and in turn apply them to the various sources currently in Dipper.

mbrush commented 9 years ago

Following from March 4 conversation (MH, MB, BL, KS, NW):

  1. Dataset descriptions are for the datasets we produce after ingesting and transforming source data into RDF. In HCLS terms, these datasets are distribution level artifacts (as opposed to version level or summary level datasets). Therefore we should follow HCLS guidelines for what metadata must/should/may/should never be included at this level. And we should link the node representing the dataset to a version level node and a summary level node.

Note this is different from the current implementation, where Monarch datasets are version level artifacts.

  2. For the April milestone, be sure all 'MUSTs' from the HCLS standard are implemented (and an automated test implemented to evaluate conformance).

Should we also create version and summary level descriptions for each dataset (including all the MUSTs)?

  3. Be sure to describe the provenance of the data, linking to the source it was derived from; if this source was an aggregator, consider how to trace back to the sources for the aggregator.
  4. Hang contributors from appropriate dataset nodes, and consider how these will be rolled up from the distribution/version level to the summary level node.
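
As a rough illustration of the automated conformance test mentioned above, here is a minimal sketch in plain Python. The `REQUIRED` table is a placeholder assumption, not the authoritative list of HCLS MUSTs, and `missing_musts` is a hypothetical helper, not something that exists in Dipper. A description is modelled as a dict from property name to a list of values.

```python
# Placeholder MUST lists per description level. These are assumptions for
# illustration; fill them in from the HCLS specification before relying on them.
REQUIRED = {
    "summary": {"rdf:type", "dct:title", "dct:description", "dct:publisher"},
    "version": {"rdf:type", "dct:title", "dct:description", "dct:publisher",
                "pav:version", "dct:isVersionOf"},
    "distribution": {"rdf:type", "dct:title", "dct:description", "dct:publisher",
                     "dct:license", "dct:format"},
}


def missing_musts(level, description):
    """Return the required properties that are absent or have no values."""
    present = {prop for prop, values in description.items() if values}
    return sorted(REQUIRED[level] - present)


# Hypothetical version level description with two gaps.
example = {
    "rdf:type": ["dctypes:Dataset"],
    "dct:title": ["Monarch Biogrid 1.0"],
    "pav:version": ["1.0"],
    "dct:isVersionOf": ["monarch:biogrid"],
    "dct:publisher": [],  # present as a key but empty, so it counts as missing
}
print(missing_musts("version", example))
```

A test per source could then assert that `missing_musts` returns an empty list for each of the three levels.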
mbrush commented 9 years ago

Posting cmap specification for the initial dataset descriptions in the Dropbox here for review, and pasting in copy below.

hcls compliance musts

Briefly, Summary, Version, and Distribution level instances are created for each dataset, according to the HCLS standard. There is also a version level instance created for the source dataset from which the Monarch dataset was derived. And if applicable, there are version level instances for any datasets from which this source dataset was derived (to support full provenance of information). This spec is designed to address all of the MUSTS in the HCLS standard for each Monarch dataset (summary, version, and distribution). These are the properties in blue. There are a few additional properties that are not MUSTS which we include as well.
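
To make the layering concrete, here is a small sketch of how the summary, version, distribution, and source instances might be linked, and how the provenance chain could be walked back through an aggregator. The prefixed names, the FlyBase node, and the choice of `dct:source` as the derivation predicate are all illustrative assumptions, not part of the spec above.

```python
# Each triple is (subject, predicate, object) using prefixed names.
# All identifiers and predicate choices here are illustrative assumptions.
TRIPLES = [
    ("monarch:biogrid1.0", "dct:isVersionOf", "monarch:biogrid"),           # version -> summary
    ("monarch:biogrid1.0", "dcat:distribution", "monarch:biogrid1.0.ttl"),  # version -> distribution
    ("monarch:biogrid1.0", "dct:source", "src:biogrid3.2.119"),             # Monarch version -> source version
    ("src:biogrid3.2.119", "dct:source", "src:flybase"),                    # aggregator -> its upstream source
]


def trace_sources(dataset, triples, predicate="dct:source"):
    """Return every dataset reachable from `dataset` by following source links."""
    found = []
    seen = {dataset}
    stack = [dataset]
    while stack:
        current = stack.pop()
        for subject, pred, obj in triples:
            if subject == current and pred == predicate and obj not in seen:
                seen.add(obj)
                found.append(obj)
                stack.append(obj)
    return found


print(trace_sources("monarch:biogrid1.0", TRIPLES))
```

The `seen` set keeps the walk from looping if two sources ever cite each other.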

There are some open questions posted below. These mostly concern patterns for IRIs, identifiers, and titles for datasets, and entities referenced in their descriptions (e.g. publishers, licenses, formats). @cmungall and @mellybelly please weigh in on these. A short call may be needed to discuss options.

1 We need a standard approach for crafting IRIs, identifiers, and titles for Monarch datasets. An example proposal is below.

Dataset IRIs: 'http://www.monarchinitiative.org/'[source][version][format]
  Summary - http://www.monarchinitiative.org/biogrid
  Version - http://www.monarchinitiative.org/biogrid1.0
  Distribution - http://www.monarchinitiative.org/biogrid1.0ttl

Essentially IRIs would use default/base monarch namespace appended with source, as well as a version for version level, and a format for the distribution level.

Dataset Identifiers: 'monarch:'[source][version][format]
  Summary - "monarch:biogrid"
  Version - "monarch:biogrid1.0"
  Distribution - "monarch:biogrid1.0.ttl"

Consider using prefixed IRIs for identifiers: all lower case, with no spaces between lexical units. Note that identifiers are not HCLS MUSTs (so perhaps do not include them yet).

Dataset Titles: 'Monarch' [source] [version] [format] [short descriptor]
  Summary - "Monarch Biogrid Interactions"
  Version - "Monarch Biogrid 1.0"
  Distribution - "Monarch Biogrid 1.0 ttl"

Titles capitalized, with spaces between lexical units.
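
The IRI and identifier patterns proposed above could be sketched as a small helper. `dataset_names` is a hypothetical function written only to illustrate the proposal. It reproduces the examples exactly as written, including the fact that the distribution IRI has no separator before the format while the distribution identifier uses a dot.

```python
BASE_IRI = "http://www.monarchinitiative.org/"


def dataset_names(source, version, fmt):
    """Build IRIs and identifiers for the three description levels.

    Follows the example proposal literally: [source][version][format] for IRIs,
    and 'monarch:'[source][version].[format] for identifiers.
    """
    src = source.lower()  # the proposal uses all lower case, no spaces
    return {
        "summary": {
            "iri": f"{BASE_IRI}{src}",
            "identifier": f"monarch:{src}",
        },
        "version": {
            "iri": f"{BASE_IRI}{src}{version}",
            "identifier": f"monarch:{src}{version}",
        },
        "distribution": {
            "iri": f"{BASE_IRI}{src}{version}{fmt}",
            "identifier": f"monarch:{src}{version}.{fmt}",
        },
    }


names = dataset_names("BioGrid", "1.0", "ttl")
for level, values in names.items():
    print(level, values["iri"], values["identifier"])
```

Keeping the pattern in one function would make it cheap to change once the separator and URL layout questions are settled.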

2 IRIs for source datasets

Related to the issue above, what IRIs should we use for source datasets from which Monarch datasets are derived? e.g. biogridx.y.z. Should we just use / create identifiers.org based IRIs here?

3 IRIs for entities referenced in dataset descriptions (creators, publishers, licenses, formats)

The HCLS standard either requires or prefers IRIs as values for these attributes of a dataset. So we should think about how we might find or create IRIs for organizations, licenses, and formats. Is there a source of existing IRIs we want to use for these things (e.g. identifiers.org)? If not, can we use a website URL as our IRI (the HCLS chembl example does this)? Or should we mint IRIs ourselves where we can't find them? It may be that we can punt on this for now and just record these as literals rather than IRIs.

4 Defining a source dataset instance

We will need to decide what constitutes an instance of a source dataset, likely on a case-by-case basis. In our 4-29-15 DIPper meeting, we decided that in all cases we should have a dataset that represents all data from a given source. But in cases where this dataset is composed of many subsets of data (tables, files, etc.) that have different topics or different release cycles, we should create dataset instances representing subsets of the full dataset, and annotate them with information specific to them (e.g. date issued, version).

5 License for our Monarch datasets

We need to decide what license to release these under.

micheldumontier commented 9 years ago

:)

cmungall commented 9 years ago

I know it's just an example but we should think about the URL structure. cc @kltm

http://www.monarchinitiative.org/biogrid ==> http://data.monarchinitiative.org/derived?/biogrid

"data.mi.org" would map to nif-crawler. But note the dir layout we have currently on that machine isn't the same. We should align these sooner rather than later for general sanity.

How attached are you to [source][version][format]? The last part should be ".[format]".

I am partial to having current vs archive separated

ShahimEssaid commented 9 years ago

Matt asked me to look at this issue, specifically around IRI/URL patterns and issues. I have a lot to say but I shouldn't do it here. My personal preferences are (and I apologize in advance if none of this appears to make any sense):

  1. If we create new URIs for the sole purpose of identification, we should create them under a URI scheme that is not a resolvable scheme. Using the "http" URI scheme, which is one of the resolvable schemes (others are https, ftp, etc.), leads to the abuse of identity vs location.
  2. See this for an RFC based solution to address this problem: https://tools.ietf.org/html/rfc4151
  3. Then, we need to communicate identities to specific locations to be able to get some results from some location based on some identity.
  4. To address 3, we would tend to embed the identifier in query parameters, as Chris is suggesting in his alternative URL. Chris, I'm curious about why you suggested this change? I like it, but the "/biogrid" identifier still has no context, it is not a global identifier, and it depends on the preceding URL for global identity. I would prefer to have a tag-based value as the query parameter, and to name the query parameter, something like "id=tag" where "tag" is a tag-based URI.

The advantage of the above approach is that it enables this:

Let's say we have something we identify according to the tag RFC, and let's abbreviate it as A. We can then say:

http://some-server/some/rest/service?id=A
http://some-other-server/some/rest/service?id=A
etc.

Basically, you can ask any location for any data that is related to some global identity.
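
A minimal sketch of the idea, assuming the `tag:<authority>,<date>:<specific>` form from RFC 4151. The authority, date, and specific part used here are made-up examples, the server URLs are the placeholders from above, and the helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def make_tag_uri(authority, date, specific):
    """Build a tag URI of the form tag:<authority>,<date>:<specific>."""
    return f"tag:{authority},{date}:{specific}"


def lookup_url(service_url, tag_uri):
    """Embed the location-independent identifier as an 'id' query parameter."""
    return f"{service_url}?{urlencode({'id': tag_uri})}"


def extract_id(url):
    """Recover the identifier on the server side, whatever the host was."""
    return parse_qs(urlsplit(url).query)["id"][0]


# Made-up identity for illustration only.
A = make_tag_uri("monarchinitiative.org", "2015", "biogrid")

for server in ("http://some-server/some/rest/service",
               "http://some-other-server/some/rest/service"):
    url = lookup_url(server, A)
    print(url, "->", extract_id(url))
```

The same identifier round-trips through both locations, so identity no longer depends on which host is asked.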

Another related link: http://www.persid.org/initiative.html

I might be a little too passionate about this, but if we don't start to separate identifiers and locations, model identity beyond simple URIs (to accommodate multiple identities, which might not be URI based, without having to rely on sameAs, etc.), our solutions will keep suffering. The worst case I have seen is the VIVO approach for identifying instances, and this is why I am interested in the solution outlined above.

mbrush commented 9 years ago

We discussed many of the issues above on the May 14 UI call. Decisions and open proposals from this call are summarized in the google doc here.

Please review and comment here, as feedback on the specific pattern proposals is more practical in a google doc. The final decisions can be documented in this ticket when complete.

ccondit commented 9 years ago

Related to SciGraph/SciGraph#106 - would be great to have ontology IRIs for dipper generated sources now that 'isDefinedBy' is used to link axioms back to the source ontology. Will generate a node based on the object hash code for now.

nlwashington commented 9 years ago

@jnguyenx might have things to say about this.

jnguyenx commented 9 years ago

My approach might be too naive, but here's a proposal to version the ttl files.

When I see IRIs and the need for versioning, the first thing I think of is how application documentation is versioned. Applications keep track of the version of the documentation directly in the URL, such as myapp/docs/${version}. There's also a special link that always points to the latest version of the documentation: myapp/docs/latest. I think that this became a standard over the years. Here's an example:
http://spark.apache.org/docs/1.4.1/index.html
http://spark.apache.org/docs/1.3.1/index.html
http://spark.apache.org/docs/latest/index.html

We can also have finer-grained references, for example caring only about the major and minor version, not the tiny one: https://www.playframework.com/documentation/2.4.x/Home

It is clearly a pain to maintain all the dependency versions by hand. One strategy would be to always point to the latest versions when developing, and to resolve the versions to actual numbers for a release. With this we're sure to have a consistent set of data with a proper tag that can be used to reproduce builds and so on.
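
A rough sketch of that resolution step, assuming purely numeric dotted versions. `resolve` and `pin_release` are hypothetical helpers, and the version numbers in the usage example are invented for illustration.

```python
def parse_version(text):
    """Turn '1.4.1' into (1, 4, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def resolve(spec, available):
    """Resolve 'latest', a wildcard such as '2.4.x', or an exact version."""
    versions = sorted(available, key=parse_version)
    if not versions:
        raise ValueError("no versions available")
    if spec == "latest":
        return versions[-1]
    if spec.endswith(".x"):
        prefix = parse_version(spec[:-2])
        matches = [v for v in versions
                   if parse_version(v)[:len(prefix)] == prefix]
        if not matches:
            raise ValueError(f"no version matches {spec}")
        return matches[-1]
    if spec in versions:
        return spec
    raise ValueError(f"unknown version {spec}")


def pin_release(specs, available):
    """Replace every floating spec with a concrete version for a release."""
    return {name: resolve(spec, available[name]) for name, spec in specs.items()}


# Invented version lists, for illustration only.
available = {
    "biogrid": ["3.2.119", "3.2.120"],
    "flybase": ["2.4.0", "2.4.2", "2.5.0"],
}
print(pin_release({"biogrid": "latest", "flybase": "2.4.x"}, available))
```

Comparing parsed tuples rather than strings matters here, since a plain string sort would put "1.10.0" before "1.4.1".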