Closed nlwashington closed 4 years ago
Following from March 4 conversation (MH, MB, BL, KS, NW):
Note this is different than the current implementation where monarch datasets are version level artifacts.
Should we also create version and summary level descriptions for each dataset (including all the MUSTS)?
Posting cmap specification for the initial dataset descriptions in the Dropbox here for review, and pasting in copy below.
Briefly, Summary, Version, and Distribution level instances are created for each dataset, according to the HCLS standard. There is also a version level instance created for the source dataset from which the Monarch dataset was derived. And if applicable, there are version level instances for any datasets from which this source dataset was derived (to support full provenance of information). This spec is designed to address all of the MUSTS in the HCLS standard for each Monarch dataset (summary, version, and distribution). These are the properties in blue. There are a few additional properties that are not MUSTS which we include as well.
There are some open questions posted below. These mostly concern patterns for IRIs, identifiers, and titles for datasets, and for entities referenced in their descriptions (e.g. publishers, licenses, formats). @cmungall and @mellybelly please weigh in on these. A short call may be needed to discuss options.
1 We need a standard approach for crafting IRIs, identifiers, and titles for Monarch datasets. An example proposal is below.
Dataset IRIs: 'http://www.monarchinitiative.org/'[source][version][format]
Summary - http://www.monarchinitiative.org/biogrid
Version - http://www.monarchinitiative.org/biogrid1.0
Distribution - http://www.monarchinitiative.org/biogrid1.0ttl
Essentially, IRIs would use the default/base Monarch namespace appended with the source, plus a version at the version level and a format at the distribution level.
Dataset Identifiers: 'monarch:'[source][version][format]
Summary - "monarch:biogrid"
Version - "monarch:biogrid1.0"
Distribution - "monarch:biogrid1.0.ttl"
Consider using prefixed IRIs for identifiers: all lower case, with no spaces between lexical units. Note that identifiers are not HCLS MUSTS (so perhaps we should not include them yet).
Dataset Titles: 'Monarch' [source] [version] [format] [short descriptor]
Summary - "Monarch Biogrid Interactions"
Version - "Monarch Biogrid 1.0"
Distribution - "Monarch Biogrid 1.0 ttl"
Titles capitalized, with spaces between lexical units.
2 IRIs for source datasets: Related to the issue above, what IRIs should we use for source datasets from which Monarch datasets are derived (e.g. biogridx.y.z)? Should we just use or create identifiers.org-based IRIs here?
3 IRIs for entities referenced in dataset descriptions (creators, publishers, licenses, formats): The HCLS standard either requires or prefers IRIs as values for these attributes of a dataset. So we should think about how we might find or create IRIs for organizations, licenses, and formats. Is there a source of existing IRIs we want to use for these things (e.g. identifiers.org)? If not, can we use a website URL as our IRI (the HCLS chembl example does this)? Or should we mint IRIs ourselves where we can't find them? It may be that we can punt on this for now and just record these as literals rather than IRIs.
4 Defining a source dataset instance: We will need to decide what constitutes an instance of a source dataset, likely on a case-by-case basis. In our 4-29-15 DIPper meeting, we decided that in all cases we should have a dataset that represents all data from a given source. But in cases where this dataset is composed of many subsets of data (tables, files, etc.) that have different topics or different release cycles, we should also create dataset instances representing those subsets, and annotate each with the information specific to it (e.g. date issued, version).
5 License for our Monarch datasets: We need to decide what license to release these under.
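As a rough illustration of the naming patterns proposed in open question 1, here is a minimal Python sketch. The function name `dataset_names` and the `fmt_sep` parameter are hypothetical and not part of Dipper. The examples in this thread differ on whether a dot precedes the format ("biogrid1.0ttl" vs "biogrid1.0.ttl"), so the separator is left configurable rather than decided here.

```python
# Hypothetical helper illustrating the proposed [source][version][format] pattern.
BASE_IRI = "http://www.monarchinitiative.org/"  # base namespace from the proposal


def dataset_names(source, version=None, fmt=None, descriptor=None, fmt_sep=""):
    """Return (iri, identifier, title) for one dataset description level.

    source only            -> summary level
    source + version       -> version level
    source + version + fmt -> distribution level
    """
    if fmt is not None and version is None:
        raise ValueError("a distribution-level name needs a version")

    key = source.lower()                      # lower case, no spaces
    title = ["Monarch", source.capitalize()]  # capitalized, space separated

    if version is not None:
        key += version
        title.append(version)
    if fmt is not None:
        key += fmt_sep + fmt                  # fmt_sep="." gives "biogrid1.0.ttl"
        title.append(fmt)
    if version is None and descriptor:
        title.append(descriptor)              # short descriptor on the summary title

    return BASE_IRI + key, "monarch:" + key, " ".join(title)


if __name__ == "__main__":
    print(dataset_names("biogrid", descriptor="Interactions"))
    print(dataset_names("biogrid", "1.0"))
    print(dataset_names("biogrid", "1.0", "ttl"))
```

Keeping the three strings in one function means a later change to the pattern (for example moving to a data.monarchinitiative.org base) only has to be made in one place.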
:)
I know it's just an example but we should think about the URL structure. cc @kltm
http://www.monarchinitiative.org/biogrid ==> http://data.monarchinitiative.org/derived?/biogrid
"data.mi.org" would map to nif-crawler. But note the dir layout we have currently on that machine isn't the same. We should align these sooner rather than later for general sanity.
How attached are you to
[source][version][format]
Should be ".[format]"
I am partial to having current vs archive separated
Matt asked me to look at this issue, specifically around IRI/URL patterns, and issues. I have a lot to say but I shouldn't do it here. My personal preferences are (and I apologize in advance if none of this appears to make any sense):
The advantage of the above approach is that it enables this:
Let's say we have something we identify according to the tag RFC, and let's abbreviate it as A. We can then say:
http://some-server/some/rest/servcice?id=A http://some-other-server/some/rest/servcice?id=A etc.
Basically, you can ask any location for any data that is related to some global identity.
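To make the identifier/location split concrete, here is a small Python sketch. The endpoint URLs and the tag-style identifier are placeholders made up for illustration, and `lookup_urls` is a hypothetical name. The only point is that the identifier stays the same while the location varies.

```python
from urllib.parse import urlencode


def lookup_urls(identifier, endpoints):
    """Build one lookup URL per endpoint for the same global identifier.

    The identifier is percent-encoded, so identifiers containing ':' or ','
    (as tag-style identifiers do) survive as a query parameter value.
    """
    query = urlencode({"id": identifier})
    return [f"{endpoint}?{query}" for endpoint in endpoints]


if __name__ == "__main__":
    # Placeholder locations; any server that knows the identity could be listed.
    endpoints = [
        "http://some-server/some/rest/service",
        "http://some-other-server/some/rest/service",
    ]
    # Made-up tag-style identifier, standing in for "A" above.
    for url in lookup_urls("tag:example.org,2015:biogrid", endpoints):
        print(url)
```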
Another related link: http://www.persid.org/initiative.html
I might be a little too passionate about this, but if we don't start to separate identifiers and locations, and to model identity beyond simple URIs (to accommodate multiple identities, which might not be URI based, without having to rely on sameAs, etc.), our solutions will keep suffering. The worst case I have seen is the VIVO approach for identifying instances, and this is why I am interested in the solution outlined above.
We discussed many of the issues above on the May 14 UI call. Decisions and open proposals from this call are summarized in the google doc here.
Please review and comment there, as feedback on the specific pattern proposals is more practical in a google doc. The final decisions can be documented in this ticket when complete.
Related to SciGraph/SciGraph#106 - would be great to have ontology IRIs for dipper generated sources now that 'isDefinedBy' is used to link axioms back to the source ontology. Will generate a node based on the object hash code for now.
@jnguyenx might have things to say about this.
My approach might be too naive, but here's a proposal to version the ttl files.
When I see IRIs and the need for versioning, the first thing I think of is how application documentation is versioned. Applications keep track of the documentation version directly in the URL, such as myapp/docs/${version}. There's also a special link that always points to the latest version of the documentation: myapp/docs/latest. I think this has become a standard over the years. Here's an example:
http://spark.apache.org/docs/1.4.1/index.html
http://spark.apache.org/docs/1.3.1/index.html
http://spark.apache.org/docs/latest/index.html
We can also have more fine-grained references, such as caring only about the major and minor version, not the tiny one: https://www.playframework.com/documentation/2.4.x/Home
It is clearly a pain to maintain all the dependency versions by hand. One strategy would be to always point to the latest versions when developing, and to resolve the versions to actual numbers for a release. With this we're sure to have a consistent set of data with a proper tag that can be used to reproduce builds and so on.
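A minimal sketch of that resolution step in Python, assuming dotted numeric versions and a known list of available versions. The names `parse_version` and `resolve_version` are hypothetical, and the version list is just the Spark example above.

```python
def parse_version(version):
    """Turn '1.4.1' into (1, 4, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve_version(spec, available):
    """Resolve 'latest', a wildcard such as '1.4.x', or an exact version.

    Returns the highest matching concrete version from `available`.
    """
    if spec == "latest":
        candidates = list(available)
    elif spec.endswith(".x"):
        prefix = parse_version(spec[:-2])
        candidates = [
            v for v in available if parse_version(v)[: len(prefix)] == prefix
        ]
    else:
        candidates = [v for v in available if v == spec]

    if not candidates:
        raise LookupError(f"no available version matches {spec!r}")
    return max(candidates, key=parse_version)


if __name__ == "__main__":
    available = ["1.3.1", "1.4.0", "1.4.1"]
    # During development, point at "latest"; at release time, pin the result.
    for spec in ("latest", "1.3.x", "1.4.0"):
        print(spec, "->", resolve_version(spec, available))
```

Recording the resolved numbers at release time is what makes a build reproducible; "latest" on its own cannot be replayed later.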
Dipper currently has a very basic description created for each datasource, that produces something like:
<:biogrid3.2.119> dct:isVersionOf <:biogrid> ; pav:version "3.2.119" .
<:biogrid> a dctypes:Dataset ; dct:identifier "biogrid" ; dct:issued "2014-12-23" ; dct:title "The BioGrid" ; dcat:accessURL <http://thebiogrid.org/downloads/archives/Latest%20Release/BIOGRID-ALL-LATEST.mitab.zip>, <http://thebiogrid.org/downloads/archives/Latest%20Release/BIOGRID-ALL-LATEST.tab.zip> ; foaf:page <http://thebiogrid.org/> .
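For readers who want to see how a description of this shape could be assembled, here is a plain-Python sketch that formats the same triples as text. It is not Dipper's actual Dataset.py code; the function name `describe_source` and its parameters are hypothetical, prefix declarations are omitted, and it assumes the literal values contain no quote characters.

```python
def describe_source(identifier, version, title, issued, access_urls, page):
    """Format a version-level and a dataset-level description as Turtle-like text.

    Assumes literal values contain no double quotes (no escaping is done).
    Full IRIs are wrapped in angle brackets, as Turtle requires.
    """
    urls = ",\n        ".join(f"<{url}>" for url in access_urls)
    return (
        f"<:{identifier}{version}> dct:isVersionOf <:{identifier}> ;\n"
        f'    pav:version "{version}" .\n'
        "\n"
        f"<:{identifier}> a dctypes:Dataset ;\n"
        f'    dct:identifier "{identifier}" ;\n'
        f'    dct:issued "{issued}" ;\n'
        f'    dct:title "{title}" ;\n'
        f"    dcat:accessURL {urls} ;\n"
        f"    foaf:page <{page}> .\n"
    )


if __name__ == "__main__":
    print(
        describe_source(
            identifier="biogrid",
            version="3.2.119",
            title="The BioGrid",
            issued="2014-12-23",
            access_urls=[
                "http://thebiogrid.org/downloads/archives/Latest%20Release/BIOGRID-ALL-LATEST.mitab.zip",
                "http://thebiogrid.org/downloads/archives/Latest%20Release/BIOGRID-ALL-LATEST.tab.zip",
            ],
            page="http://thebiogrid.org/",
        )
    )
```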
We need to document and implement requirements for our system, as in this example: http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html#appendix_1
BioGrid is a good example of a resource that also aggregates information from other sources (for example, in addition to curating their own data, they also pull in interaction data from flybase). For full tracking of that provenance chain, it would be good to mock up what the dataset description should look like here. @mbrush after mocking that up, turn the ticket over to @kshefchek or @bryanlaraway to implement the relevant methods in Dataset.py, and in turn apply them to the various sources currently in Dipper.