Closed by cmungall 5 years ago
These are not currently loaded into scigraph. Is there anything wrong with grouping the data and dataset metadata all in one file? Either way is fine.
I had a think about how best to redo the metadata emitted by Dipper (currently emitted in dataset TTLs) to address this ticket and also https://github.com/monarch-initiative/monarch-ui/issues/58 and https://github.com/monarch-initiative/dipper/issues/753, and thought @cmungall @TomConlin @kshefchek might have some thoughts. I'll hopefully have a PR soon for this too.
Our metadata are intended to follow the W3C HCLS spec for this (see especially figure 1 of the spec). For Dipper ingests, I think this basically translates to something like:
Summary level: The summary level provides a description of a dataset that is independent of a specific version or format == e.g. the Monarch ingest of CTD
Version level: The version level captures version-specific characteristics of a dataset == e.g. the 01-02-2018 ingest of CTD
Distribution level: The distribution level captures metadata about a specific form and version of a dataset == e.g. the turtle file for 01-02-2018 ingest of CTD
We can write out at least the following triples to address the issues in these tickets and make us vaguely HCLS compliant (again, see fig 1):
[summary level resource] --- rdf:type ---> dctypes:Dataset
[version level resource] --- rdf:type ---> dctypes:Dataset
[version level resource] --- dct:isVersionOf ---> [summary level resource]
[version level resource] --- pav:version --> [ingest timestamp]
[version level resource] --- dc:source ----> [source web page, e.g. omim.org]
[version level resource] --- schema:logo --> [source logo IRI]
[version level resource] --- dc:source ----> [source file 1 IRI]
[version level resource] --- dc:source ----> [source file 2 IRI]
...
[version level resource] --- void:dataset -> [distribution level resource]
[distribution level resource] --- rdf:type ---> dctypes:Dataset
[distribution level resource] --- rdf:type ---> dcat:Distribution
[distribution level resource] --- dcat:accessURL --> [MI ttl URL]
[distribution level resource] --- dcat:accessURL --> [MI nt URL]
...
[distribution level resource] --- void:triples --> [triples count (literal)]
[distribution level resource] --- void:distinctSubjects -> [subject count (literal)]
[distribution level resource] --- void:distinctObjects -> [object count (literal)]
...
[source file 1 IRI] -- pav:version ---> [download date timestamp]
[source file 1 IRI] -- pav:version ---> [source version (if set, optional)]
[source file 2 IRI] -- pav:version ---> [download date timestamp]
[source file 2 IRI] -- pav:version ---> [source version (if set, optional)]
...
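To make the three levels concrete, here is a minimal sketch of how the triples above could be assembled. It uses plain Python tuples rather than Dipper's actual graph classes; `build_metadata`, its parameters, and every `https://example.org/...` IRI are hypothetical placeholders, and only predicates named in the list are used.

```python
# Namespace IRIs for the prefixes used in the triple list above.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCT = "http://purl.org/dc/terms/"
DCTYPES = "http://purl.org/dc/dcmitype/"
PAV = "http://purl.org/pav/"
DCAT = "http://www.w3.org/ns/dcat#"
VOID = "http://rdfs.org/ns/void#"
DC = "http://purl.org/dc/elements/1.1/"


def build_metadata(summary, version, distribution, ingest_ts,
                   source_files, access_urls):
    """Return (subject, predicate, object) tuples for the three levels.

    Hypothetical inputs: IRIs and timestamps as plain strings;
    source_files maps each source file IRI to its download timestamp.
    This sketch does not distinguish IRIs from literals.
    """
    triples = [
        # Summary level
        (summary, RDF + "type", DCTYPES + "Dataset"),
        # Version level
        (version, RDF + "type", DCTYPES + "Dataset"),
        (version, DCT + "isVersionOf", summary),
        (version, PAV + "version", ingest_ts),
        # Predicate copied as written in the list above.
        (version, VOID + "dataset", distribution),
        # Distribution level
        (distribution, RDF + "type", DCTYPES + "Dataset"),
        (distribution, RDF + "type", DCAT + "Distribution"),
    ]
    for url in access_urls:
        triples.append((distribution, DCAT + "accessURL", url))
    for file_iri, downloaded in source_files.items():
        triples.append((version, DC + "source", file_iri))
        triples.append((file_iri, PAV + "version", downloaded))
    return triples


# Example with made-up IRIs (example.org is a placeholder, not a Monarch URL).
triples = build_metadata(
    summary="https://example.org/ctd",
    version="https://example.org/ctd/20180102",
    distribution="https://example.org/ctd/20180102/ttl",
    ingest_ts="2018-01-02",
    source_files={"https://example.org/src/file1.tsv": "2018-01-01"},
    access_urls=["https://example.org/ttl/ctd.ttl"],
)
for s, p, o in triples:
    print(s, p, o)
```

Keeping the summary, version, and distribution IRIs as separate arguments mirrors the HCLS split, so a new ingest only needs a new version and distribution IRI pointing at the same summary resource.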
We can add other metadata later, e.g. subject/object category counts, etc.
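For the void:triples, void:distinctSubjects and void:distinctObjects counts, something like the following would probably do. This is only a sketch over an in-memory list of (subject, predicate, object) tuples; `void_stats` and the `ex:` identifiers are made up for illustration and are not part of Dipper.

```python
def void_stats(triples):
    """Count triples, distinct subjects, and distinct objects."""
    # An RDF graph is a set of triples, so duplicates are counted once.
    unique = set(triples)
    return {
        "void:triples": len(unique),
        "void:distinctSubjects": len({s for s, _, _ in unique}),
        "void:distinctObjects": len({o for _, _, o in unique}),
    }


# Made-up example data; the last tuple duplicates the first.
example = [
    ("ex:a", "ex:p", "ex:b"),
    ("ex:a", "ex:q", "ex:c"),
    ("ex:b", "ex:p", "ex:c"),
    ("ex:a", "ex:p", "ex:b"),
]
stats = void_stats(example)
print(stats)
# {'void:triples': 3, 'void:distinctSubjects': 2, 'void:distinctObjects': 2}
```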
addressed by #809
Currently the dataset.ttl we produce has some issues. Example:
dcterms:identifier "MonarchArchive:ttl/ctd.ttl"
- intentionally a literal?
"MonarchData:MonarchArchive:ttl/ctd.ttl.ttl"
But the main issue is that source and transform are conflated here. It's not appropriate that we give a title of "CTD" (for example) to a URI that is our transform. The use of accessURL is also confusing here. These are not access URLs! They seem to represent the things that went into the transform.
What about the following structure, in which we have separate nodes for (1) the Monarch transform, (2) the source database, and (3) the individual files/URLs utilized? The structure would be: 1 derived from 2, and 3 part-of 2.
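A rough sketch of that three-node shape, again as plain tuples. The predicate choices are assumptions, since none is fixed here: `prov:wasDerivedFrom` stands in for "derived from" and `dct:isPartOf` for "part-of". The function name and the `https://example.org/...` IRIs are hypothetical.

```python
# Assumed predicates; any "derived from" / "part of" pair could be swapped in.
PROV_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"
DCT_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"


def source_structure(transform, source, files):
    """Link (1) the transform, (2) the source database, (3) its files."""
    # 1 derived from 2
    triples = [(transform, PROV_DERIVED_FROM, source)]
    # 3 part-of 2, one triple per file or URL used
    triples += [(f, DCT_IS_PART_OF, source) for f in files]
    return triples


# Placeholder IRIs for illustration only.
structure = source_structure(
    transform="https://example.org/monarch/ctd",
    source="https://example.org/source/ctd",
    files=[
        "https://example.org/source/ctd/file1.tsv",
        "https://example.org/source/ctd/file2.tsv",
    ],
)
for s, p, o in structure:
    print(s, p, o)
```

With the nodes separated like this, a title such as "CTD" can sit on the source node while our own ingest date and archive URL stay on the transform node.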
Ultimately we need to be able to drive the page here: https://beta.monarchinitiative.org/sources
That page currently conflates things, e.g. the link called "Reactome" doesn't take you to Reactome, but to our archive of our ingest. The dates also appear to be conflated between source and transform.
I think the columns should be:
We can also have a dedicated per-source page where we show all metadata in the dataset.ttl
See also https://github.com/monarch-initiative/monarch-ui/issues/58