monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Fixes required for datasets.ttl #792

Closed cmungall closed 5 years ago

cmungall commented 5 years ago

Currently the dataset.ttl we produce has some issues. Example:

@prefix : <https://monarchinitiative.org/> .
@prefix MonarchArchive: <https://archive.monarchinitiative.org/201907/> .
@prefix OBO: <http://purl.obolibrary.org/obo/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dctypes: <http://purl.org/dc/dcmitype/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix pav: <http://purl.org/pav/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://archive.monarchinitiative.org/201907/ttl/ctd.ttl> a dctypes:Dataset ;
    dcterms:identifier "MonarchArchive:ttl/ctd.ttl" ;
    dcterms:issued "2019-07-16",
        "Wed-Jun-26-14-47-40-EDT-2019",
        "Wed-Jun-26-14-49-06-EDT-2019",
        "Wed-Jun-26-15-26-21-EDT-2019" ;
    dcterms:rights "http://ctdbase.org/about/legal.jsp" ;
    dcterms:title "Comparative Toxicogenomics Database" ;
    dcat:accessURL <http://ctdbase.org/reports/CTD_chemicals_diseases.tsv.gz>,
        <http://ctdbase.org/reports/CTD_genes_diseases.tsv.gz>,
        <http://ctdbase.org/reports/CTD_genes_pathways.tsv.gz> ;
    foaf:page <http://ctdbase.org> .

<https://archive.monarchinitiative.org/201907/ttl/ctd.ttlWed-Jun-26-14-47-40-EDT-2019> dcterms:isVersionOf "MonarchArchive:ttl/ctd.ttl" ;
    pav:version "Wed-Jun-26-14-47-40-EDT-2019" .

<https://archive.monarchinitiative.org/201907/ttl/ctd.ttlWed-Jun-26-14-49-06-EDT-2019> dcterms:isVersionOf "MonarchArchive:ttl/ctd.ttl" ;
    pav:version "Wed-Jun-26-14-49-06-EDT-2019" .

<https://archive.monarchinitiative.org/201907/ttl/ctd.ttlWed-Jun-26-15-26-21-EDT-2019> dcterms:isVersionOf "MonarchArchive:ttl/ctd.ttl" ;
    pav:version "Wed-Jun-26-15-26-21-EDT-2019" .

<https://monarchinitiative.org/2019-07-15T18:15:38> dcterms:isVersionOf "MonarchData:MonarchArchive:ttl/ctd.ttl.ttl" ;
    dcterms:issued "2019-07-15T18:15:38"^^xsd:dateTime ;
    pav:version "2019-07-15T18:15:38" .

But the main issue is that source and transform are conflated here. It is not appropriate to give a title of "CTD" (for example) to a URI that is our transform. The use of accessURL is also confusing: these are not access URLs! They seem to represent the things that went into the transform.

What about the following structure, in which we have separate nodes for (1) the Monarch transform, (2) the source database, and (3) the individual files/URLs utilized? The structure would be: 1 derived from 2, 3 part-of 2.
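A minimal sketch of that three-node structure, written as plain Python tuples and not against any Dipper API. The predicates `prov:wasDerivedFrom` and `dcterms:isPartOf` are assumptions standing in for "derived from" and "part-of"; the thread does not settle which properties to use. The helper names are hypothetical, and the IRIs are taken from the CTD example above.

```python
# Hypothetical sketch: the predicate choices below are assumptions, not Dipper's.
PROV = "http://www.w3.org/ns/prov#"
DCTERMS = "http://purl.org/dc/terms/"


def provenance_triples(transform_iri, source_iri, file_iris):
    """Return (subject, predicate, object) IRI triples linking
    (1) the Monarch transform, (2) the source database, (3) its files."""
    # 1 derived from 2
    triples = [(transform_iri, PROV + "wasDerivedFrom", source_iri)]
    # 3 part-of 2, once per source file
    for file_iri in file_iris:
        triples.append((file_iri, DCTERMS + "isPartOf", source_iri))
    return triples


def to_ntriples(triples):
    """Serialize IRI-only triples as N-Triples lines."""
    return "\n".join("<%s> <%s> <%s> ." % t for t in triples)


ctd = provenance_triples(
    "https://archive.monarchinitiative.org/201907/ttl/ctd.ttl",
    "http://ctdbase.org",
    [
        "http://ctdbase.org/reports/CTD_chemicals_diseases.tsv.gz",
        "http://ctdbase.org/reports/CTD_genes_diseases.tsv.gz",
        "http://ctdbase.org/reports/CTD_genes_pathways.tsv.gz",
    ],
)
print(to_ntriples(ctd))
```

With this layout the title "Comparative Toxicogenomics Database" would attach to the source node (`http://ctdbase.org`) and not to the transform.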

Ultimately we need to be able to drive the page here: https://beta.monarchinitiative.org/sources

That page currently conflates things, e.g. the link called "Reactome" doesn't take you to Reactome, but to our archive of our ingest. The dates also appear conflated between source and transform.

I think the columns should be:

  1. Source (+logo) linked to foaf:page for the source
  2. Date grabbed from source, or their version number if provided
  3. Date of our ingest, linked to our archive ttl
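The three columns could be assembled from per-source metadata roughly as follows. This is a sketch only: `SourceRecord`, its fields, and `source_row` are hypothetical names, not Dipper or monarch-ui identifiers. The example values are loosely taken from the CTD sample above, and reading its timestamps as "fetched" and "ingested" dates is an assumption. It also assumes the source's own version string, when present, is shown in preference to the download date.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceRecord:
    # All field names are hypothetical, chosen for illustration.
    name: str
    page: str                      # foaf:page of the source
    fetched: str                   # date the files were grabbed from the source
    source_version: Optional[str]  # the source's own version, if it provides one
    ingested: str                  # date of the Monarch ingest
    archive_ttl: str               # URL of the archived transform


def source_row(rec):
    """Build the three proposed columns as (text, link) pairs."""
    return [
        (rec.name, rec.page),                        # 1. source, linked to its page
        (rec.source_version or rec.fetched, None),   # 2. source version or fetch date
        (rec.ingested, rec.archive_ttl),             # 3. ingest date, linked to archive
    ]


ctd = SourceRecord(
    name="Comparative Toxicogenomics Database",
    page="http://ctdbase.org",
    fetched="2019-06-26",      # assumed reading of the sample timestamps
    source_version=None,
    ingested="2019-07-15",     # assumed reading of the sample timestamps
    archive_ttl="https://archive.monarchinitiative.org/201907/ttl/ctd.ttl",
)
for text, link in source_row(ctd):
    print(text, link)
```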

We can also have a dedicated per-source page where we show all metadata in the dataset.ttl

See also https://github.com/monarch-initiative/monarch-ui/issues/58

kshefchek commented 5 years ago

These are not currently loaded into scigraph. Is there anything wrong with grouping the data and dataset metadata all in one file? Either way is fine.

justaddcoffee commented 5 years ago

I had a think about how best to redo the metadata emitted by Dipper (currently emitted in dataset TTLs) to address this ticket and also https://github.com/monarch-initiative/monarch-ui/issues/58 and https://github.com/monarch-initiative/dipper/issues/753, and thought @cmungall @TomConlin @kshefchek might have some thoughts. I'll hopefully have a PR soon for this too.

Our metadata are intended to follow the W3C HCLS spec for this (see especially figure 1 of the spec). For Dipper ingests, I think this basically translates to something like:

Summary level: The summary level provides a description of a dataset that is
independent of a specific version or format == e.g. the Monarch ingest of CTD

Version level: The version level captures version-specific characteristics of a
dataset == e.g. the 01-02-2018 ingest of CTD

Distribution level: The distribution level captures metadata about a specific form
and version of a dataset == e.g. the turtle file for 01-02-2018 ingest of CTD

We can write out at least the following triples to address the issues in these tickets and make us roughly HCLS compliant (again, see fig. 1):

[summary level resource] --- rdf:type ---> dctypes:Dataset

[version level resource] --- rdf:type ---> dctypes:Dataset
[version level resource] --- dct:isVersionOf ---> [summary level resource]
[version level resource] --- pav:version --> [ingest timestamp]
[version level resource] --- dc:source ----> [source web page, e.g. omim.org]
[version level resource] --- schema:logo --> [source logo IRI]
[version level resource] --- dc:source ----> [source file 1 IRI]
[version level resource] --- dc:source ----> [source file 2 IRI]
...
[version level resource] --- void:dataset -> [distribution level resource]

[distribution level resource] --- rdf:type ---> dctypes:Dataset
[distribution level resource] --- rdf:type ---> dcat:Distribution
[distribution level resource] --- dcat:accessURL --> [MI ttl URL]
[distribution level resource] --- dcat:accessURL --> [MI nt URL]
...

[distribution level resource] --- void:triples --> [triples count (literal)]
[distribution level resource] --- void:distinctSubjects -> [subject count (literal)]
[distribution level resource] --- void:distinctObjects -> [object count (literal)]
...

[source file 1 IRI] -- pav:version ---> [download date timestamp]
[source file 1 IRI] -- pav:version ---> [source version (if set, optional)]
[source file 2 IRI] -- pav:version ---> [download date timestamp]
[source file 2 IRI] -- pav:version ---> [source version (if set, optional)]
...

We can add other metadata later, e.g. subject/object category counts, etc.
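The three levels and the counts listed above could be emitted along these lines. This is a plain-Python sketch, not the code from any Dipper PR: the function names, the `https://example.org/...` IRIs, and the use of `dcterms:source` for the `dc:source` lines are assumptions. The `void:dataset` predicate is copied as written in the list above and has not been checked against the VoID vocabulary. Objects are kept as plain Python values, so a real serializer would still need to distinguish IRIs from literals.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTERMS = "http://purl.org/dc/terms/"
DCTYPES = "http://purl.org/dc/dcmitype/"
DCAT = "http://www.w3.org/ns/dcat#"
PAV = "http://purl.org/pav/"
VOID = "http://rdfs.org/ns/void#"


def hcls_triples(summary, version, distribution, ingest_time, sources, access_urls):
    """Link the summary, version, and distribution level resources."""
    triples = [
        (summary, RDF_TYPE, DCTYPES + "Dataset"),
        (version, RDF_TYPE, DCTYPES + "Dataset"),
        (version, DCTERMS + "isVersionOf", summary),
        (version, PAV + "version", ingest_time),       # literal timestamp
        # Predicate name copied from the proposal above, unverified.
        (version, VOID + "dataset", distribution),
        (distribution, RDF_TYPE, DCTYPES + "Dataset"),
        (distribution, RDF_TYPE, DCAT + "Distribution"),
    ]
    # Source web page and source files all hang off the version level.
    triples += [(version, DCTERMS + "source", s) for s in sources]
    triples += [(distribution, DCAT + "accessURL", u) for u in access_urls]
    return triples


def void_counts(distribution, data_triples):
    """Count triples, distinct subjects, and distinct objects of the data."""
    subjects = {s for s, _, _ in data_triples}
    objects = {o for _, _, o in data_triples}
    return [
        (distribution, VOID + "triples", len(data_triples)),
        (distribution, VOID + "distinctSubjects", len(subjects)),
        (distribution, VOID + "distinctObjects", len(objects)),
    ]


# Hypothetical IRIs for the three levels; the timestamp comes from the sample.
meta = hcls_triples(
    summary="https://example.org/ctd",
    version="https://example.org/ctd/201907",
    distribution="https://example.org/ctd/201907/ttl",
    ingest_time="2019-07-15T18:15:38",
    sources=["http://ctdbase.org",
             "http://ctdbase.org/reports/CTD_genes_diseases.tsv.gz"],
    access_urls=["https://archive.monarchinitiative.org/201907/ttl/ctd.ttl"],
)
toy_data = [("a", "p", "b"), ("a", "q", "c"), ("d", "p", "b")]  # toy triples
stats = void_counts("https://example.org/ctd/201907/ttl", toy_data)
for triple in meta + stats:
    print(triple)
```

Keeping the counts in a separate function means they can be computed once the data graph is complete, independently of the level-linking triples.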

justaddcoffee commented 5 years ago

addressed by #809