monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Simplify modeling patterns and eliminate blank nodes that do no work for us #560

Open cmungall opened 6 years ago

cmungall commented 6 years ago

Dipper should not introduce blank nodes unless either they are present in the structure of the source or if it does work for us in reasoning. The structure of the graph should reflect the current solr 'triples'.

Even in cases where these are introduced, we should place a shortcut relation.

This would need to be coordinated with https://github.com/monarch-initiative/monarch-cypher-queries/tree/master/src/main/cypher/golr-loader (the queries would become simpler, as there will be an isomorphic triple-to-triple mapping).
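To make the shortcut idea concrete, here is a minimal sketch; it is not Dipper's actual code. It assumes triples are plain (subject, predicate, object) tuples and that blank nodes carry a `_:` prefix. The function name `add_shortcuts`, the `EX:` identifiers, and the choice of RO:0002200 as the shortcut predicate are all hypothetical.

```python
def is_blank(node):
    """Assumption: blank nodes are written with a '_:' prefix."""
    return node.startswith("_:")


def add_shortcuts(triples, link_pred, assoc_pred, shortcut_pred):
    """For each blank node B with (B, link_pred, G) and (B, assoc_pred, P),
    add the direct triple (G, shortcut_pred, P). Existing triples are kept."""
    triples = set(triples)
    linked = {}      # blank node -> entities it points to via link_pred
    associated = {}  # blank node -> entities it points to via assoc_pred
    for s, p, o in triples:
        if not is_blank(s):
            continue
        if p == link_pred:
            linked.setdefault(s, set()).add(o)
        elif p == assoc_pred:
            associated.setdefault(s, set()).add(o)
    shortcuts = {
        (g, shortcut_pred, ph)
        for b, genes in linked.items()
        for g in genes
        for ph in associated.get(b, ())
    }
    return triples | shortcuts


# Hypothetical data: an anonymous feature links gene G to phenotype P.
source = [
    ("_:b1", "GENO:0000418", "EX:geneG"),
    ("_:b1", "RO:0002200", "EX:phenoP"),
]
result = add_shortcuts(source, "GENO:0000418", "RO:0002200", "RO:0002200")
```

With the shortcut in place, a consumer can go from the gene to the phenotype in one hop, while the original two-hop structure stays available for reasoning.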

TomConlin commented 6 years ago

+1,000,000

kshefchek commented 6 years ago

"Dipper should not introduce blank nodes unless either they are present in the structure of the source or if it does work for us in reasoning"

I think we need to clearly define what this means. I'm guessing our most common use case for bnodes is instances where a source provides a label but does not expose a persistent ID.

jmcmurry commented 6 years ago

I believe it is also applied when the ID is persistent but not resolvable; Tom, can you comment?

TomConlin commented 6 years ago

I intentionally did not mention our not-even-bnode identifier constructs, created as IRIs for under-identified external objects, to avoid expanding the scope of this issue (but they are the first thing that comes to mind).

One thought may be to establish a CURIE prefix LCL: (shades of the FASTA defline standard), which denotes that what follows is a LOCAL identifier, with disclaimers about what that means.

jmcmurry commented 6 years ago

What happened to our plan to create surrogate IDs in this stable-but-not-resolvable scenario?

TomConlin commented 6 years ago

I do not know; perhaps it was another thought NW was not able to communicate before she left?

cmungall commented 6 years ago

Sorry I wasn't precise in my original wording. Let's keep this ticket about modeling and not identifiers. I'm talking about the pattern of introducing a blank or otherwise surrogate node for an entity that is only implicit in the source.

The broader context is simplifying things for bioinformaticians. If someone is in front of their computer using either the Neo4j interface or the SciGraph Swagger, and they have a gene G, and they want to find a phenotype P that is somehow associated, then their query should be SELECT * WHERE {G ?R ?P} (or whatever the Cypher equivalent is), not

      MATCH (locus:gene)<-[:GENO:0000418!]-(feature)
        WITH feature, COUNT(DISTINCT(locus)) as gene_count
        WHERE gene_count = 1
        AND NOT feature:snv
        MATCH path=(subject:gene)<-[geno:GENO:0000418!]-(feature)-[relation:RO:0002200|RO:0002326|RO:0003302!]->(object:Phenotype)
        RETURN path,
        subject, object, relation,
        'gene' AS subject_category,
        'phenotype' AS object_category,
        'direct' as qualifier
        UNION ALL
        MATCH path=(subject:gene)<-[geno:GENO:0000418!]-(feature:snv)-[relation:RO:0002200|RO:0002326|RO:0003302!]->(object:Phenotype)
        RETURN path,
        subject, object, relation,
        'gene' AS subject_category,
        'phenotype' AS object_category,
        'direct' as qualifier
        UNION ALL
        MATCH path=(subject:gene)<-[geno:GENO:0000418!*0..1]-(feature)-[relation:RO:0002200|RO:0002326|RO:0003302!]->(object:Phenotype)
        WHERE NOT ANY (rel in geno where rel.isDefinedBy="https://data.monarchinitiative.org/ttl/mgi.ttl")
        AND NOT ANY (rel in geno where "https://data.monarchinitiative.org/ttl/mgi.ttl" in rel.isDefinedBy)
        RETURN path,
        subject, object, relation,
        'gene' AS subject_category,
        'phenotype' AS object_category,
        'direct' as qualifier
        UNION ALL
        MATCH (relation:Node{iri:'http://purl.obolibrary.org/obo/RO_0002200'})
        WITH relation
        MATCH path=(subject:gene)<-[:GENO:0000418!*0..1]-(feature)-[:RO:0002200|RO:0003302!*2..]->(object:Phenotype)
        RETURN path,
        subject, object, relation,
        'gene' AS subject_category,
        'phenotype' AS object_category,
        'inferred through variant' as qualifier
        UNION ALL
        MATCH (relation:Node{iri:'http://purl.obolibrary.org/obo/RO_0002200'})
        WITH relation
        MATCH path=(subject:gene)<-[:GENO:0000418!*0..1]-(feature)<-[:BFO:0000051!*]-(genotype:genotype)-[rel:RO:0002200|RO:0002326|RO:0003302!*]->(object:Phenotype)
        WHERE NOT ANY (pheno_rel in rel where pheno_rel.isDefinedBy="https://data.monarchinitiative.org/ttl/mgi.ttl" OR pheno_rel.isDefinedBy="https://data.monarchinitiative.org/ttl/zfin.ttl")
        RETURN path,
        subject, object, relation,
        'gene' AS subject_category,
        'phenotype' AS object_category,
        'inferred' as qualifier
        UNION ALL
        MATCH path=(subject:gene)<-[:GENO:0000418]-(allele)-[:BFO:0000051!*]->(feature)-[relation:RO:0002200|RO:0002326|RO:0002610|RO:0003302!]->(object:Phenotype)
        RETURN path,
        subject, object, relation,
        'gene' AS subject_category,
        'phenotype' AS object_category,
        'inferred' as qualifier
        UNION ALL
        MATCH (relation:Node{iri:'http://purl.obolibrary.org/obo/RO_0002200'})
        WITH relation
        MATCH path=(subject:gene)<-[:GENO:0000418!*0..1]-(feature)<-[:BFO:0000051!*]-(genotype:genotype)<-[:GENO:0000222|RO:0001000*1..2]-(person)-[rel:RO:0002200|RO:0002326!*]->(object:Phenotype)
        WHERE NOT ANY (pheno_rel in rel where pheno_rel.isDefinedBy="https://data.monarchinitiative.org/ttl/udp.ttl")
        RETURN path,
        subject, object, relation,
        'gene' AS subject_category,
        'phenotype' AS object_category,
        'inferred' as qualifier

From: https://github.com/monarch-initiative/monarch-cypher-queries/blob/master/src/main/cypher/golr-loader/gene-phenotype.yaml
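For contrast, a rough sketch of what the one-hop lookup {G ?R ?P} amounts to once direct gene-to-phenotype triples exist. This is illustrative only: the `match` helper, the `EX:` identifiers, and the in-memory list of tuples are assumptions, not part of Dipper or SciGraph.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Hypothetical direct triples (subject, predicate, object).
triples = [
    ("EX:geneG", "RO:0002200", "EX:phenoP"),
    ("EX:geneG", "EX:someOtherRelation", "EX:phenoQ"),
    ("EX:geneH", "RO:0002200", "EX:phenoP"),
]

# Equivalent in spirit to SELECT * WHERE {G ?R ?P}: fix the subject only.
hits = match(triples, s="EX:geneG")
```

The whole query is a single pattern with the subject bound, rather than a union of multi-hop paths through intermediate nodes.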

kshefchek commented 6 years ago

Anonymous human variants removed (when possible) in #602