Open cmungall opened 6 years ago
+1,000,000
Dipper should not introduce blank nodes unless they are present in the structure of the source or they do work for us in reasoning
I think we need to clearly define what this means. I'm guessing our most common use case for bnodes is instances where a source provides a label but does not expose a persistent ID.
I believe it is also applied when the ID is persistent but not resolvable; Tom, can you comment?
I intentionally did not mention our not-even-bnode identifier constructs, created as IRIs for under-identified external objects, to avoid expanding the scope of this issue (but they are the first thing that comes to mind).
One thought may be to establish a CURIE prefix LCL: (shades of the FASTA defline standard), which denotes that what follows is a LOCAL identifier, with disclaimers as to what that means.
What happened to our plan to create surrogate IDs in this stable-but-not-resolvable scenario?
I do not know, perhaps it was another thought NW was not able to communicate before she left?
Sorry I wasn't precise in my original wording. Let's keep this ticket about modeling and not identifiers. I'm talking about the pattern of introducing a blank or otherwise surrogate node for an entity that is only implicit in the source.
The broader context is simplifying things for bioinformaticians. If someone is in front of their computer using either the Neo4j interface or the SciGraph Swagger, and they have a gene G, and they want to find a phenotype P that is somehow associated, then their query should be SELECT * WHERE {G ?R ?P} (or whatever the Cypher equivalent is), not:
MATCH (locus:gene)<-[:GENO:0000418!]-(feature)
WITH feature, COUNT(DISTINCT(locus)) as gene_count
WHERE gene_count = 1
AND NOT feature:snv
MATCH path=(subject:gene)<-[geno:GENO:0000418!]-(feature)-[relation:RO:0002200|RO:0002326|RO:0003302!]->(object:Phenotype)
RETURN path,
subject, object, relation,
'gene' AS subject_category,
'phenotype' AS object_category,
'direct' as qualifier
UNION ALL
MATCH path=(subject:gene)<-[geno:GENO:0000418!]-(feature:snv)-[relation:RO:0002200|RO:0002326|RO:0003302!]->(object:Phenotype)
RETURN path,
subject, object, relation,
'gene' AS subject_category,
'phenotype' AS object_category,
'direct' as qualifier
UNION ALL
MATCH path=(subject:gene)<-[geno:GENO:0000418!*0..1]-(feature)-[relation:RO:0002200|RO:0002326|RO:0003302!]->(object:Phenotype)
WHERE NOT ANY (rel in geno where rel.isDefinedBy="https://data.monarchinitiative.org/ttl/mgi.ttl")
AND NOT ANY (rel in geno where "https://data.monarchinitiative.org/ttl/mgi.ttl" in rel.isDefinedBy)
RETURN path,
subject, object, relation,
'gene' AS subject_category,
'phenotype' AS object_category,
'direct' as qualifier
UNION ALL
MATCH (relation:Node{iri:'http://purl.obolibrary.org/obo/RO_0002200'})
WITH relation
MATCH path=(subject:gene)<-[:GENO:0000418!*0..1]-(feature)-[:RO:0002200|RO:0003302!*2..]->(object:Phenotype)
RETURN path,
subject, object, relation,
'gene' AS subject_category,
'phenotype' AS object_category,
'inferred through variant' as qualifier
UNION ALL
MATCH (relation:Node{iri:'http://purl.obolibrary.org/obo/RO_0002200'})
WITH relation
MATCH path=(subject:gene)<-[:GENO:0000418!*0..1]-(feature)<-[:BFO:0000051!*]-(genotype:genotype)-[rel:RO:0002200|RO:0002326|RO:0003302!*]->(object:Phenotype)
WHERE NOT ANY (pheno_rel in rel where pheno_rel.isDefinedBy="https://data.monarchinitiative.org/ttl/mgi.ttl" OR pheno_rel.isDefinedBy="https://data.monarchinitiative.org/ttl/zfin.ttl")
RETURN path,
subject, object, relation,
'gene' AS subject_category,
'phenotype' AS object_category,
'inferred' as qualifier
UNION ALL
MATCH path=(subject:gene)<-[:GENO:0000418]-(allele)-[:BFO:0000051!*]->(feature)-[relation:RO:0002200|RO:0002326|RO:0002610|RO:0003302!]->(object:Phenotype)
RETURN path,
subject, object, relation,
'gene' AS subject_category,
'phenotype' AS object_category,
'inferred' as qualifier
UNION ALL
MATCH (relation:Node{iri:'http://purl.obolibrary.org/obo/RO_0002200'})
WITH relation
MATCH path=(subject:gene)<-[:GENO:0000418!*0..1]-(feature)<-[:BFO:0000051!*]-(genotype:genotype)<-[:GENO:0000222|RO:0001000*1..2]-(person)-[rel:RO:0002200|RO:0002326!*]->(object:Phenotype)
WHERE NOT ANY (pheno_rel in rel where pheno_rel.isDefinedBy="https://data.monarchinitiative.org/ttl/udp.ttl")
RETURN path,
subject, object, relation,
'gene' AS subject_category,
'phenotype' AS object_category,
'inferred' as qualifier
Anonymous human variants removed (when possible) in #602
Dipper should not introduce blank nodes unless they are present in the structure of the source or they do work for us in reasoning. The structure of the graph should reflect the current Solr 'triples'.
Even in cases where these are introduced, we should place a shortcut relation.
This would need to be coordinated with https://github.com/monarch-initiative/monarch-cypher-queries/tree/master/src/main/cypher/golr-loader (the queries would become simpler, as there will be an isomorphic triple-to-triple mapping).
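To make the shortcut-relation idea concrete, here is a minimal Python sketch over plain tuples. It is not Dipper code. The function names, the example CURIEs (`GENE:1`, `PHENO:1`, `_:v1`), and the choice to reuse RO:0002200 as the shortcut predicate are all assumptions made for illustration. It shows that once a direct gene-to-phenotype edge is materialized next to an intermediate node, the lookup collapses to a single triple pattern.

```python
# Illustrative only: triples are (subject, predicate, object) tuples of CURIE strings.
FEATURE_TO_GENE = "GENO:0000418"  # feature -> gene edge, as traversed in the query above
HAS_PHENOTYPE = "RO:0002200"      # assumed here to also serve as the shortcut predicate


def add_shortcuts(triples):
    """Return the input triples plus a direct gene -> phenotype edge
    for every gene <- feature -> phenotype path."""
    genes_of = {}
    for s, p, o in triples:
        if p == FEATURE_TO_GENE:
            genes_of.setdefault(s, set()).add(o)
    shortcuts = {
        (gene, HAS_PHENOTYPE, o)
        for s, p, o in triples
        if p == HAS_PHENOTYPE
        for gene in genes_of.get(s, ())
    }
    return set(triples) | shortcuts


def phenotypes_of(triples, gene):
    """The 'G ?R ?P' style lookup: a single pattern, no path traversal."""
    return sorted(o for s, p, o in triples if s == gene and p == HAS_PHENOTYPE)


if __name__ == "__main__":
    source = [
        ("_:v1", FEATURE_TO_GENE, "GENE:1"),  # blank node for an implicit variant
        ("_:v1", HAS_PHENOTYPE, "PHENO:1"),
    ]
    graph = add_shortcuts(source)
    print(phenotypes_of(graph, "GENE:1"))
```

The intermediate node is kept, so anything that still needs it for reasoning can use it, while a simple client only ever has to ask for the direct edge.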