monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Review implementation of phenotype propagation #256

Open mbrush opened 8 years ago

mbrush commented 8 years ago

A couple of separate issues here.

  1. Where and how will phenotype propagation get implemented going forward? At present propagation seems to be implemented via Cypher queries here. Property chains defined in GENO are not leveraged, and 'inferred' G2P links are not materialized in any Monarch data (i.e. they appear only in the table layout of the web app UI).
  2. What are the specific propagations we want to implement up and down a genotype, and where do we want to expressly avoid propagation (e.g. to genes contributing only markers or regulatory regions to transgenes, or perhaps to genes that are targeted by morpholinos only to prevent apoptotic death, such as p53 in ZFIN; see #233)? What other gotchas might we encounter?
  3. How are we going to distinguish asserted associations from those that are inferred? One approach is to make property chains result in a new relationship (e.g. inferred_to_cause_condition) to distinguish these inferred associations from the asserted ones using causes_condition.
  4. What tests exist to validate that phenotype propagation is working as expected?
  5. Are there other phenotype propagations we might consider (e.g. between orthologous genes)?

cmungall commented 8 years ago

cc @balhoff and @jnguyenx to help think about a general solution.

Looking at the complexities of the linked ticket, the easiest solution may be procedural code (that makes use of a declarative query language like SPARQL) which writes new edges into the graph in the manner you suggest.

After that we could explore a more generic rule engine, driven more by the semantics of the ontology.
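As a rough illustration of the procedural option, the sketch below walks a tiny in-memory edge list and writes a new, separately named edge for each gene reached through has_affected_locus. It is plain Python rather than SPARQL or Cypher. The identifiers (`variant:1`, `gene:A`, `HP:1`) are made up, the `propagate` helper is hypothetical, and `inferred_to_cause_condition` is only the relation name floated in the issue description, not an existing GENO or RO term.

```python
# Hypothetical sketch: materialize inferred gene-to-phenotype edges under a
# distinct predicate so they stay separable from asserted ones.

ASSERTED = "causes_condition"
INFERRED = "inferred_to_cause_condition"  # name proposed in this issue; not a real term
AFFECTS = "has_affected_locus"            # label of GENO:0000418


def propagate(edges):
    """Return the new (gene, INFERRED, phenotype) edges implied by `edges`.

    `edges` is an iterable of (subject, predicate, object) tuples.
    """
    edges = set(edges)

    # Index which genes each feature affects.
    loci = {}
    for subj, pred, obj in edges:
        if pred == AFFECTS:
            loci.setdefault(subj, set()).add(obj)

    inferred = set()
    for subj, pred, obj in edges:
        if pred != ASSERTED:
            continue
        for gene in loci.get(subj, ()):
            # Assumption: do not duplicate a link that is already asserted directly.
            if (gene, ASSERTED, obj) not in edges:
                inferred.add((gene, INFERRED, obj))
    return inferred


graph = [
    ("variant:1", AFFECTS, "gene:A"),
    ("variant:1", ASSERTED, "HP:1"),
    ("gene:B", ASSERTED, "HP:2"),
]
new_edges = propagate(graph)
```

Because the inferred edges carry their own predicate, a consumer can include or drop them with a simple predicate filter. The exclusions discussed in this ticket (marker genes, morpholino targets) are not handled here and would need extra conditions.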

kshefchek commented 7 years ago

We also run into this issue when modeling variant-phenotype associations and propagating to the gene.

When modeling variants that cover more than one gene we use the relation: GENO:0000418 ! has_affected_locus

However, in cases where a variant covers more than one gene (some haplotypes, large deletions), we may not want to propagate the variant-phenotype relation to each gene affected by the variation.

Currently investigating whether we can get around this in our Cypher queries; no luck so far.

kshefchek commented 7 years ago

This is how I plan to solve the variant-gene issue:

    // Non-SNP features propagate only when they affect exactly one gene
    MATCH (locus:gene)<-[:GENO:0000418!]-(feature)
    WITH feature, COUNT(DISTINCT locus) AS gene_count
    WHERE gene_count = 1
    AND NOT feature:snp
    MATCH path=(subject:gene)<-[geno:GENO:0000418!]-(feature)-[:RO:0002200|RO:0002326|RO:0003302!]->(object:Phenotype)
    RETURN DISTINCT path, subject, object
    UNION
    // SNPs propagate to every gene they affect
    MATCH path=(subject:gene)<-[geno:GENO:0000418!]-(feature:snp)-[:RO:0002200|RO:0002326|RO:0003302!]->(object:Phenotype)
    RETURN DISTINCT path, subject, object

We exclude SNPs from the filter because they can affect more than one gene, either via two genes on opposite strands or via overlapping genes on the same strand. For example: https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs3827760 (two genes, same strand) and https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs1551570 (two genes, opposite strands).

This isn't perfect but will get the job done for now.
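To make the intent of the query easier to check (and as a possible starting point for the tests asked about at the top of this issue), here is a minimal Python sketch of the same filter over in-memory data. The function name `gene_phenotype_pairs`, the dictionary inputs, and all identifiers are hypothetical. It mirrors the logic of the query only and does not reproduce how SciGraph evaluates it.

```python
# Hypothetical sketch of the filter expressed by the Cypher query, outside the graph.

def gene_phenotype_pairs(affected_loci, phenotypes, snps):
    """Return the (gene, phenotype) pairs that the filter would propagate.

    affected_loci: dict mapping feature -> set of genes (has_affected_locus)
    phenotypes:    dict mapping feature -> set of phenotypes
    snps:          set of features typed as SNPs
    """
    pairs = set()
    for feature, genes in affected_loci.items():
        # Non-SNP features propagate only when they affect exactly one gene.
        if feature not in snps and len(genes) != 1:
            continue
        # SNPs and single-gene features propagate to every affected gene.
        for gene in genes:
            for phenotype in phenotypes.get(feature, ()):
                pairs.add((gene, phenotype))
    return pairs


affected_loci = {
    "deletion:1": {"gene:A", "gene:B"},  # large deletion spanning two genes
    "indel:1": {"gene:C"},               # single-gene variant
    "snp:1": {"gene:D", "gene:E"},       # SNP in two overlapping genes
}
phenotypes = {
    "deletion:1": {"HP:1"},
    "indel:1": {"HP:2"},
    "snp:1": {"HP:3"},
}
pairs = gene_phenotype_pairs(affected_loci, phenotypes, snps={"snp:1"})
```

In this toy data the two-gene deletion contributes nothing, the single-gene indel propagates to its one gene, and the SNP propagates to both of its genes.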