monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Representation of molecular phenotype data #293

Open mbrush opened 8 years ago

mbrush commented 8 years ago

We are starting to pull data about 'molecular phenotypes', which inhere in molecular entities, activities, or processes. These include things like variants causing altered expression, activity, binding, stability, or localization of specific gene products. The neXtProt data described in #292 is a good example of such a source. There are several approaches we might take to representing this data. I pose some questions below to help narrow in on a preferred approach.

Question 1: Represent this data as direct assertions about the affected gene product, or as G2P associations about the genomic variant? For example, consider the assertion from neXtProt that the protein 'BRCA1-p.Ala1708Glu' exhibits decreased localization to the nucleus. neXtProt represents this as a direct relationship between the affected protein and the nucleus, where the nature of the phenotype is captured in the property that links them:

 BRCA1-p.Ala1708Glu    decreased_localization_to      nucleus

neXtProt has created ~20 such properties used to create similar statements that describe various molecular/functional abnormalities, e.g.:

  BRCA1-p.Asp1778Ala       increased_localization_to        cytoplasm
  BRCA1-p.Asp1778Ala       is_a_labile_form_of              BRCA1
  BRCA1-p.Asp1778Ala       decreased_binding_to             BRIP1
  BRCA1-p.Ser988Ala        removes_PTM_site                 BRCA1-pSer988
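To make the shape of these statements concrete, here is a minimal sketch that reads the example lines above as subject/property/object triples and groups them by protein variant. The `Statement` type and both helper functions are hypothetical names for illustration only; this is not how neXtProt distributes its data or how Dipper ingests it.

```python
from collections import defaultdict
from typing import NamedTuple


class Statement(NamedTuple):
    """One protein-centric statement (hypothetical type for this sketch)."""
    subject: str    # variant protein, e.g. 'BRCA1-p.Asp1778Ala'
    predicate: str  # neXtProt-style property, e.g. 'decreased_binding_to'
    obj: str        # target entity, e.g. 'BRIP1'


# The example statements quoted in this issue, as whitespace-separated text.
RAW = """\
BRCA1-p.Ala1708Glu    decreased_localization_to      nucleus
BRCA1-p.Asp1778Ala    increased_localization_to      cytoplasm
BRCA1-p.Asp1778Ala    is_a_labile_form_of            BRCA1
BRCA1-p.Asp1778Ala    decreased_binding_to           BRIP1
BRCA1-p.Ser988Ala     removes_PTM_site               BRCA1-pSer988
"""


def parse_statements(text):
    """Split each line into exactly three fields; skip anything else."""
    statements = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            statements.append(Statement(*parts))
    return statements


def group_by_subject(statements):
    """Collect (predicate, object) pairs under each variant protein."""
    grouped = defaultdict(list)
    for s in statements:
        grouped[s.subject].append((s.predicate, s.obj))
    return dict(grouped)


statements = parse_statements(RAW)
by_variant = group_by_subject(statements)
```

In this form the phenotype information lives entirely in the property name, which is the point the next paragraph picks up.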

Use of such properties to describe the gene product is one valid approach to representing this data. However, implicit in such statements about variant protein activity or localization is a G2P association between the causal genomic variation and a phenotype representing the molecular defect (i.e. that genomic variant 'BRCA1-c.3891G>A' causes a phenotype where the BRCA1 protein exhibits decreased nuclear localization). Framing these assertions as G2P associations would interoperate better with other data in Monarch and phenopackets. Under this approach, we would represent the BRCA1-c.3891G>A localization example as a G2P association such as:

 'BRCA1-c.3891G>A'     has_phenotype    'decreased localization of BRCA1 to the nucleus'
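As a rough sketch of the reframing, the snippet below turns a protein-centric statement into a G2P association. Everything named here is an assumption for illustration: the `PROTEIN_TO_GENOMIC` lookup (its single entry is just the pairing used in the example above, not a verified mapping), the `PHENOTYPE_TEMPLATES` label templates, and the `to_g2p` helper. None of these exist in Dipper or neXtProt.

```python
from typing import NamedTuple, Optional


class G2PAssociation(NamedTuple):
    """Genotype-to-phenotype association (hypothetical type for this sketch)."""
    genotype: str
    relation: str
    phenotype: str


# Hypothetical lookup from variant protein to causal genomic variant.
# The one entry repeats the pairing used in this issue's example.
PROTEIN_TO_GENOMIC = {
    "BRCA1-p.Ala1708Glu": "BRCA1-c.3891G>A",
}

# Hypothetical phenotype label templates, keyed by neXtProt-style property.
PHENOTYPE_TEMPLATES = {
    "decreased_localization_to": "decreased localization of {gene} to the {obj}",
    "increased_localization_to": "increased localization of {gene} to the {obj}",
    "decreased_binding_to": "decreased binding of {gene} to {obj}",
}


def to_g2p(protein_variant: str, prop: str, obj: str) -> Optional[G2PAssociation]:
    """Reframe one protein-centric statement as a G2P association.

    Returns None when the genomic variant or the property is unknown,
    rather than guessing.
    """
    genomic = PROTEIN_TO_GENOMIC.get(protein_variant)
    template = PHENOTYPE_TEMPLATES.get(prop)
    if genomic is None or template is None:
        return None
    # Assumes identifiers of the form '<gene>-p.<change>'.
    gene = protein_variant.split("-p.", 1)[0]
    return G2PAssociation(genomic, "has_phenotype", template.format(gene=gene, obj=obj))


assoc = to_g2p("BRCA1-p.Ala1708Glu", "decreased_localization_to", "nucleus")
```

Here the phenotype is only a text label. How that phenotype should actually be composed is the subject of Question 2.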

Question 2: Assuming we choose the G2P approach, at what granularity do we compose phenotype terms? This relates to the question discussed in phenopackets issue #51 of pre-composing terms vs post-composing expressions of molecular phenotypes when a specific molecule/gene product is involved. Given that the number of molecules such phenotypes could describe is on the order of hundreds of thousands, it seems prohibitive to pre-compose classes such as the 'decreased localization of BRCA1 to the nucleus' phenotype above, or phenotypes like 'increased expression of HER2'.

Instead, there are several approaches that could be taken to formally 'post-compose' expressions in the data that describe such phenotypes, but I will hold off on addressing these until Questions 1 and 2 above are resolved.
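To illustrate the pre- vs post-composition distinction only (not any of the specific approaches deferred above), here is a minimal sketch in which a phenotype is assembled in the data from a few reusable parts instead of being minted as its own class. The `MolecularPhenotype` type and its field names are hypothetical and are not drawn from any ontology or from phenopackets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MolecularPhenotype:
    """Post-composed phenotype expression (hypothetical shape for this sketch)."""
    quality: str                  # e.g. 'decreased', 'increased'
    attribute: str                # e.g. 'localization', 'expression'
    entity: str                   # the gene product involved, e.g. 'BRCA1'
    target: Optional[str] = None  # optional location, e.g. 'nucleus'

    def label(self) -> str:
        """Render a human-readable label from the parts."""
        text = f"{self.quality} {self.attribute} of {self.entity}"
        if self.target:
            text += f" to the {self.target}"
        return text


# The two phenotypes named in Question 2, built from parts at data time.
brca1_loc = MolecularPhenotype("decreased", "localization", "BRCA1", "nucleus")
her2_expr = MolecularPhenotype("increased", "expression", "HER2")

# The same two general parts apply to any gene product without a new class.
genes = ["BRCA1", "HER2", "BRIP1"]
expression_phenotypes = [MolecularPhenotype("increased", "expression", g) for g in genes]
```

The general parts (quality and attribute) stay few in number, while the gene product is supplied per record, which is what avoids creating hundreds of thousands of pre-composed classes.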

Specifically, I am looking to affirm my presumptions that:

  1. we do want to frame this data as G2P associations
  2. we don't want to create pre-composed, granular phenotype classes for things like 'decreased localization of BRCA1 to the nucleus' and 'increased expression of HER2'.

Comments from @cmungall, @mellybelly, others welcome.

mbrush commented 8 years ago

Following up to explore how related models describe G2P associations for molecular phenotypes, and how they might relate to or inform proposed DIPper modelling approaches. Specifically:

I made some notes for initial review and discussion in the google doc here - will record final outcomes in this ticket.