phenopackets / phenopacket-schema

Repository for the GA4GH phenopacket schema
https://phenopacket-schema.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

First-pass phenopacket for mouse data #83

Closed by pnrobinson 5 years ago

pnrobinson commented 5 years ago

I added a new test class (branch mgi_model) that creates a phenopacket describing this mouse model: http://www.informatics.jax.org/allele/MGI:3690325. I think we need to add a new variant type. I suggest we ask Judy, Carol, Terry and others what they think would work best. Do we have any other suggestions? @mellybelly @cmungall @julesjacobsen

{
  "id": "",
  "subject": {
    "id": "MGI:3690326",
    "datasetId": "",
    "sex": "UNKNOWN_SEX",
    "karyotypicSex": "UNKNOWN_KARYOTYPE"
  },
  "phenotypes": [{
    "description": "",
    "type": {
      "id": "MP:0004044",
      "label": "aortic dissection"
    },
    "negated": false,
    "modifiers": [],
    "evidence": []
  }, {
    "description": "",
    "type": {
      "id": "MP:0000150",
      "label": "abnormal rib morphology"
    },
    "negated": false,
    "modifiers": [],
    "evidence": []
  }, {
    "description": "",
    "type": {
      "id": "MP:0000160",
      "label": "kyphosis"
    },
    "negated": false,
    "modifiers": [],
    "evidence": []
  }, {
    "description": "",
    "type": {
      "id": "MP:0001183",
      "label": "overexpanded pulmonary alveoli"
    },
    "negated": false,
    "modifiers": [],
    "evidence": []
  }, {
    "description": "",
    "type": {
      "id": "MP:0003211",
      "label": "abnormal aorta elastic fiber morphology"
    },
    "negated": false,
    "modifiers": [],
    "evidence": []
  }, {
    "description": "",
    "type": {
      "id": "MP:0006120",
      "label": "mitral valve prolapse"
    },
    "negated": false,
    "modifiers": [],
    "evidence": []
  }, {
    "description": "",
    "type": {
      "id": "MP:0003923",
      "label": "abnormal heart left atrium morphology"
    },
    "negated": false,
    "modifiers": [],
    "evidence": []
  }, {
    "description": "",
    "type": {
      "id": "MP:0003921",
      "label": "abnormal heart left ventricle morphology"
    },
    "negated": false,
    "modifiers": [],
    "evidence": []
  }, {
    "description": "",
    "type": {
      "id": "MP:0010996",
      "label": "increased aorta wall thickness"
    },
    "negated": false,
    "modifiers": [],
    "evidence": []
  }],
  "biosamples": [],
  "genes": [],
  "variants": [{
    "hgvsAllele": {
      "id": "",
      "hgvs": "Fbn1tm1Hcd"
    },
    "genotype": {
      "id": "GENO:0000135",
      "label": "heterozygous"
    }
  }],
  "diseases": [],
  "htsFiles": [],
  "metaData": {
    "createdBy": "Peter",
    "submittedBy": "",
    "resources": [{
      "id": "mp",
      "name": "mammalian phenotype ontology",
      "namespacePrefix": "MP",
      "url": "http://purl.obolibrary.org/obo/mp.owl",
      "version": "2019-03-08",
      "iriPrefix": "http://purl.obolibrary.org/obo/MP_"
    }, {
      "id": "geno",
      "name": "Genotype Ontology",
      "namespacePrefix": "GENO",
      "url": "http://purl.obolibrary.org/obo/geno.owl",
      "version": "19-03-2018",
      "iriPrefix": "http://purl.obolibrary.org/obo/GENO_"
    }],
    "updated": [],
    "externalReferences": [{
      "id": "PMID:15254584",
      "description": "Heterozygous Fbn1 C1039G mutation. Judge DP, Biery NJ, Keene DR, Geubtner J, Myers L, Huso DL, Sakai LY, Dietz\nHC. Evidence for a critical contribution of haploinsufficiency in the complex\npathogenesis of Marfan syndrome. J Clin Invest. 2004;114(2):172-81."
    }]
  }
}
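
One cheap sanity check on a packet like this is that every ontology term it uses has its prefix declared under metaData.resources. The following is a minimal sketch, not part of the schema or its tooling: the helper names `ontology_prefixes` and `undeclared_prefixes` are made up for illustration, and it assumes that any JSON object carrying both "id" and "label" is an ontology term. The packet is an abbreviated copy of the one above.

```python
import json


def ontology_prefixes(node):
    """Yield the CURIE prefix of every {"id", "label"} object in a parsed JSON tree."""
    if isinstance(node, dict):
        term_id = node.get("id")
        # Assumption: objects with both "id" and "label" are ontology terms.
        if "label" in node and isinstance(term_id, str) and ":" in term_id:
            yield term_id.split(":", 1)[0]
        for value in node.values():
            yield from ontology_prefixes(value)
    elif isinstance(node, list):
        for item in node:
            yield from ontology_prefixes(item)


def undeclared_prefixes(packet):
    """Return prefixes used by ontology terms but missing from metaData.resources."""
    resources = packet.get("metaData", {}).get("resources", [])
    declared = {r["namespacePrefix"] for r in resources}
    return set(ontology_prefixes(packet)) - declared


# Abbreviated copy of the packet above: one phenotype, one genotype, both resources.
packet = json.loads("""
{
  "subject": {"id": "MGI:3690326"},
  "phenotypes": [{"type": {"id": "MP:0004044", "label": "aortic dissection"}}],
  "variants": [{"genotype": {"id": "GENO:0000135", "label": "heterozygous"}}],
  "metaData": {"resources": [
    {"id": "mp", "namespacePrefix": "MP"},
    {"id": "geno", "namespacePrefix": "GENO"}
  ]}
}
""")

print(undeclared_prefixes(packet))  # prints set(): MP and GENO are both declared
```

The subject and external reference identifiers (MGI, PMID) are deliberately not checked here, since they carry no "label"; whether those prefixes should also be declared as resources is a separate question.
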
mellybelly commented 5 years ago

Hmm, the HGVS allele doesn't really work here; maybe something more generic? We could also do better with evidence. @mbrush, what should be included here?

One of the problems we have with model organism phenopackets is they will necessarily represent populations rather than individuals. e.g. there can be variable penetrance & expressivity of a phenotype for a given genotype.

julesjacobsen commented 5 years ago

For mouse we'd want the genetic background and the genotype, e.g.:

ID   Allelic Composition                  Genetic Background                                        Disease Model
hm1  Fbn1tm1Hcd/Fbn1tm1Hcd                involves: 129S1/Sv * 129X1/SvJ * C57BL/6J
ht2  Fbn1tm1Hcd/Fbn1+                     involves: 129S1/Sv * 129X1/SvJ * C57BL/6J                 yes
cx3  Fbn1tm1Hcd/Fbn1+ Tgfb2tm1Doe/Tgfb2+  involves: 129P2/OlaHsd * 129S1/Sv * 129X1/SvJ * C57BL/6J  yes

julesjacobsen commented 5 years ago

So your current example

"variants": [{
    "hgvsAllele": {
      "id": "",
      "hgvs": "Fbn1tm1Hcd"
    },
    "genotype": {
      "id": "GENO:0000135",
      "label": "heterozygous"
    }
  }]

might be better modeled like this?

message Variant {
    oneof allele {
        HgvsAllele hgvs_allele = 2;
        VcfAllele vcf_allele = 3;
        SpdiAllele spdi_allele = 4;
        IscnAllele iscn_allele = 5;
        MouseAllele mouse_allele = 8;
    }
    // Genotype of the alleles using GENO ontology
    OntologyClass genotype = 6;
    // For mice the background of the variant is also required 
    // e.g. involves: 129S1/Sv * 129X1/SvJ * C57BL/6J 
    // see http://www.informatics.jax.org/allele/MGI:3690325#phenotypes for examples
    string background = 7;
}

// See http://informatics.jax.org/mgihome/nomen/
// To encode the allele Fbn1<sup>tm1Hcd</sup> 
message MouseAllele {
    string id = 1;
    // e.g., Fbn1
    string gene = 2;
    // The allele_code should be used for the allele name or lab code, which is written
    // in superscript according to the International Committee on Standardized Genetic
    // Nomenclature for Mice
    // e.g. tm1Hcd
    string allele_code = 3;
}

So, using the above, the example would become:

ht2

"variants": [{
    "mouseAllele": {
      "id": "",
      "gene": "Fbn1",
      "alleleCode": "tm1Hcd"
    },
    "genotype": {
      "id": "GENO:0000135",
      "label": "heterozygous"
    },
    "background": "involves: 129S1/Sv * 129X1/SvJ * C57BL/6J"
  }]

hm1

"variants": [{
    "mouseAllele": {
      "id": "",
      "gene": "Fbn1",
      "alleleCode": "tm1Hcd"
    },
    "genotype": {
      "id": "GENO:0000136",
      "label": "homozygous"
    },
    "background": "involves: 129S1/Sv * 129X1/SvJ * C57BL/6J"
  }]

cx3

"variants": [{
    "mouseAllele": {
      "id": "",
      "gene": "Fbn1",
      "alleleCode": "tm1Hcd"
    },
    "genotype": {
      "id": "GENO:0000135",
      "label": "heterozygous"
    },
    "background": "involves: 129P2/OlaHsd * 129S1/Sv * 129X1/SvJ * C57BL/6J"
  },
  {
    "mouseAllele": {
      "id": "",
      "gene": "Tgfb2",
      "alleleCode": "tm1Doe"
    },
    "genotype": {
      "id": "GENO:0000135",
      "label": "heterozygous"
    },
    "background": "involves: 129P2/OlaHsd * 129S1/Sv * 129X1/SvJ * C57BL/6J"
  }]
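
To avoid hand-writing these fragments, the proposed messages can be mirrored in a few lines of Python. This is only a sketch under assumptions: the `MouseAllele` dataclass and the `variant` helper are hypothetical stand-ins for code generated from the proposed proto, and the JSON key `alleleCode` assumes protobuf's default lowerCamelCase JSON mapping of the `allele_code` field.

```python
import json
from dataclasses import dataclass

# GENO terms as used in the examples above.
HETEROZYGOUS = {"id": "GENO:0000135", "label": "heterozygous"}
HOMOZYGOUS = {"id": "GENO:0000136", "label": "homozygous"}


@dataclass
class MouseAllele:
    """Hypothetical mirror of the proposed MouseAllele message."""
    gene: str          # e.g. "Fbn1"
    allele_code: str   # the superscripted part of the symbol, e.g. "tm1Hcd"
    id: str = ""


def variant(allele, genotype, background):
    """Build one entry of the "variants" list in the JSON shape shown above."""
    return {
        "mouseAllele": {
            "id": allele.id,
            "gene": allele.gene,
            # Assumption: lowerCamelCase JSON name for the allele_code field.
            "alleleCode": allele.allele_code,
        },
        "genotype": dict(genotype),
        "background": background,
    }


# The complex genotype cx3: two heterozygous alleles on the same background.
background = "involves: 129P2/OlaHsd * 129S1/Sv * 129X1/SvJ * C57BL/6J"
cx3 = [
    variant(MouseAllele("Fbn1", "tm1Hcd"), HETEROZYGOUS, background),
    variant(MouseAllele("Tgfb2", "tm1Doe"), HETEROZYGOUS, background),
]

print(json.dumps({"variants": cx3}, indent=2))
```

Note that the background string is repeated on every variant of a complex genotype, which suggests it may belong to the subject rather than to each variant.
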
julesjacobsen commented 5 years ago

@mellybelly

One of the problems we have with model organism phenopackets is they will necessarily represent populations rather than individuals. e.g. there can be variable penetrance & expressivity of a phenotype for a given genotype.

Perhaps, but not always. For example, IMPC will have an equal number of males and females for any given knockout. These could be represented as a cohort, with each mouse's specific phenotypes recorded in a distinct phenopacket. This allows for phenotypic variability for a genotype. For a more nebulous 'mouse model', i.e. an amalgam, we ought to have a frequency associated with the phenotype. Currently this can be represented as an ontology term in the Phenotype.modifiers field.
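
To make the cohort idea concrete, here is a rough sketch of how per-mouse phenopackets could be reduced to a phenotype frequency for the genotype. Everything in it is illustrative: `phenotype_frequencies` is a hypothetical helper, the four mice and their observations are invented, and the packets are plain dicts shaped like the JSON earlier in this thread.

```python
from collections import Counter


def phenotype_frequencies(cohort):
    """Return {term id: fraction of packets in which the term is observed}."""
    counts = Counter()
    for packet in cohort:
        # Count each term once per mouse and skip negated (excluded) phenotypes.
        observed = {
            p["type"]["id"]
            for p in packet.get("phenotypes", [])
            if not p.get("negated", False)
        }
        counts.update(observed)
    total = len(cohort)
    return {term: n / total for term, n in counts.items()} if total else {}


def mouse(subject_id, dissection, kyphosis):
    """Build a minimal packet for one (invented) mouse."""
    return {
        "subject": {"id": subject_id},
        "phenotypes": [
            {"type": {"id": "MP:0004044", "label": "aortic dissection"},
             "negated": not dissection},
            {"type": {"id": "MP:0000160", "label": "kyphosis"},
             "negated": not kyphosis},
        ],
    }


# Invented cohort of four mice sharing one genotype.
cohort = [
    mouse("mouse-1", dissection=True, kyphosis=True),
    mouse("mouse-2", dissection=True, kyphosis=False),
    mouse("mouse-3", dissection=True, kyphosis=False),
    mouse("mouse-4", dissection=False, kyphosis=False),
]

print(phenotype_frequencies(cohort))
```

For this invented cohort, aortic dissection comes out at 0.75 and kyphosis at 0.25. An amalgam-style model would have to carry such a number directly, or approximate it with a frequency term in Phenotype.modifiers.
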

pnrobinson commented 5 years ago

Hi Jules, I like your suggestion for the mouse allele message. Should we implement it and revise the documentation accordingly? @mellybelly -- note that IMPC could represent its data as a cohort of Phenopackets, and we should talk to Terry about whether this is of interest/relevant.

I have made a PR for the current status of the documentation. @julesjacobsen, I do not want to get things too mixed up, so please let me know how we should proceed with implementing the mouseAllele class. https://github.com/phenopackets/phenopacket-schema/pull/86

julesjacobsen commented 5 years ago

Given that you like my suggestion, I'll implement it and push to master for you to pull, and then we can merge your changes.

julesjacobsen commented 5 years ago

See commit: 8a3114054df6ee27259600222ca8205c8917f1aa

julesjacobsen commented 5 years ago

Just another thought: MouseAllele is a pretty shabby name. Perhaps RodentAllele would be more accurate, but what's the acronym for the nomenclature committee? The nomenclature rules seem to have been agreed by both the mouse and rat communities.

cmungall commented 5 years ago

I would get input from Cindy and Mary

Haven't looked at your schema closely but does it handle transgene alleles?

I'd opt for a more generic scheme here that is extensible to any nomenclature, e.g. a tuple of string and nomenclature. I'd use gene IDs for genes
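
A rough illustration of what such a generic scheme might look like, as a sketch only: the `NamedAllele` class and its field names are hypothetical, the gene identifier is a placeholder rather than a real MGI ID, and the angle-bracket form is the plain-text convention MGI uses for superscripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedAllele:
    """Hypothetical nomenclature-agnostic allele: a (symbol, nomenclature) pair."""
    symbol: str        # allele symbol exactly as written in the given nomenclature
    nomenclature: str  # which rules the symbol follows, e.g. "MGI" or "HGVS"
    gene_id: str = ""  # gene identifier (CURIE) rather than a gene symbol

    def to_json(self):
        """Serialize using lowerCamelCase keys, as in the JSON examples above."""
        return {
            "symbol": self.symbol,
            "nomenclature": self.nomenclature,
            "geneId": self.gene_id,
        }


# The mouse allele from this thread; "MGI:0000000" is a placeholder gene ID.
fbn1 = NamedAllele(symbol="Fbn1<tm1Hcd>", nomenclature="MGI", gene_id="MGI:0000000")

print(fbn1.to_json())
```

The trade-off is that consumers must parse the symbol themselves for each nomenclature, whereas a dedicated message such as MouseAllele exposes the parts as fields.
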

julesjacobsen commented 5 years ago

I believe it can, when used as part of the gene field.

http://www.informatics.jax.org/mgihome/nomen/gene.shtml#transg

Cynthia and Carol are having a look through this on Thursday.

cmungall commented 5 years ago

I think I commented on this in a bit of a hurry before. Looks like @cindyJax has taken a look as I see some other tickets. I chatted very briefly with Terry last week.

I don't have much to add beyond what Cindy already posted in her tickets (thanks!) and what @mellybelly said earlier in this ticket: "One of the problems we have with model organism phenopackets is they will necessarily represent populations rather than individuals. e.g. there can be variable penetrance & expressivity of a phenotype for a given genotype." I'd go further: in fact, the example here is for an allele, not a population!

I think it's worthwhile to experiment with extending phenopackets beyond the scope of representing individual human patients, but I would be more comfortable doing this on a branch rather than master. I am worried about consequences of both broadening the scope of what "subject" means in phenopackets as well as baking in assumptions about how model organism databases model complex genotypes.

pnrobinson commented 5 years ago

I think we can leave model organisms for v2. IMPC in principle has individual level phenotypes and this was the original thought. In principle, we should have another message type, such as Model, that would use the elements appropriately.

cindyJax commented 5 years ago

Somehow I missed this whole thread before I posted the other tickets (sorry about the number, trying to be thorough). We (at MGI) had discussed the issue at hand - how do mouse populations relate to individual patient data? Much of what you have pulled together so far is quite human-centric.

pnrobinson commented 5 years ago

After many discussions, the decision is to table the mouse/model phenopacket until the VR group have finalized their model, so that we can represent the variants in VR.