monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Modeling genetic/genomic mapping locations #86

Open bryanlaraway opened 9 years ago

bryanlaraway commented 9 years ago

We need to properly model genetic/genomic landmark locations.

For example (current ZFIN refactor): http://zfin.org/action/mapping/detail/ZDB-SSLP-980528-17 In ZFIN, features of different types (genes, sequence variants, ESTs, cDNAs, SNPs, etc.) are mapped by various panels to single coordinates in either cM or cR. (Note that the cR value provided doesn't include the rads, so it is technically incomplete.) We need to model these coordinates properly.

Another, more complex example (future addition of AnimalQTLdb): http://www.animalgenome.org/cgi-bin/QTLdb/SS/qdetails?QTL_ID=1001 For AnimalQTLdb, we received mappings of traits in bp and cM, as well as nearby flanking markers (see FAQ #5 for flanking marker info). Note that for this resource, we may have a single cM peak coordinate, a range of cM coordinates, or both. Locations in bp are provided with a start and stop bp. Up to five flanking markers can be provided, including a peak flanking marker.
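To make the AnimalQTLdb case concrete, here is a minimal sketch of the fields described above collected into one record. The class name, field names, and the assumption that the peak marker counts toward the five-marker limit are all hypothetical; this is not how dipper models the data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class QTLLocation:
    """Hypothetical container for one AnimalQTLdb mapping (illustrative names)."""
    qtl_id: str
    peak_cm: Optional[float] = None                  # single cM peak, if provided
    range_cm: Optional[Tuple[float, float]] = None   # cM range, if provided
    range_bp: Optional[Tuple[int, int]] = None       # start/stop in bp, if provided
    flanking_markers: List[str] = field(default_factory=list)
    peak_marker: Optional[str] = None                # assumed to count toward the limit

    def __post_init__(self) -> None:
        # Assumption: "up to five flanking markers" includes the peak marker.
        total = len(self.flanking_markers) + (1 if self.peak_marker else 0)
        if total > 5:
            raise ValueError("at most five flanking markers are expected")
        for rng in (self.range_cm, self.range_bp):
            if rng is not None and rng[0] > rng[1]:
                raise ValueError("range start must not exceed range end")

    def coordinate_kinds(self) -> List[str]:
        """List which kinds of coordinates this record actually carries."""
        kinds = []
        if self.peak_cm is not None:
            kinds.append("cM peak")
        if self.range_cm is not None:
            kinds.append("cM range")
        if self.range_bp is not None:
            kinds.append("bp range")
        return kinds


# Made-up coordinate values, for illustration only.
qtl = QTLLocation(qtl_id="1001", peak_cm=10.0, range_cm=(5.0, 15.0))
print(qtl.coordinate_kinds())
```

Keeping every coordinate optional reflects that a record may have a peak, a range, or both, so whatever triples we settle on would need to handle each combination.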

What would be the appropriate triples in these examples?

nlwashington commented 9 years ago

So, cM and cR are really linkage (genetic) map distance units, whereas bp are genomic (physical) count units. Any given genomic feature could have both kinds of coordinates. I feel like it's akin to, say, distances measured in GPS vs meters vs feet. They are all valid and accurate, may have different levels of resolution, and may need some reference information to perform conversions.

However, bp units are generally highly dependent on the build (for human, hg19 vs hg38). cM units probably don't change. cR units are, as @bryanlaraway says, somewhat dependent on the radiation dose given.

Definitions from NCBI: CentiMorgan (cM): A unit used to express distances on a genetic map. In genetic mapping, distances between markers are determined by measuring the rate of meiotic recombination between them, which increases proportionately with the distance separating them. A cM is defined as the length of an interval in which there is a 1% probability of recombination. On average, 1 cM is roughly equivalent to 1 megabase (Mb) of DNA, although this can vary widely due to hot and cold spots of recombination.

CentiRay (cR): A unit of genetic map distance corresponding to an interval in which there is a 1% probability of X-irradiation-induced breakage. To be completely specified, the unit must be qualified by the radiation dosage in rads (e.g. cR8000), because this determines the actual breakage probability.
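Since a cR value is only complete when the dose is attached, one way to keep track of that is to parse labels of the form shown in the definition (e.g. cR8000) and record whether the dose is present. This is only a sketch; the function and type names are hypothetical, and it assumes the dose is written as an integer number of rads directly after "cR".

```python
import re
from typing import NamedTuple, Optional


class CentiRayUnit(NamedTuple):
    """A cR unit, optionally qualified by the radiation dose in rads."""
    rads: Optional[int]  # None when the source gives a bare "cR"


# Assumed label form: "cR" optionally followed by an integer dose, e.g. "cR8000".
_CR_LABEL = re.compile(r"^cR(\d+)?$")


def parse_cr_unit(label: str) -> CentiRayUnit:
    """Parse a label such as 'cR8000' or 'cR' into a CentiRayUnit."""
    match = _CR_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a centiRay unit label: {label!r}")
    dose = match.group(1)
    return CentiRayUnit(rads=int(dose) if dose else None)


def is_fully_specified(unit: CentiRayUnit) -> bool:
    """A cR unit is only complete when the dose is known."""
    return unit.rads is not None


print(is_fully_specified(parse_cr_unit("cR8000")))  # dose present
print(is_fully_specified(parse_cr_unit("cR")))      # dose missing, as in the ZFIN values
```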

There is a lot of legacy phenotypic information for alleles that will only be specified in cM or cR, which we will want to search and operate on, generally integrated with the rest of our data.

cM and cR are in the unit ontology: http://purl.obolibrary.org/obo/UO_0000326 and http://purl.obolibrary.org/obo/UO_0000327, as is base pair http://purl.obolibrary.org/obo/UO_0000244.
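As a starting point for discussion, here is a rough sketch of what triples for a single-coordinate mapping might look like, written out as N-Triples strings. Only the three unit IRIs come from the unit ontology; the `http://example.org/mapping/` namespace and its predicates (`hasPosition`, `value`, `unit`, `reference`) are placeholders I made up, the ZFIN page URLs stand in for real identifiers, and the coordinate value is invented.

```python
from typing import List

OBO = "http://purl.obolibrary.org/obo/"
UNIT_IRI = {
    "cM": OBO + "UO_0000326",
    "cR": OBO + "UO_0000327",
    "bp": OBO + "UO_0000244",
}
EX = "http://example.org/mapping/"  # hypothetical namespace, not a real vocabulary
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def position_triples(feature_iri: str, position_id: str, value: float,
                     unit: str, reference_iri: str) -> List[str]:
    """Return N-Triples lines for one coordinate on one map or panel."""
    pos = f"_:{position_id}"  # blank node for the position itself
    return [
        f"<{feature_iri}> <{EX}hasPosition> {pos} .",
        f'{pos} <{EX}value> "{value}"^^<{XSD_DOUBLE}> .',
        f"{pos} <{EX}unit> <{UNIT_IRI[unit]}> .",
        # The reference is the panel (for cM/cR) or the build (for bp).
        f"{pos} <{EX}reference> <{reference_iri}> .",
    ]


triples = position_triples(
    feature_iri="http://zfin.org/action/mapping/detail/ZDB-SSLP-980528-17",
    position_id="p1",
    value=12.5,  # invented value, for illustration only
    unit="cR",
    reference_iri="http://zfin.org/action/mapping/panel-detail/ZDB-REFCROSS-980526-5",
)
print("\n".join(triples))
```

Hanging the unit and the reference off the position node means the same feature can carry several positions, one per panel or build, without them being confused.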

Here's an example of a radiation panel description for the zfin example above. Interestingly, there are four of these maps that give slightly different cR coordinates for the same allele. http://zfin.org/action/mapping/panel-detail/ZDB-REFCROSS-980526-5

nlwashington commented 9 years ago

I have submitted an issue regarding the use of FALDO with genetic coordinates here: https://github.com/JervenBolleman/FALDO/issues/24