monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Linking broadly located variants to chromosomal regions #58

Open mbrush opened 9 years ago

mbrush commented 9 years ago

Some data sources provide only very broad location information about a sequence alteration (i.e., at the level of a chromosome region instead of within a specific gene/marker). @nlwashington can provide examples from the data here.

How should we capture this information in a genotype graph? If we treat the location as just a very large marker, we can capture it the same way we capture marker-based locations: by linking the alteration to the marker of which it is a sequence variant. Then in our genotype syntax we would need some convention for labeling this broad 'marker' in a given variant locus. Triples might look like:

mmusChr11p<alt-x>    is_variant_allele_of    mmusChr11p
mmusChr11p<alt-x>    has_variant_part        <alt-x>
mmusChr11p<alt-x>    rdf:type                GENO:'variant locus'
<alt-x>              rdf:type                SO:sequence alteration
mmusChr11p           rdfs:subClassOf         SO:chromosome arm

Note that we are creating classes for chromosomal regions/bands (e.g. mmusChr11p) as per #42 and #43.
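
As a rough illustration of the variant-locus option above, here is a minimal Python sketch that builds the same five triples as plain tuples. The helper name `variant_locus_triples` and the use of bare strings instead of resolved IRIs are assumptions made for the example; this is not dipper's API.

```python
def variant_locus_triples(region, alteration, region_type="SO:chromosome arm"):
    """Return (subject, predicate, object) tuples for the variant-locus pattern.

    All identifiers are placeholder strings copied from the example
    triples above, not resolved IRIs.
    """
    locus = f"{region}<{alteration}>"  # e.g. mmusChr11p<alt-x>
    alt = f"<{alteration}>"
    return [
        (locus, "is_variant_allele_of", region),
        (locus, "has_variant_part", alt),
        (locus, "rdf:type", "GENO:'variant locus'"),
        (alt, "rdf:type", "SO:sequence alteration"),
        (region, "rdfs:subClassOf", region_type),
    ]


triples = variant_locus_triples("mmusChr11p", "alt-x")
for s, p, o in triples:
    print(f"{s}\t{p}\t{o}")
```

Note that the broad region takes the place a gene or marker would normally occupy, which is why a labeling convention for the locus is needed.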

While the above is one option, I don't think it is practical or useful to define a 'marker' that spans an entire chromosome or band. I would prefer here to forgo creation of the variant locus level in the genotype graph, and link the sequence alteration to its broad chromosomal location in a new triple. The triple might look like:

<alt-x>          [objectProperty]         mmusChr11p

Here the object property could be is_subsequence_of or is_variant_part_of (the latter being used if we want to propagate phenotypes over this link). In its implementation, this triple will pun the chromosomal region class (it links an instance IRI to a class IRI).
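
A minimal sketch of this second option, again with plain tuples. The function name `broad_location_triple` and the `propagate_phenotypes` flag are hypothetical; only the two candidate property names come from the comment above.

```python
def broad_location_triple(alteration, region, propagate_phenotypes=False):
    """Link a sequence alteration directly to a chromosomal region class.

    No variant locus node is created. The predicate is
    is_variant_part_of when phenotypes should propagate over the link,
    and is_subsequence_of otherwise.
    """
    predicate = "is_variant_part_of" if propagate_phenotypes else "is_subsequence_of"
    # The subject is an instance and the object is a class, so this
    # single triple puns the region class.
    return (alteration, predicate, region)


print(broad_location_triple("<alt-x>", "mmusChr11p"))
print(broad_location_triple("<alt-x>", "mmusChr11p", propagate_phenotypes=True))
```

Compared with the variant-locus pattern, this emits one triple per alteration instead of five.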

nlwashington commented 9 years ago

also, i found this page useful to describe the different banding patterns: http://www.pathology.washington.edu/galleries/Cytogallery/main.php?file=banding%20patterns

nlwashington commented 9 years ago

Here's an example of a thing in NCBI: http://www.ncbi.nlm.nih.gov/gene/4384, which has the type of "unknown" in the DB. it is "located" on Xp11-q21.

there is the concept of fuzzy positions in faldo, and perhaps that is what is needed here?

cmungall commented 9 years ago

I think fuzzy locations in FALDO are meant to represent fuzzy locations in GenBank, which would not generally be used here.

Here we could represent a region that starts on Xp11 and ends on q21, and say that the gene's interval is part of this interval. But perhaps it's just fine to make a more general statement: it's located on X. How long will these genes remain unlocated?
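
One way the interval idea could be sketched in Python: split a band range such as "Xp11-q21" into its two bounding bands, then state that the feature is part of the region they delimit. The regular expression handles only simple ranges of this shape, and the predicate names (`starts_on`, `ends_on`, `is_part_of`) and the `chr` prefix are placeholders, not terms from FALDO or dipper.

```python
import re

# Hypothetical pattern for simple ranges such as "Xp11-q21" or "2p25-p12";
# it does not cover the full cytogenetic nomenclature.
BAND_RANGE = re.compile(
    r"^(?P<chrom>\d+|[XY])(?P<start>[pq][\d.]*)-(?P<end>[pq][\d.]*)$"
)


def parse_band_range(label):
    """Split a band range into (chromosome, start band, end band)."""
    match = BAND_RANGE.match(label)
    if match is None:
        raise ValueError(f"unrecognized band range: {label!r}")
    chrom = match.group("chrom")
    return chrom, chrom + match.group("start"), chrom + match.group("end")


def interval_triples(feature, label):
    """Describe a region bounded by two bands and place the feature in it."""
    _, start, end = parse_band_range(label)
    region = f"chr{label}"
    return [
        (region, "starts_on", f"chr{start}"),
        (region, "ends_on", f"chr{end}"),
        (feature, "is_part_of", region),
    ]


# "gene-4384" is a placeholder identifier for the NCBI example above.
for triple in interval_triples("gene-4384", "Xp11-q21"):
    print(triple)
```

The chromosome returned by `parse_band_range` is also enough to make the more general statement that the feature is located on X.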

nlwashington commented 9 years ago

another example: some OMIM diseases are annotated to genomic regions.

for example, http://omim.org/entry/101850 is known to map to a broad region of 2p25-p12.

clearly some feature lies within the region bounded by chr2p25 and chr2p12, but faldo positions do not point to regions. also, the current pattern we've been using is to say that a given X (region) is a subsequence of Y (chromosome band), but that doesn't seem right here.

nlwashington commented 9 years ago

similarly, for some sequence variants we only know the gene that they map to. this is the case for all variants from ZFIN.

kshefchek commented 6 years ago

@mbrush can we close this?