monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

deal with non-unique chr and chr bands #48

Open nlwashington opened 9 years ago

nlwashington commented 9 years ago

Many genes are mapped to non-unique chromosomes or bands. Non-unique chromosomes are usually the pseudoautosomal regions (PARs) of X and Y, and are listed as X|Y.

Examples of non-unique bands: 15q11-q22, Xp21.2-p11.23, 15q22-qter, 10q11.1-q24, 12p13.3-p13.2|12p13-p12, 1p13.3|1p21.3-p13.1, 12cen-q21, 22q13.3|22q13.3

These need to be handled in the parser. The spans (delimited with a hyphen) are relatively easy, but the alternative regions (delimited with a pipe) are harder.
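As a rough illustration of the parsing problem, here is a minimal Python sketch that splits a map location string into pipe-delimited alternatives and hyphen-delimited spans. The names `BandSpan` and `parse_map_location` are hypothetical and not part of dipper, and the chromosome pattern only covers the forms shown in the examples above.

```python
import re
from typing import List, NamedTuple, Optional


class BandSpan(NamedTuple):
    """One alternative location: a chromosome plus an optional band span."""
    chromosome: str
    start: Optional[str]  # e.g. "q11" or "cen"; None if only a chromosome is given
    end: Optional[str]    # same as start for a single band


# Assumption: the chromosome is 1-2 digits, X, or Y, and the rest is the band part.
_LOCATION = re.compile(r"^(\d{1,2}|X|Y)(.*)$")


def parse_map_location(text: str) -> List[BandSpan]:
    """Split 'A|B' into alternatives and 'start-end' into spans."""
    spans: List[BandSpan] = []
    for alternative in text.split("|"):
        match = _LOCATION.match(alternative.strip())
        if match is None:
            raise ValueError(f"unrecognized location: {alternative!r}")
        chromosome, bands = match.groups()
        if not bands:
            span = BandSpan(chromosome, None, None)
        else:
            start, _, end = bands.partition("-")
            if not end:
                end = start  # single band, not a span
            elif end[0].isdigit() and start[0] in "pq":
                end = start[0] + end  # shorthand such as "q11-22" means "q11-q22"
            span = BandSpan(chromosome, start, end)
        if span not in spans:  # drop repeats such as "22q13.3|22q13.3"
            spans.append(span)
    return spans
```

With this sketch, `parse_map_location("12p13.3-p13.2|12p13-p12")` yields two alternatives on chromosome 12, and `parse_map_location("X|Y")` yields two chromosome-only alternatives. How the alternatives should then be modeled is the open question below.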

@mbrush or @cmungall, how would you model a gene that is localized to alternative regions X OR Y? These are not "fuzzy regions" as in FALDO, which are about borders that are not precisely defined by coordinates; these are truly an OR. Additionally, the PAR regions are actually an AND.

nlwashington commented 9 years ago

For genes in pseudoautosomal regions, or in regions delimited with a pipe, I can simply link to both of the chromosomal parents.

The regions that span broad bands are tougher. I could either:

  1. link to the most specific region I can, which might be at the level of an arm (for example, for 15q11-22 I could just link to 15q).
  2. @mbrush, can I list a feature as a "start position"? We want to say that the feature lies within a broader span, but I am unclear how to say that it is "somewhere between feature X and feature Y". Perhaps using InRangePosition, something like:
myfeature:123 faldo:location _:myfeatureRegion .
_:myfeatureRegion a faldo:Region ;
     faldo:begin _:myfeatureRangeBeginPosition ;
     faldo:end _:myfeatureRangeEndPosition .
# the feature begins somewhere within band X
_:myfeatureRangeBeginPosition a faldo:InRangePosition ;
     faldo:begin _:XBegin ;
     faldo:end _:XEnd .
_:XBegin a faldo:ExactPosition ;
     faldo:position 1 ;
     faldo:reference myfeature:X .
_:XEnd a faldo:ExactPosition ;
     faldo:position nnnn ;   # the length of the band
     faldo:reference myfeature:X .
# the feature ends somewhere within band Y
_:myfeatureRangeEndPosition a faldo:InRangePosition ;
     faldo:begin _:YBegin ;
     faldo:end _:YEnd .
_:YBegin a faldo:ExactPosition ;
     faldo:position 1 ;
     faldo:reference myfeature:Y .
_:YEnd a faldo:ExactPosition ;
     faldo:position nnnn ;   # the length of the band
     faldo:reference myfeature:Y .

I wonder if we can leave off the "positional" coordinate of a position? For example, if we don't know the length of Y, then we might not want to list the position.
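Option 1 above (linking to the most specific enclosing region) could be sketched as follows, assuming band labels are hierarchical so that a shared label prefix names an enclosing region. The function name `most_specific_common_region` is hypothetical, and whether an identifier actually exists for every resulting region (for example a sub-arm region such as 15q2) is an assumption that would need checking.

```python
import os


def most_specific_common_region(chromosome: str, start: str, end: str) -> str:
    """Return the narrowest region label that encloses both ends of a band span.

    Band labels such as "p13.3" are treated as hierarchical, so the longest
    common prefix of the two labels is taken as the enclosing region.
    """
    prefix = os.path.commonprefix([start, end]).rstrip(".")
    # A usable prefix must start with an arm letter; otherwise fall back to
    # the whole chromosome (e.g. a span starting at "cen").
    if not prefix or prefix[0] not in "pq":
        return chromosome
    return chromosome + prefix


print(most_specific_common_region("15", "q11", "q22"))
print(most_specific_common_region("12", "p13.3", "p13.2"))
```

The first call falls back to the arm (15q) because the span crosses two regions of the q arm; the second can stay at the band level (12p13) because both ends are sub-bands of the same band.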

nlwashington commented 9 years ago

@mbrush or @cmungall, any comments on the best way to map a feature that lies within an uncertain region, as in "15q11-22" above?

nlwashington commented 9 years ago

After a deeper dive into the data in the NCBI gene_info download, I am hesitant to actually import these chr band ranges. I found flaws in some of the data when I looked at the corresponding web pages at NCBI directly (inconsistencies between what is labeled as "location" and what is shown in the genome browser). I think the best solution here is to derive these overlaps by importing the genomic coordinates of the genes themselves, either from GFF3 (from NCBI or Ensembl) or from genetic maps from authoritative resources.

Conclusion: For this NCBI resource, I will only link a gene to a chr (or chr band) when it maps to a single, unambiguous chromosome and band. (The exception is X|Y, since those entries usually refer to genes in the PAR.)
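A minimal sketch of that rule, assuming the raw location string is available as-is from the download. The function name `linkable_locations` is hypothetical and is not dipper's actual parser code; treating a repeated alternative such as 22q13.3|22q13.3 as unambiguous is my own reading of the rule, not something stated above.

```python
from typing import List


def linkable_locations(map_location: str) -> List[str]:
    """Return the locations a gene should be linked to, or [] if ambiguous."""
    # Collapse repeated alternatives such as "22q13.3|22q13.3" (an assumption).
    alternatives = sorted({part.strip() for part in map_location.split("|")})
    if alternatives == ["X", "Y"]:
        # Exception: X|Y usually indicates a gene in the PAR; link to both.
        return ["X", "Y"]
    if len(alternatives) != 1:
        return []  # true alternatives: skip
    only = alternatives[0]
    if not only or "-" in only:
        return []  # empty value or a band span: skip
    return [only]
```

For example, `linkable_locations("7q31.2")` returns `["7q31.2"]`, while `linkable_locations("15q11-q22")` and `linkable_locations("1p13.3|1p21.3-p13.1")` both return an empty list.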

cmungall commented 9 years ago

How would these be used? Would they be visible on the ideogram view?

The most compact way to do this would be to define properties such as startsWithin and endsWithin, each being the composition of hasStart or hasEnd with overlaps, i.e. {hasStart, hasEnd} o overlaps.
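To make the composition concrete, here is a small sketch that treats each property as a set of (subject, object) pairs and composes them. All identifiers (`gene:123`, `pos:start123`, `band:15q11`, and so on) are made up for illustration; in an ontology this kind of composition would more likely be declared as an OWL property chain than computed this way.

```python
from typing import Set, Tuple

Relation = Set[Tuple[str, str]]


def compose(first: Relation, second: Relation) -> Relation:
    """Relational composition: (a, c) holds if a -first-> b and b -second-> c."""
    return {(a, c) for (a, b) in first for (b2, c) in second if b == b2}


# Hypothetical data: a gene whose start and end positions each overlap a band.
has_start: Relation = {("gene:123", "pos:start123")}
has_end: Relation = {("gene:123", "pos:end123")}
overlaps: Relation = {
    ("pos:start123", "band:15q11"),
    ("pos:end123", "band:15q22"),
}

starts_within = compose(has_start, overlaps)  # hasStart o overlaps
ends_within = compose(has_end, overlaps)      # hasEnd o overlaps
```

Here the gene ends up directly related to the band it starts within and the band it ends within, without any exact coordinates being asserted.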

(Sent as an email reply to Nicole Washington's comment of 22 Apr 2015, 21:44, which appears in full above.)

kshefchek commented 6 years ago

@mbrush can we close?