kordk / torch-ecpg

(GPU accelerated) eCpG mapper
BSD 3-Clause "New" or "Revised" License

Feature: eQTM mapping by region #7

Closed kordk closed 1 year ago

kordk commented 2 years ago

With regional annotations, we can filter the pairwise tests based on the genomic locations of the methylation locus and the gene locus.

Gene expression and methylation loci annotations will be provided in browser extensible data (BED) format.

The BED format is described here: https://genome.ucsc.edu/FAQ/FAQformat.html#format1

We will use a BED6 (i.e., the first six columns of the BED format description) design.
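As a rough illustration of the BED6 layout, here is a minimal Python sketch of a reader. The names `Bed6Record` and `read_bed6` are hypothetical and are not part of torch-ecpg. The sketch assumes whitespace-delimited rows, an integer score, and an optional header row beginning with `chrom` (standard BED files have no header row).

```python
import io
from typing import List, NamedTuple, TextIO


class Bed6Record(NamedTuple):
    """One BED6 row: the first six columns of the BED format."""
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    score: int
    strand: str


def read_bed6(handle: TextIO) -> List[Bed6Record]:
    """Parse whitespace-delimited BED6 rows from an open text handle."""
    records = []
    for line in handle:
        fields = line.split()
        # Skip blank lines and a header row starting with "chrom" (assumption).
        if not fields or fields[0] == "chrom":
            continue
        if len(fields) < 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        chrom, start, end, name, score, strand = fields[:6]
        records.append(
            Bed6Record(chrom, int(start), int(end), name, int(score), strand)
        )
    return records


# Usage with two illustrative rows (tab-delimited here, which is an assumption).
sample = io.StringIO(
    "chrom\tchromStart\tchromEnd\tname\tscore\tstrand\n"
    "20\t61847650\t61847650\tcg18478105\t0\t-\n"
    "17\t79524173\t79524222\tILMN_1807600\t0\t-\n"
)
records = read_bed6(sample)
print(records[0].name, records[1].chrom_start)
```

Keeping `chrom` as a string rather than an integer lets values such as `X` pass through unchanged.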

eQTM: expression quantitative trait methylation

Three eQTM mappings will be implemented: Cis-eQTM, Distal-eQTM, and Trans-eQTM.

The default ranges for the Cis-eQTM and Distal-eQTM are described above. Users will be able to set these values if desired.

The following annotation files have been created for the Illumina EPIC (methylation) and HT12 (gene expression) microarrays:

==>    annoEPIC.hg19.bed6  <==
chrom  chromStart          chromEnd   name          score  strand
20     61847650            61847650   cg18478105    0      -
X      24072640            24072640   cg09835024    0      -
9      131463936           131463936  cg14361672    0      +
17     80159506            80159506   cg01763666    0      +
14     105176736           105176736  cg12950382    0      +
13     115000168           115000168  cg02115394    0      +
X      38660511            38660511   cg25813447    0      -
X      14891349            14891349   cg07779434    0      -
12     12849159            12849159   cg13417420    0      -

==>    annoHT12.hg19.bed6  <==
chrom  chromStart          chromEnd   name          score  strand
2      128604584           128604633  ILMN_1792672  0      -
11     193773              193822     ILMN_3237022  0      +
13     44410552            44410601   ILMN_1904052  0      -
17     79524173            79524222   ILMN_1807600  0      -
18     19088353            19088402   ILMN_1805177  0      +
17     56280553            56280602   ILMN_1772631  0      +
1      33478820            33478870   ILMN_1716053  0      -
1      33478706            33478755   ILMN_1670542  0      -
2      6112790             6112839    ILMN_3268232  0      +
liamgd commented 1 year ago

Where are the two BED6 files located? If there is a universal file for Illumina EPIC and HT12, where is this downloaded, and should it be stored in the repository? Or are these files supplied at runtime along with the range flag?

kordk commented 1 year ago

I'll add them to the repository. We will need to include these files (or ones like them) for users to evaluate/test the tool.

In general, users will need to generate their own.

For reference, here is the directory where the files were generated, including the scripts that created them:

# kord@pnldev [14:59:29] ~/proj/torch-ecpg-proj/annot $
ls -l
total 586436
-rw-rw-r-- 1 kord kord  90773183 Oct  1 11:14 annoEPIC.csv
-rw-rw-r-- 1 kord kord  31049598 Oct  1 14:43 annoEPIC.hg19.bed6
-rw-rw-r-- 1 kord kord   1592993 Oct  1 14:34 annoHT12.hg19.bed6
-rw-rw-r-- 1 kord kord      1237 Oct  1 14:42 createEPICHg19.R
-rw-rw-r-- 1 kord kord     13518 Oct  1 14:43 createEPICHg19.R.out
-rw-rw-r-- 1 kord kord       919 Oct  1 14:36 createHT12hg19.R
-rw-rw-r-- 1 kord kord      5142 Oct  1 14:34 createHT12hg19.R.out
-rw-rw-r-- 1 kord kord 469427673 Jan 10  2020 hg19.ensGene.gtf
-rw-rw-r-- 1 kord kord   7623902 Aug 12  2015 humanHt12v4.re-annotator.26426330.txt
kordk commented 1 year ago

Pushed files into demo/ in [main 8a6fa41]

liamgd commented 1 year ago

How should the distance (in megabases) between two loci be calculated? Each locus has a chromStart and a chromEnd value. I can think of a few ways to calculate the distance between loci:

abs(end2 - end1) # Distance between ends
abs(start2 - start1) # Distance between starts
abs((end2 + start2) / 2 - (end1 + start1) / 2) # Distance between centers

Which one of these is preferred?

liamgd commented 1 year ago

Also, are the score or strand values used in the region filtering?

kordk commented 1 year ago

The reference position will be the gene expression data chromStart value. We will define this position as the transcript start site. Ignore chromEnd.

To find the Mt loci for comparison for Cis and Distal, just identify those within X bases of a given gene's chromStart value on the same chromosome. For trans loci, it's all of the Mt loci on a different chromosome.
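A minimal Python sketch of this rule, using only chromStart values. The function names and the `window` parameter are hypothetical, the window sizes in the usage lines are illustrative only and are not the tool's defaults, and treating "within X bases" as inclusive is an assumption.

```python
def start_distance(gx_start: int, mt_start: int) -> int:
    """Absolute distance in bases between the two chromStart values."""
    return abs(mt_start - gx_start)


def same_chrom_within(
    gx_chrom: str, gx_start: int, mt_chrom: str, mt_start: int, window: int
) -> bool:
    """True if the Mt locus is on the gene's chromosome and within
    `window` bases of the gene's chromStart (inclusive, by assumption)."""
    return gx_chrom == mt_chrom and start_distance(gx_start, mt_start) <= window


def is_trans(gx_chrom: str, mt_chrom: str) -> bool:
    """True if the Mt locus is on a different chromosome than the gene."""
    return gx_chrom != mt_chrom


# Rows taken from the annotation excerpts in this thread:
# ILMN_1807600 (chr17, chromStart 79524173) vs. cg01763666 (chr17, 80159506).
print(start_distance(79524173, 80159506))                        # 635333
print(same_chrom_within("17", 79524173, "17", 80159506, 1_000_000))  # True
print(same_chrom_within("17", 79524173, "17", 80159506, 500_000))    # False
# cg18478105 is on chr20, so it is a trans locus for this gene.
print(is_trans("17", "20"))                                      # True
```

Because chromEnd is ignored, this is the "distance between starts" option from the question above.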

As for the strand: The strand gives us the direction of the gene body relative to the reference assembly. DNA has two strands, and one was arbitrarily selected as the reference strand. The strand only really matters when we need to consider flanking direction in our calculations. For what we are doing (i.e., just using equal distances from the chromStart values for both Mt and Gx data) this won't matter. If we cared about different distances for upstream versus downstream, then we would have to consider it.

For example:

liamgd commented 1 year ago

Implemented in bf696eb.