Closed by kordk 1 year ago
Where are the two BED6 files located? If there is a universal file for Illumina EPIC and HT12, where is it downloaded from, and should it be stored in the repository? Or are these files supplied at runtime along with the range flag?
I'll add them to the repository. We will need to include these files (or ones like them) for users to evaluate/test the tool.
In general, users will need to generate their own.
For reference, here is the directory where the files were generated, including the R scripts that created them:
# kord@pnldev [14:59:29] ~/proj/torch-ecpg-proj/annot $
ls -l
total 586436
-rw-rw-r-- 1 kord kord 90773183 Oct 1 11:14 annoEPIC.csv
-rw-rw-r-- 1 kord kord 31049598 Oct 1 14:43 annoEPIC.hg19.bed6
-rw-rw-r-- 1 kord kord 1592993 Oct 1 14:34 annoHT12.hg19.bed6
-rw-rw-r-- 1 kord kord 1237 Oct 1 14:42 createEPICHg19.R
-rw-rw-r-- 1 kord kord 13518 Oct 1 14:43 createEPICHg19.R.out
-rw-rw-r-- 1 kord kord 919 Oct 1 14:36 createHT12hg19.R
-rw-rw-r-- 1 kord kord 5142 Oct 1 14:34 createHT12hg19.R.out
-rw-rw-r-- 1 kord kord 469427673 Jan 10 2020 hg19.ensGene.gtf
-rw-rw-r-- 1 kord kord 7623902 Aug 12 2015 humanHt12v4.re-annotator.26426330.txt
Pushed files into demo/ in [main 8a6fa41]
How should the distance (in megabases) between different loci be calculated? Each locus has a chromStart and chromEnd value. I can think of a few ways to calculate the distance between loci:
abs(end2 - end1) # Distance between ends
abs(start2 - start1) # Distance between starts
abs((end2 + start2) / 2 - (end1 + start1) / 2) # Distance between centers
Which one of these is preferred?
Also, are the score or strand values used in the region filtering?
The reference position will be the gene expression data chromStart value. We will define this position as the transcript start site. Ignore chromEnd.
To find the Mt loci for comparison for Cis and Distal, just identify those within X bases of a given gene's chromStart value on the same chromosome. For trans loci, it's all of the Mt loci on a different chromosome.
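A minimal sketch of that selection rule, not the code in the repository. The names `Locus`, `loci_within`, and `trans_loci` are hypothetical, the window value is purely illustrative (it is not one of the project defaults), and treating "within X bases" as inclusive is an assumption.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    """Hypothetical record holding only the fields the filter needs."""
    chrom: str
    chrom_start: int
    name: str


def loci_within(gene, mt_loci, window):
    """Mt loci on the gene's chromosome whose chromStart lies within
    `window` bases of the gene's chromStart (inclusive, by assumption)."""
    return [
        m for m in mt_loci
        if m.chrom == gene.chrom
        and abs(m.chrom_start - gene.chrom_start) <= window
    ]


def trans_loci(gene, mt_loci):
    """All Mt loci on a chromosome other than the gene's."""
    return [m for m in mt_loci if m.chrom != gene.chrom]


# Made-up coordinates, for illustration only.
gene = Locus("chr1", 1_000_000, "geneA")
mt_loci = [
    Locus("chr1", 1_000_500, "cg_near"),
    Locus("chr1", 5_000_000, "cg_far"),
    Locus("chr2", 1_000_100, "cg_other_chrom"),
]

print([m.name for m in loci_within(gene, mt_loci, window=1_000)])
print([m.name for m in trans_loci(gene, mt_loci)])
```

Only chromStart is compared, and chromEnd never enters the calculation, which matches the reference position defined above.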
As for the strand: The strand gives us the direction of the gene body with relation to the reference assembly. DNA has two strands, and one was arbitrarily selected as the reference strand. The strand only really matters when we need to consider flanking direction in our calculations. For what we are doing (i.e., just using equal distances from the chromStart values for both Mt and Gx data) this won't matter. If we cared about different distances upstream versus downstream, then we would have to consider it.
For example:
Implemented in bf696eb.
With regional annotations, we can filter the pairwise tests based on the genomic locations of the methylation locus and the gene locus.
Gene expression and methylation loci annotations will be provided in browser extensible data (BED) format.
The BED format is described here: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
We will use a BED6 (i.e., the first six columns of the BED format description: chrom, chromStart, chromEnd, name, score, strand) design.
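As a rough illustration of the BED6 layout, here is a small parser sketch. `Bed6` and `parse_bed6_line` are hypothetical names, not part of the tool. The sketch assumes tab-delimited input and keeps the score as a string because some files may use a placeholder there. The example record is made up.

```python
from typing import NamedTuple


class Bed6(NamedTuple):
    """The six BED6 columns, in file order."""
    chrom: str
    chrom_start: int   # 0-based start position
    chrom_end: int     # end position (not included in the feature)
    name: str
    score: str         # kept as text; assumption, see lead-in
    strand: str        # "+", "-", or "." when no strand is given


def parse_bed6_line(line):
    """Parse one tab-delimited BED6 line into a Bed6 record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected 6 BED columns, got {len(fields)}")
    chrom, start, end, name, score, strand = fields[:6]
    return Bed6(chrom, int(start), int(end), name, score, strand)


# Made-up record, for illustration only.
record = parse_bed6_line("chr1\t11868\t14409\texample_gene\t0\t+\n")
print(record.chrom, record.chrom_start, record.strand)
```

For the region filtering discussed in this thread, only `chrom` and `chrom_start` would be needed. The remaining columns are carried along so the file stays valid BED6.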
eQTM: expression quantitative trait methylation
Three eQTM mappings will be implemented:
The default ranges for the Cis-eQTM and Distal-eQTM are described above. Users will be able to set these values if desired.
The following annotation files have been created for the Illumina EPIC (methylation) and HT12 (gene expression) microarrays: