jianhong / ChIPpeakAnno

11 stars 4 forks source link

Questions on the how genomicElementDistribution works #15

Open Kyung-TaeLee opened 2 years ago

Kyung-TaeLee commented 2 years ago

hi @jianhong ,

First of all, thank you for providing such a nice tool! I am currently using the tool to analyze ATAC-seq and it works beautifully. I have some questions on the results from genomicElementDistribution.

In "genomicEelementDistribution", there are options named as "promoterRegion" and "geneDownstream". I wonder if promoterRegion is calculated based on the transcription start site (5'end of gene) of gene. Second, I wonder if either promoterRegion and geneDownstream option consider strand (+ or - strand) of gene.

so for example, let's say there is a gene name as "Tubulin" whose coordinates of start and end are 1000 and 2000 (start and end coordinates specified in GTF file), respectively, and strand is "-" and options for promoterRegion and geneDownstream are set as below

promoterRegion=c(upstream=500, downstream=500), geneDownstream=c(upstream=0, downstream=200))

If the strandness of gene is considered, promoter region would be 1500-2500 (5'end of Tubulin is 2000 considering that the strand is "-") and downstream region would be 800-1000 (3'end of Tubulin in 1000 considering that the strand is "-"). So it would be great if you can confirm that genomicElementDistribution consider the strandness of gene or not.

Lastly, as far as I understand, If the peaks overlap multiple features (exon, intron, etc), single feature is assigned to the peak based on the pre-determined priority. It would be great if you can tell me how the features are prioritized.

In my R code, I make txdb from gencode gene annotation using the makeTxdbFromGFF as shown below

R code

setwd(dir) library(ChIPpeakAnno) library(GenomicFeatures) gtf <- "gencode.v29.annotation.gtf" ## genocde annoataion txdb <- makeTxDbFromGFF('gencode.v29.annotation.gtf') test_peak_file <- "ATAC-seq_peaks.bed" test_peak <- toGRanges(test_peak_file, format="BED") out<- genomicElementDistribution(test_peak, TxDb = txdb, promoterRegion=c(upstream=5000, downstream=5000), geneDownstream=c(upstream=0, downstream=2000))

Please let me know if there is any flaw in the code. Thank you and have a nice day!

jianhong commented 2 years ago

Thank you for trying ChIPpeakAnno to annotate your data. Yes, the promoter region and downstream region will consider the strand information. The ignore.strand=TRUE is designed for the peaks to be annotated. About the priority, it is determined by the order of element in labels parameter. for example:

Exons=c(utr5="5' UTR", utr3="3' UTR", CDS="CDS", otherExon="Other exon") will behavior different from Exons=c(CDS="CDS", utr5="5' UTR", utr3="3' UTR", otherExon="Other exon")

And your codes looks good.