jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

Handle a subset of the reference genome during peak calling #90

Closed Simso86 closed 1 year ago

Simso86 commented 2 years ago

Hi,

I am working on an ATAC-seq data set (the same that I referred to in #89). I have mapped my reads to the human reference genome GRCh38 (without ALT contigs) using Bowtie2 with the following command:

bowtie2 -q -p 8 --local --very-sensitive -x GRCh38_noalt_as -1 sample_R1.fastq -2 "sample_R2.fastq -S sample.sam

Now the issue I have encountered is that, apart from the main chromosomes chr1-22, chrx, chrY and chrM, the GRCh38 genome also includes extra chromosomes related to unlocalized and unplaced sequences (with the "random" suffix and "chrUn" prefix, respectively). Many reads map to these extra chromsomes (in fact for some samples up to 75% of reads map here).

I have run Genrich on the name-sorted bam files using the following command (with triplicates):

Genrich -t sample1.bam,sample2.bam},sample3.bam -o sample.narrowPeak -f sample_pq.bdg -k sample_pileup.bdg -b sample.bed -r -j -q 0.05 -a 200 -v -e chrM,chrEBV -E hg38-blacklist.v2.bed.gz

This results in a good amount of called peaks (about 60k - 100k peaks) for all the samples. Also, while many reads is mapping to the extra chromsomes described above, only few called peaks are located on those chromosomes. However, I have been thinking that the presence of the "random" and "chrUn" reads may still skew the results a bit. They also makes the files much bigger and I am not really interested in looking at them anyway. Therefore, I descided to keep only the main chromosomes (excluding chrM) using samtools:

samtools view -@ 8 -b sample.bam chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY > sample_subset.bam

I then run Genrich on these new bam files and obtain again many peaks (about 60k - 130k). However, I realized that this probably will render the genome length calculation and background pileup wrong. If I run:

samtools idxstats sample_subset.bam

I can see that the "random" and "chrUn" are still there as well as their length, but of course with no reads. This is also true for the chrEBV. According to the Genrich README, as well as your answers in #29 and #1, I have understood that because the SAM header in the bam file still contains all chromosomes with their entire lengths, Genrich still includes them in the genome length calculation, which will probably overestimate the enrichment on my selected chromosomes.

Question 1: is my interpretation above correct? Question 2: Whould you recommend me to alter the SAM header, or do the subsetting in another way? Question 3-4: Should I avoid subsetting the bam file and instead use the -e parameter in Genrich to get rid of all these extra chromosomes? If that is the way to do it, is it possible to point to a text file containing all these chromosomes instead of adding them comma separated to the command? (they are quite many)

Additional features similar to that suggested in #40 or a parameter/flag that automatically includes only the "main" reference chromosomes whould be appreciated.

Thanks again for a very valuable tool. Best regards, /Simon

jsh58 commented 2 years ago

Thanks for the questions. Answers below.

Question 1: is my interpretation above correct?

yes

Question 2: Whould you recommend me to alter the SAM header, or do the subsetting in another way?

use the -e and -E parameters of Genrich

Question 3-4: Should I avoid subsetting the bam file and instead use the -e parameter in Genrich to get rid of all these extra chromosomes? If that is the way to do it, is it possible to point to a text file containing all these chromosomes instead of adding them comma separated to the command? (they are quite many)

If you really aren't interested in the extra contigs, the easiest solution would be to use a reference genome that doesn't have the extra contigs.