PCR duplicate removal - Githubissues

elenichri commented 4 years ago

I am trying to set a pipeline for ATAC-seq analysis in our lab, using Genrich. I would like to remove the duplicates that were detected by Genrich (-R option) from the .bam files that were generated by bowtie2 in the earlier step.. I need this information in order to count the features (peaks) and use later in differential peak calling. However, the generated by Genrich pcrDup file is not a bam file (as one that would be generated by picard...How can I remove the duplicates in my case?

jsh58 commented 4 years ago

Thanks for the question. This is addressed in the README, in the description of the -R <file> at the end of this section.

elenichri commented 4 years ago

Thank you very much! I had missed it..

elenichri commented 4 years ago

I am sorry, it does not seem to work for multiple alignment. I have used Genrich in ATAC-Seq mode with two input samples (./Genrich -t f1,f2 .... etc ...) . The produced pcrDup file lists first the pcr duplicates in f1 first and then in f2.

How can I provide multiple input files and specify that the pcrDup file concerns all these files when I call the getReads.py script?
Currently I give as input only one of the two .bam files that are created by the call to Genrich. Please note that these are Qname sorted, not index sorted (is this correct?): qsub -pe smp 4 python ./getReads.py my_bam_with_duplicates.bam no pcrDup my_bam_without_duplicates.bam

The output my_bam_without_duplicates.bam file is created, but I cannot read it with samtools view. I get the error bam_header_read] invalid BAM binary header (this is not a BAM file). [main_samview] fail to read the header from "my_bam_without_duplicates.bam".

Could the problem be in the pcrDup file input? Do I need to generate a headers.txt file using the findDups.py script?

In addition, I would like to ask, does the input to getReads.py need to be a .sam file or .bam is ok too?

jsh58 commented 4 years ago

The input to getReads.py is a SAM file. To remove reads from a BAM file, you should run something like this:

$ samtools view -h <inBAM> | python getReads.py - no pcrDup - | samtools view -b - > <outBAM>

That command has to be run separately for each input file.

elenichri commented 4 years ago

Thanks for the quick reply! So I don’t have to split the pcrDup file in as many samples as I run genrich on? And then run the above command for each sample?

jsh58 commented 4 years ago

You don't have to split the pcrDup file. Any read headers in the pcrDup file that aren't in the input SAM file are ignored.

elenichri commented 4 years ago

Thanks for your answer, I will try asap.

elenichri commented 4 years ago

It worked, thank you very much for your help! Just to note, I ran the above on the index-sorted bam file, not on the qname sorted bam file that I used as input to Genrich. Is it ok?

jsh58 commented 4 years ago

Should be fine. getReads.py doesn't care about sort order.

elenichri commented 4 years ago

Thanks a lot!

jsh58 / Genrich

PCR duplicate removal #47