Hi, there is a tool for marking duplicates. I hope to have that pushed to GitHub this week.
Unfortunately, I don't think we have anything for base quality recalibration right now.
I've landed "doppelmark", our duplicate-marking tool for PAM and BAM (https://github.com/grailbio/doppelmark). I have not tried to build it from the GitHub repo, but you're welcome to take a look.
Cool! Is it possible to split PAM shards onto different servers, run this tool on the available shards, and then reduce them back to a single PAM file in a mapreduce fashion?
Not out of the box. How big is your PAM file? doppelmark is pretty fast.
I have not converted it to PAM yet, but the BAM is over 300GB. I have a framework for moving shards around to different servers. If I take a PAM file and split it into chunks and run the tool on each chunk, and then use the PAM sort/merge tool as the reducer, would that work?
Sharding the file does not currently work because doppelmark expects to be able to resolve the read pairs within a single file.
I've run doppelmark on 300GB BAM files, and it took roughly an hour or less on a 40-core, 160GB machine. Do you have a large machine available?
Oh wow, great. Yes, I have a server this size. I'll give it a go. Thanks!
Actually, one last thing: I am curious about this statement:
"doppelmark expects to be able to resolve the read pairs within a single file"
I was under the impression that PAM shards keep pairs together with a padding strategy. Am I wrong in thinking this?
PAM sorts by position, so read pairs are not adjacent in the file.
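To illustrate why position sorting gets in the way of per-shard duplicate marking, here is a minimal Go sketch. It is a toy model only: the `read` type, the `splitMates` helper, and the fixed-size chunking are hypothetical and are not the PAM format or doppelmark's code. It sorts reads by position, cuts the list into contiguous chunks, and counts the pairs whose mates end up in different chunks.

```go
package main

import (
	"fmt"
	"sort"
)

// read is a toy stand-in for an alignment record (hypothetical, not the
// PAM or BAM record type).
type read struct {
	name string // name shared by both mates of a pair
	pos  int    // leftmost mapped position
}

// splitMates sorts reads by position in place, cuts them into contiguous
// shards of shardSize reads, and returns how many pairs have their two
// mates in different shards.
func splitMates(reads []read, shardSize int) int {
	sort.Slice(reads, func(i, j int) bool { return reads[i].pos < reads[j].pos })
	shardOf := map[string]int{} // shard index of the first mate seen per pair
	split := 0
	for i, r := range reads {
		shard := i / shardSize
		if prev, seen := shardOf[r.name]; seen {
			if prev != shard {
				split++
			}
		} else {
			shardOf[r.name] = shard
		}
	}
	return split
}

func main() {
	reads := []read{
		{"pairA", 100}, {"pairA", 350},
		{"pairB", 120}, {"pairB", 9000}, // discordant: mate maps far away
		{"pairC", 8900}, {"pairC", 9100},
	}
	// Number of pairs that a per-shard tool could not resolve on its own.
	fmt.Println(splitMates(reads, 3))
}
```

In this toy input, the pair whose mate maps far away is the one that gets separated, so a tool that sees only one chunk would be missing half of that pair.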
If you use doppelmark for your file, I would try it with --disk-mate-shards=512 or more since you might start running into memory problems depending on how many discordant reads you have.
OK, that makes sense. Thanks, I'll give it a try now.
I see there’s a markdup filter for PAM files. Is there a BQSR filter as well?
EDIT: Sorry, I misread the filter.go file. I meant to ask: there is some code for running filters over PAM files. Is there any code that uses filter.go for marking duplicates and base quality recalibration? Or is this outside the scope of the repo?