grailbio / bio

Bioinformatic infrastructure libraries
Apache License 2.0
74 stars, 16 forks

PAM bqsr filter #3

Closed: ghost closed this issue 4 years ago

ghost commented 4 years ago

I see there’s a markdup filter for PAM files. Is there a bqsr filter as well?

EDIT: Sorry, I misread the filter.go file. I meant to ask: there is some code for running filters over PAM files. Is there any code that uses filter.go for marking duplicates and base quality recalibration? Or is this out of scope for the repo?

yipal commented 4 years ago

Hi, there is a tool for marking duplicates. I hope to have that pushed to github this week.

Unfortunately, I don't think we have anything for base quality recal as of now.

yipal commented 4 years ago

I've landed "doppelmark", our duplicate-marking tool for PAM and BAM: https://github.com/grailbio/doppelmark. I have not tried to build it from the GitHub repo, but you're welcome to take a look.

ghost commented 4 years ago

Cool! Is it possible to split a PAM file's shards across different servers, run this tool on each shard, and then reduce them back to a single PAM file in a mapreduce fashion?

yipal commented 4 years ago

Not out of the box. How big is your PAM file? doppelmark is pretty fast.

ghost commented 4 years ago

I have not converted it to PAM yet, but the BAM is over 300GB. I have a framework for moving shards around to different servers. If I take a PAM file, split it into chunks, run the tool on each chunk, and then use the PAM sort/merge tool as the reducer, would that work?

yipal commented 4 years ago

Sharding the file does not currently work because doppelmark expects to be able to resolve the read pairs within a single file.

I've run doppelmark on 300GB BAM files, and it took perhaps an hour or less on a 40-core, 160GB machine. Do you have a large machine available?

ghost commented 4 years ago

Oh wow, great. Yes, I have a server of that size. I'll give it a go. Thanks!

ghost commented 4 years ago

Actually, one last thing: I am curious about

> doppelmark expects to be able to resolve the read pairs within a single file

I was under the impression that PAM shards keep pairs together with a padding strategy. Am I wrong in thinking this?

yipal commented 4 years ago

PAM sorts by position, so read pairs are not adjacent in the file.
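To make that concrete, here is a toy sketch of why position sorting separates mates. The read names and positions below are invented for illustration; this is not the actual PAM record layout.

```python
# Toy sketch (not PAM internals): position-sorted storage separates mates.
# Read names and mapping positions are invented for illustration.
reads = [
    ("pairA/1", 100),   # first mate of pair A
    ("pairA/2", 5000),  # its mate maps far downstream
    ("pairB/1", 150),
    ("pairB/2", 9000),
]

# Sorting by mapping position, as PAM does, interleaves the pairs:
by_position = sorted(reads, key=lambda rec: rec[1])
names = [name for name, _ in by_position]
print(names)  # pairA's two records are no longer adjacent
```

So a tool that needs both mates of a pair either keeps unmatched reads in memory until the mate shows up, or spills them to disk, which is what the flag below is for.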

If you use doppelmark on your file, I would try it with --disk-mate-shards=512 or more, since you might start running into memory problems depending on how many discordant reads you have.
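As a rough sketch, the invocation might look like the following. Only --disk-mate-shards comes from this thread; the input/output flag names here are placeholders, so check the doppelmark README for the real CLI.

```shell
# Hypothetical invocation: only --disk-mate-shards=512 is confirmed above;
# --input/--output are placeholder names, not doppelmark's actual flags.
doppelmark \
  --input input.pam \
  --output marked.pam \
  --disk-mate-shards=512
```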

ghost commented 4 years ago

Ok, that makes sense. Thanks, I'll give it a try now.