walaj / VariantBam

Filtering and profiling of next-generation sequencing data using region-specific rules

Add QCfail flag instead of removing reads for maximum coverage #9

Closed chapmanb closed 6 years ago

chapmanb commented 7 years ago

Jeremiah; Thank you for VariantBam, it's a really useful tool and we're looking to incorporate its maximum coverage filtering into bcbio (https://github.com/chapmanb/bcbio-nextgen). For variant and SV callers that don't downsample, avoiding reads at super high depth can provide huge speed improvements on whole genome runs. Collapsed repeats and single-nucleotide runs (all As, all Gs) can cause a lot of unnecessary compute in regions that won't yield useful information.

I'd like to integrate downsampling as an option but avoid throwing away the high-depth reads, so the final BAM file is still a full representation of the original fastqs. What would you think about making it possible to flag these reads with, say, the QC fail flag so they're still present in the BAM but will be ignored by variant callers? I'm trying to think of options that don't require passing around two BAM files (an original and a processed one). Thanks for considering this, or for suggesting any other options you think might work.
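
For context, the flag proposed here is the standard SAM/BAM QC-fail bit (0x200, BAM_FQCFAIL in htslib), which most variant callers already skip. A minimal sketch of the mechanism using htslib directly, not VariantBam's actual code, with placeholder filenames:

```cpp
#include <htslib/sam.h>

// Minimal illustration of the proposal: copy a BAM, setting the QC-fail bit
// (0x200, BAM_FQCFAIL) on every read instead of removing anything. A real
// filter would set the bit only on reads failing a coverage rule.
// "in.bam" and "out.bam" are placeholders.
int main() {
    samFile *in  = sam_open("in.bam", "rb");
    samFile *out = sam_open("out.bam", "wb");
    sam_hdr_t *hdr = sam_hdr_read(in);
    sam_hdr_write(out, hdr);

    bam1_t *b = bam_init1();
    while (sam_read1(in, hdr, b) >= 0) {
        b->core.flag |= BAM_FQCFAIL;   // mark; callers honoring the bit skip it
        sam_write1(out, hdr, b);       // the read stays in the BAM
    }

    bam_destroy1(b);
    sam_hdr_destroy(hdr);
    sam_close(in);
    sam_close(out);
    return 0;
}
```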

walaj commented 7 years ago

Hi Brad, Great news about wrapping into bcbio. This would definitely make this tool more accessible as part of larger pipelines.

I think your suggestion makes a lot of sense. I could add an option so that rather than discarding reads for any variant filter (coverage or otherwise), an alignment tag could be added. Another option would be to just flag them as QC fail, which would probably do the trick for most variant callers out of the box (rather than have them try and interpret a new alignment tag).
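
In raw htslib terms, the two marking options look something like the sketch below. This is illustrative only; the aux tag name "vf" is invented for the example, and VariantBam itself goes through SeqLib wrappers rather than these calls directly:

```cpp
#include <htslib/sam.h>
#include <stdint.h>

// The two marking options discussed above, in raw htslib terms (a sketch;
// the tag name "vf" is made up for illustration).
static void mark_failed_read(bam1_t *b, bool use_qcfail) {
    if (use_qcfail) {
        // Option 1: set the standard QC-fail bit -- most callers already
        // skip these reads with no changes needed.
        b->core.flag |= BAM_FQCFAIL;
    } else {
        // Option 2: attach a custom aux tag recording why the read failed;
        // callers would need to be taught to interpret it.
        int32_t reason = 1;  // e.g. 1 = exceeded max-coverage filter
        bam_aux_append(b, "vf", 'i', sizeof(reason), (uint8_t *)&reason);
    }
}
```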

I'll give this a try over the weekend (with either flag or tag as an option). Shouldn't be more than a few lines of code to add this feature.


chapmanb commented 7 years ago

Jeremiah; Brilliant, thanks so much for being willing to look at it. Flagging with QC fail would be perfect for my use since then we wouldn't need any downstream modifications to variant callers. Thanks again for taking this on.

walaj commented 7 years ago

Hi Brad, I just updated VariantBam to allow a -Q option, which sets the QC fail bit to 1 if a read doesn't pass the filters, rather than deleting it. I had to add a function to SeqLib as well, so you'll need to update the SeqLib submodule before rebuilding.
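
One quick way to sanity-check the new -Q behavior is to count reads with the QC-fail bit set in the output BAM. A small sketch using htslib (the filename is a placeholder):

```cpp
#include <htslib/sam.h>
#include <stdio.h>

// Count reads with the QC-fail bit set -- a quick check that -Q flagged
// reads rather than dropping them. "out.bam" is a placeholder.
int main() {
    samFile *in = sam_open("out.bam", "rb");
    sam_hdr_t *hdr = sam_hdr_read(in);
    bam1_t *b = bam_init1();

    long total = 0, qcfail = 0;
    while (sam_read1(in, hdr, b) >= 0) {
        total++;
        if (b->core.flag & BAM_FQCFAIL) qcfail++;
    }
    printf("%ld of %ld reads flagged QC-fail\n", qcfail, total);

    bam_destroy1(b);
    sam_hdr_destroy(hdr);
    sam_close(in);
    return 0;
}
```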

Let me know if you run into any issues with this. It works for me on a simple test case.

chapmanb commented 7 years ago

Jeremiah; Thanks so much for looking at this. It was working for standard filtering, but in my tests it did not appear to do the right thing for maximum coverage cutoffs: it didn't set QC fail on the affected reads, and it had a fall-through that appears to write duplicate records when downsampling. I sent a PR which I believe fixes the logic and appears to behave correctly in my tests. Thanks again for the work on this, and let me know if anything looks off with the PR.
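
In sketch form, the corrected logic described here makes each read take exactly one path, so a failing read is either flagged and written once or dropped, never written twice. This is an illustration of the described fix, not VariantBam's actual source:

```cpp
#include <htslib/sam.h>

// Sketch of the corrected per-read write path (illustrative, not VariantBam's
// actual code): a read takes exactly one branch, so a failing read is either
// flagged QC-fail and written once, or dropped -- removing the fall-through
// that produced duplicate records.
static void emit_read(samFile *out, sam_hdr_t *hdr, bam1_t *b,
                      bool passes_filters, bool qc_fail_mode) {
    if (passes_filters) {
        sam_write1(out, hdr, b);          // passing read: write unchanged
    } else if (qc_fail_mode) {            // the -Q behavior from this thread
        b->core.flag |= BAM_FQCFAIL;      // mark instead of delete
        sam_write1(out, hdr, b);          // write exactly once
    }
    // else: drop the read entirely (the original filtering behavior)
}
```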

walaj commented 7 years ago

Thanks, Brad, for both testing and fixing the issue. I see what you did, and the logic looks right to me now. Since it passes your coverage tests, I went ahead and merged it. I really appreciate the PR.


chapmanb commented 6 years ago

Jeremiah -- thanks again for merging this. I wanted to follow up and let you know we've integrated this into bcbio and now do max coverage downsampling as part of WGS alignments to avoid high-depth regions. I appreciate you adding this; it's helping avoid a lot of edge-case issues with repetitive regions.