lh3 / bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0
1.54k stars 557 forks source link

Unexpected peak in insert size histogram #273

Closed chassenr closed 4 years ago

chassenr commented 4 years ago

Hi @lh3 I have noticed some weird insert sizes generated by bwa mem (default settings) followed by samtools sort and stats, when I map my 2x150bp Illumina metagenome reads against the megahit assembly (or reference genomes). In the histogram of the insert sizes, I find the expected bell-shaped distribution (mode as expected based on bioanalyzer output), but there is a also a peak at an insert size of 20bp, which should not occur, and a discrete jump in the frequency distribution at an insert size of 90 (which I assume is the realistic minimum insert size). Do you have an idea what could have caused this peak? Could these be false hits? Should I change the scoring parameters (if yes, any suggestion how)? Do you know, if and how these unexpected hits influence downstream analyses, e.g. binning? Should I remove these mappings from the sam file prior to further analysis steps (if yes, can you suggest a way to do so)?

Sorry for the long list of questions and thank you very much!

Cheers, Christiane

PS: Happy to provide samtools stats output if required.

yfarjoun commented 4 years ago

I suspect that this peak is due to bacterial contamination.

The "bacterial" reads can be identified by the tell-tall sign of both reads being soft-clipped from both ends leaving the same 20-23bp aligned. this normally happens in giant "pileups" that can throw-off variant calling and some metric collection.

What we do to alleviate it is to use Picard's MergeBamAlignment after calling BWA. MergeBamAlignment has several options regarding how to deal with reads that are suspected to come from bacterial origin.

Also if you want to try to identify your contamination, here's a plug: https://gatkforums.broadinstitute.org/gatk/discussion/23205/cross-species-contamination-identification-with-pathseq (full disclosure: I wrote that blogpost)

Cheers!

On Tue, Mar 10, 2020 at 6:36 PM Christiane Hassenrück < notifications@github.com> wrote:

Hi @lh3 https://github.com/lh3 I have noticed some weird insert sizes generated by bwa mem (default settings) followed by samtools sort and stats, when I map my 2x150bp Illumina metagenome reads against the megahit assembly (or reference genomes). In the histogram of the insert sizes, I find the expected bell-shaped distribution (mode as expected based on bioanalyzer output), but there is a also a peak at an insert size of 20bp, which should not occur, and a discrete jump in the frequency distribution at an insert size of 90 (which I assume is the realistic minimum insert size). Do you have an idea what could have caused this peak? Could these be false hits? Should I change the scoring parameters (if yes, any suggestion how)? Do you know, if and how these unexpected hits influence downstream analyses, e.g. binning? Should I remove these mappings from the sam file prior to further analysis steps (if yes, can you suggest a way to do so)?

Sorry for the long list of questions and thank you very much!

Cheers, Christiane

PS: Happy to provide samtools stats output if required.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lh3/bwa/issues/273?email_source=notifications&email_token=AAU6JUTM4C6BYTVZWJDTTA3RGZUB3A5CNFSM4LFDCO52YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IT6OTIA, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAU6JUVSVDON4BNQOZADU53RGZUB3ANCNFSM4LFDCO5Q .

chassenr commented 4 years ago

Thanks @yfarjoun . However, I am actually using bwa mem in a workflow to bin prokaryotic genomes. The metagenomes are from bacterioplankton samples with >70% bacterial reads. I started using bwa mem because it is implemented in metaWRAP, which I used for some steps in the analysis. Is bwa mem not suitable to map bacterial reads? I was not aware of this.

yfarjoun commented 4 years ago

could it be that there's contamination by a very different bacteria in your sample? it would result in the same thing....

On Tue, Mar 10, 2020 at 7:43 PM Christiane Hassenrück < notifications@github.com> wrote:

Thanks @yfarjoun https://github.com/yfarjoun . However, I am actually using bwa mem in a workflow to bin prokaryotic genomes. The metagenomes are from bacterioplankton samples with >70% bacterial reads. I started using bwa mem because it is implemented in metaWRAP https://github.com/bxlab/metaWRAP, which I used for some steps in the analysis. Is bwa mem not suitable to map bacterial reads? I was not aware of this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lh3/bwa/issues/273?email_source=notifications&email_token=AAU6JUTCSVWWLRSHTBU5PT3RGZ33JA5CNFSM4LFDCO52YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEOMOAOQ#issuecomment-597221434, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAU6JUTLXPEN7HYCQL3JYXLRGZ33JANCNFSM4LFDCO5Q .

chassenr commented 4 years ago

As I am working with metagenomes from environmental samples, I expect a high diversity (>2000 OTUs based on dada2-processed 16S amplicons). So you are right, there will be some very different bacteria in the sample (but those won't be a contamination). Thanks for the explanation.

chassenr commented 4 years ago

As I cannot get rid of the sequences that cause the peak at 20bp insert size before the mapping, I was wondering if you knew how this may affect downstream processing, specifically differential coverage binning? I check the MAPQ and sam FLAGs and noticed that (i) also proper pairs have an insert size of ~20bp and (ii) while the MAPQ is generally much lower for the pairs which have insert sizes of ~20bp, there are also some with high MAPQ. Therefore, even if only proper pairs are used in downstream analysis, this would not exclude the weird mappings. (If this question is better suited to the issue tracker of metabat2/concoct/etc. I will post the question there and link this issue.)

Thanks!

yfarjoun commented 4 years ago

I don't think that the presence of these reads should affect the mapping of the other reads much, so you can just filter them out after mapping...for example by using MergeBamAlignment...but you can roll your own.

On Wed, Mar 11, 2020 at 11:08 AM Christiane Hassenrück < notifications@github.com> wrote:

As I cannot get rid of the sequences that cause the peak at 20bp insert size before the mapping, I was wondering if you knew how this may affect downstream processing, specifically differential coverage binning? I check the MAPQ and sam FLAGs and noticed that (i) also proper pairs have an insert size of ~20bp and (ii) while the MAPQ is generally much lower for the pairs which have insert sizes of ~20bp, there are also some with high MAPQ. Therefore, even if only proper pairs are used in downstream analysis, this would not exclude the weird mappings. (If this question is better suited to the issue tracker of metabat2/concoct/etc. I will post the question there and link this issue.)

Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lh3/bwa/issues/273?email_source=notifications&email_token=AAU6JUSRJLANBDGBSYN2A63RG5IHZA5CNFSM4LFDCO52YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEOOXAWY#issuecomment-597520475, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAU6JUVNPIGRQ5KJCY2HRODRG5IHZANCNFSM4LFDCO5Q .