lh3 / bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0
1.55k stars 556 forks source link

explanation of insert sizes from BWA-MEM #113

Open roblanf opened 7 years ago

roblanf commented 7 years ago

I have noticed that insert sizes from BWA-MEM seem to start high at the beginning of a chromosome, and drop down towards the end. This isn't really an issue (feel free to close and ignore) but I wanted to put it out here to see if anyone has a good explanation.

I'm mapping 100bp PE Illumina data to a reference genome from another species. The reference (Eucalyptus grandis) is ~5% diverged from the study species. It has 11 good chromosomes, and a bunch of rubbish scaffolds. Gene order etc. is well conserved between the two.

I'm running BWA-MEM and looking at the output with Qualimap like this:

bwa mem $ref $R1 $R2 -t $threads > out.sam
qualimap bamqc -bam out.bam -outdir $outBWAMEM"qualimap_all/" -nt $threads -c

What I don't understand is the plot of insert sizes. My expectation is that most insert sizes will be ~200bp. But what I observe from BWA-MEM looks very different.

genome_insert_size_across_reference

nevertheless, the insert-size histogram looks fine, and looking at the mapping in IGV looks fine too.

genome_insert_size_histogram

So, it looks like the start of chromosomes must contain a small number of pairs with enormous insert sizes. This result is not shared by other mappers like NextGenMap, bbmap, or Stampy (all run with default params). They all look more like the plot below (this from NextGenMap): genome_insert_size_across_reference

I'm struggling to understand the difference.

lh3 commented 7 years ago

I always run htsbox bamshuf or samtools collate if the input reads come from a coordinate-sorted BAM. Everyone should do this, or provide the insert size distribution to bwa-mem.

roblanf commented 7 years ago

Any chance you could explain how this would affect the insert size distribution, and why you think the insert size distribution looks so odd before doing this?

My understanding is that both of these things move reads around in the BAM file, but that neither should affect the distribution of insert sizes, which are calculated from the paired reads themselves.

I tried samtools collate but Qualimap insists on a sorted bam (which I just do with samtools sort usually).

lh3 commented 7 years ago

I wasn't clear: when you generate fastq, you should not generate from a sorted bam (actually don't use Picard tools to generate fastq!). samtools collate should be run before generating fastq.

maarten-k commented 7 years ago

(actually don't use Picard tools to generate fastq!).

Excuse my ignorance, but why is picard bam2fastq a bad idea? I use this (after picard RevertSam SORT_ORDER=queryname|picard MarkIlluminaAdapters) as well as the Broad WDL realingment pipeline do use

lh3 commented 7 years ago

why is picard bam2fastq a bad idea

Because 1) many naively use it to generate fastq from sorted bam, which is impropriate for bwa; 2) if you sort by query name first, it is slow. Samtools collate+fastq is times faster and the preferred choice.

ekg commented 7 years ago

Thanks, this was very helpful to know.

On Tue, Mar 7, 2017 at 3:36 PM Heng Li notifications@github.com wrote:

why is picard bam2fastq a bad idea

Because 1) many naively use it to generate fastq from sorted bam, which is impropriate for bwa; 2) if you sort by query name first, it is slow. Samtools collate+fastq is times faster and the preferred choice.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lh3/bwa/issues/113#issuecomment-284738631, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4ESFJWUAvlrdB7KMtpneLYRro2HqIks5rjWuDgaJpZM4MU9jH .

roblanf commented 7 years ago

OK, now I'm totally lost! The fastq's I'm using here come straight from a hiseq, with no pre-processing at all.

All I'm doing is this:

  1. Map forward and reverse fastq files with BWA-MEM
  2. SAM to BAM with samtools
  3. Sort and index BAM with samtools
  4. Qualimap to generate plots of insert sizes

Maybe there's something fundamental that I'm missing here, but my impression is that Qualimap is just reading off the positions of the two reads from the BAM produced by BWA-MEM.

lh3 commented 7 years ago

Ok. My previous explanation was wrong then. Do you have the bwa-mem stderr output?

lh3 commented 7 years ago

A wild guess. Once two reads in a pair are too far apart (around 500-600 according to your plot, but you should check the bwa-mem stderr log for sure), bwa-mem will treat them as independent reads - no read pair constraints are applied at all. If one of the two reads is a repeat, it could get mapped everywhere in the genome. Other mappers may have different behaviors. For example, they could try to place the two reads as close as possible. Or if you use a large max insert size, e.g. 1000, in these other mappers, the practical effect will be similar.

roblanf commented 7 years ago

This sounds pretty reasonable. My guess so far (I'll need to do some digging to check) is that these things are happening:

  1. My reference has more repeats at the start of the chromosomes
  2. BWA is mapping as you say
  3. Qualimap (which I'm using to make the plots) is perhaps ignoring flags in the BAM that BWA puts in to tell it which reads were mapped without constraints

If these things are all happening, I think it would explain it. If Qualimap is treating flags from other mappers properly (i.e. not using insert sizes on reads that were mapped without constraints), then that would explain the apparent differences.

Looking at the data in IGV, I can't see anything particularly different about the mapping from BWA.

kemin711 commented 5 years ago

For my data, if --R1--> <--R2-- are mapped like this on the reference (+/-) I understand how the insert size is calculated from the left most of R1 to the right most of R2. When <--R1-- <--R2-- are both on the negative strand of the reference the insert size is the left point of R2 - left point of R1 + 1. This is an IMPROPERLY mapped pair. Will it be better to calculate it from the right most of R2 to the right most of R1. This would reflect the boundary of the reads (starting point of both reads). As an example, I have pos 331471, mate pos 243213810, insert size 242882340. The reason to use the start instead of the end of both read1 and read2 is that, the end is not a fixed point with regards to the sequencer. For the same molecule, the sequencer may stop at 151 for one PCR copy, for the next, it may stop at 148 due to some reason (it will usually give you 151 reads, but the last 3 may be garbage). If you use the starting point I can figure out that they belong to the same molecule. But if you use the end point, then I am not that sure. Of course, the starting point may also be fuzzy, but at least it is more sure than the ending point.

vinod1981 commented 4 years ago

@roblanf did you get an answer of this weird distribution which I am also getting alike. I can't even guess whta is happening here. Thanks,

roblanf commented 4 years ago

@vinod1981 nope. I still have absolutely no idea what was happening. Out of interest, what is the genome like for your species? One thing that strikes me is that my genome is fairly heterozygous (~3%) and repeat rich (~40%). I wonder particularly if the latter might be part of it - I never thought at the time to check where the reads with large insert sizes were mapping, but if the two reads were mapping to repeats and there was some kind of bias involved in what one would have expected to be random placement, then maybe this could help explain it?

rturba commented 3 years ago

I'm also seeing the same pattern with my genomes and they are all mapped to the same species. Also started from fastq files and I have been following the GATK best practices. Running qualimap after marking duplicates.

fritzsedlazeck commented 3 years ago

Here is a wild guess.. Does it correlate with repeats? I noticed that BWA mem increases MQ if the paired end reads are within the expected distance. Thus maybe pulling pairs inside of repeats together... dont know need to think about it.. Does it hold up if you filter for MQ ?

fritzsedlazeck commented 3 years ago

Nah doesn't make much sense. I would assume that Chr1 has more repeats...

rturba commented 3 years ago

I don't see this pattern when I map my genomes using bowtie2...

plhm commented 3 years ago

To add to this dicussion, I see the exact same thing aligning reads to the Anolis carolinensis reference genome. When using Stampy or NextGenMap this effect goes away. I believe this is in part the reason why when I run samtools flagstat I see a low percentage of properly paired reads. When bwa does it's expected insert size calculation it becomes heavily biased by outliers, which leads the software to think that the expected distance between reads is orders of magnitude larger than what it is supposed to be, leading it to flag a large number of reads as improperly paired.