lh3 / bwa

Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0
1.54k stars 557 forks source link

reproducible bug: number of read mapped is proportional to number of cores #351

Open Sanrrone opened 2 years ago

Sanrrone commented 2 years ago

Dears, I found that bwa mem aligns more reads using more cores when an organism is in low abundance. To reproduce this behaviour to a minimal example:

in theory only 30 reads should be reported as mapped reads due they are simulated from the same reference sequence. However, where using samtools view -Sh -q 50 -F 0x8 -f 0x2 -c ${ncpus}c.sam I got:

1 cores: 22
2 cores: 26
3 cores: 30 

by checking the sam file for unmapped reads difference (26 reads with 2 core vs 30 reads with using 3 cores) I noticed that reads were marked with flags 97 and 145 which means read R2 is mapped on the reverse strand in the reference and that's why they were not properly paired. But, when using 3 cores, the same reads were properly aligned.

#sam file of run with 2 cores focusing on read id NC_001836.1-254
NC_001836.1-254 97      NC_001836.1     14694   60      150M    =       14803   259     ATAGGAGAGACAAAATTAAAAGGAGATCCAATTGAAATAAAACAGATCACAAATCAAATAGCTATAGACAATTCACCTGTCGAGTCGAAACCCAACTATGTCTGGATATTTGAATATGCTAAAGATTGTGATCAGGTTATAACAACGATT  CCCGGGGGGGGG1GGGJGJGJJJGJJ8GJJGJ(GJJJJGJCJJCJJGGCJGGG=JCJJGGGGGGJ=CGJG=C=GCGGGCGGGCGGG1GGGGG=GCGGGGG=CGCCGGGGCCGGGGGGGCG=GGCCC=GGGCGCG=GG=CGCGG=GGGCCG  NM:i:0  MD:Z:150        MC:Z:150M       AS:i:150        XS:i:0  RG:Z:lchv1
NC_001836.1-254 145     NC_001836.1     14803   60      150M    =       14694   -259    TTGAATATGCTAAAGATTGTGATCAGGTTATAACAACGATTGCAAAAGGTTTGCAAACATCTATTACCGATGCTAAATTAGTTGTATTCCAAATGGCGATTTGTTGTGGTACTTCAATTGAATCTGTTCATGATAAACATACATCGTTAG  CGGCCG8GGCGG=CCCGGG=GCGG=CGGGG=CGGCGCGCCCC=JCJCJGGCC=GGCGGGCGC=GGCGGGGCCJCGGJGGGGGGJGCGG=G=JGJGJGGGGJJJGG=1GJJGJGGGJC8JGJGJJCJGGJJJJJJJ1GGG=GGGGCGGCCC  NM:i:0  MD:Z:150        MC:Z:150M       AS:i:150        XS:i:0  RG:Z:lchv1
#sam file of run with 3 cores focusing on read id NC_001836.1-254
NC_001836.1-254 99      NC_001836.1     14694   60      150M    =       14803   259     ATAGGAGAGACAAAATTAAAAGGAGATCCAATTGAAATAAAACAGATCACAAATCAAATAGCTATAGACAATTCACCTGTCGAGTCGAAACCCAACTATGTCTGGATATTTGAATATGCTAAAGATTGTGATCAGGTTATAACAACGATT  CCCGGGGGGGGG1GGGJGJGJJJGJJ8GJJGJ(GJJJJGJCJJCJJGGCJGGG=JCJJGGGGGGJ=CGJG=C=GCGGGCGGGCGGG1GGGGG=GCGGGGG=CGCCGGGGCCGGGGGGGCG=GGCCC=GGGCGCG=GG=CGCGG=GGGCCG  NM:i:0  MD:Z:150        MC:Z:150M       AS:i:150        XS:i:0  RG:Z:lchv1
NC_001836.1-254 147     NC_001836.1     14803   60      150M    =       14694   -259    TTGAATATGCTAAAGATTGTGATCAGGTTATAACAACGATTGCAAAAGGTTTGCAAACATCTATTACCGATGCTAAATTAGTTGTATTCCAAATGGCGATTTGTTGTGGTACTTCAATTGAATCTGTTCATGATAAACATACATCGTTAG  CGGCCG8GGCGG=CCCGGG=GCGG=CGGGG=CGGCGCGCCCC=JCJCJGGCC=GGCGGGCGC=GGCGGGGCCJCGGJGGGGGGJGCGG=G=JGJGJGGGGJJJGG=1GJJGJGGGJC8JGJGJJCJGGJJJJJJJ1GGG=GGGGCGGCCC  NM:i:0  MD:Z:150        MC:Z:150M       AS:i:150        XS:i:0  RG:Z:lchv1

both sam files showed the same information varying only in the flag for these incorrectly marked flag(?) depending on how many cores were used. Finally, I used version 0.7.17-r1188 and 0.7.17-r1198-dirty

could you please deal with this issue?

Regards, Sandro test_1.fq.gz test_2.fq.gz lchv1.zip