Dears,
I found that bwa mem aligns more reads using more cores when an organism is in low abundance.
To reproduce this behaviour to a minimal example:
I created a fastq file consisting in 100,000 reads of Ecoli (acc: NC_000913.3), some of them randomly mutated.
I added 15 reads from Little cherry virus 1 (acc: NC_001836.1) as paired end reads (30 reads in total).
I used Little cherry virus genome as reference (attached lchv1.zip)
the command used: bwa mem -Y -M -R "@RG\tID:test\tSM:test\tPL:ILLUMINA\tPG:bwa" -t ${ncpus} lchv1.fna test_1.fq.gz test_2.fq.gz > ${ncpus}c.sam where ${ncpus} ranged from 1 to 3.
in theory only 30 reads should be reported as mapped reads due they are simulated from the same reference sequence. However, where using samtools view -Sh -q 50 -F 0x8 -f 0x2 -c ${ncpus}c.sam I got:
1 cores: 22
2 cores: 26
3 cores: 30
by checking the sam file for unmapped reads difference (26 reads with 2 core vs 30 reads with using 3 cores) I noticed that reads were marked with flags 97 and 145 which means read R2 is mapped on the reverse strand in the reference and that's why they were not properly paired. But, when using 3 cores, the same reads were properly aligned.
#sam file of run with 2 cores focusing on read id NC_001836.1-254
NC_001836.1-254 97 NC_001836.1 14694 60 150M = 14803 259 ATAGGAGAGACAAAATTAAAAGGAGATCCAATTGAAATAAAACAGATCACAAATCAAATAGCTATAGACAATTCACCTGTCGAGTCGAAACCCAACTATGTCTGGATATTTGAATATGCTAAAGATTGTGATCAGGTTATAACAACGATT CCCGGGGGGGGG1GGGJGJGJJJGJJ8GJJGJ(GJJJJGJCJJCJJGGCJGGG=JCJJGGGGGGJ=CGJG=C=GCGGGCGGGCGGG1GGGGG=GCGGGGG=CGCCGGGGCCGGGGGGGCG=GGCCC=GGGCGCG=GG=CGCGG=GGGCCG NM:i:0 MD:Z:150 MC:Z:150M AS:i:150 XS:i:0 RG:Z:lchv1
NC_001836.1-254 145 NC_001836.1 14803 60 150M = 14694 -259 TTGAATATGCTAAAGATTGTGATCAGGTTATAACAACGATTGCAAAAGGTTTGCAAACATCTATTACCGATGCTAAATTAGTTGTATTCCAAATGGCGATTTGTTGTGGTACTTCAATTGAATCTGTTCATGATAAACATACATCGTTAG CGGCCG8GGCGG=CCCGGG=GCGG=CGGGG=CGGCGCGCCCC=JCJCJGGCC=GGCGGGCGC=GGCGGGGCCJCGGJGGGGGGJGCGG=G=JGJGJGGGGJJJGG=1GJJGJGGGJC8JGJGJJCJGGJJJJJJJ1GGG=GGGGCGGCCC NM:i:0 MD:Z:150 MC:Z:150M AS:i:150 XS:i:0 RG:Z:lchv1
#sam file of run with 3 cores focusing on read id NC_001836.1-254
NC_001836.1-254 99 NC_001836.1 14694 60 150M = 14803 259 ATAGGAGAGACAAAATTAAAAGGAGATCCAATTGAAATAAAACAGATCACAAATCAAATAGCTATAGACAATTCACCTGTCGAGTCGAAACCCAACTATGTCTGGATATTTGAATATGCTAAAGATTGTGATCAGGTTATAACAACGATT CCCGGGGGGGGG1GGGJGJGJJJGJJ8GJJGJ(GJJJJGJCJJCJJGGCJGGG=JCJJGGGGGGJ=CGJG=C=GCGGGCGGGCGGG1GGGGG=GCGGGGG=CGCCGGGGCCGGGGGGGCG=GGCCC=GGGCGCG=GG=CGCGG=GGGCCG NM:i:0 MD:Z:150 MC:Z:150M AS:i:150 XS:i:0 RG:Z:lchv1
NC_001836.1-254 147 NC_001836.1 14803 60 150M = 14694 -259 TTGAATATGCTAAAGATTGTGATCAGGTTATAACAACGATTGCAAAAGGTTTGCAAACATCTATTACCGATGCTAAATTAGTTGTATTCCAAATGGCGATTTGTTGTGGTACTTCAATTGAATCTGTTCATGATAAACATACATCGTTAG CGGCCG8GGCGG=CCCGGG=GCGG=CGGGG=CGGCGCGCCCC=JCJCJGGCC=GGCGGGCGC=GGCGGGGCCJCGGJGGGGGGJGCGG=G=JGJGJGGGGJJJGG=1GJJGJGGGJC8JGJGJJCJGGJJJJJJJ1GGG=GGGGCGGCCC NM:i:0 MD:Z:150 MC:Z:150M AS:i:150 XS:i:0 RG:Z:lchv1
both sam files showed the same information varying only in the flag for these incorrectly marked flag(?) depending on how many cores were used.
Finally, I used version 0.7.17-r1188 and 0.7.17-r1198-dirty
Dears, I found that bwa mem aligns more reads using more cores when an organism is in low abundance. To reproduce this behaviour to a minimal example:
15
reads from Little cherry virus 1 (acc: NC_001836.1) as paired end reads (30 reads in total).bwa mem -Y -M -R "@RG\tID:test\tSM:test\tPL:ILLUMINA\tPG:bwa" -t ${ncpus} lchv1.fna test_1.fq.gz test_2.fq.gz > ${ncpus}c.sam
where ${ncpus} ranged from 1 to 3.in theory only
30
reads should be reported as mapped reads due they are simulated from the same reference sequence. However, where usingsamtools view -Sh -q 50 -F 0x8 -f 0x2 -c ${ncpus}c.sam
I got:by checking the sam file for unmapped reads difference (26 reads with 2 core vs 30 reads with using 3 cores) I noticed that reads were marked with flags
97
and145
which means read R2 is mapped on the reverse strand in the reference and that's why they were not properly paired. But, when using 3 cores, the same reads were properly aligned.both sam files showed the same information varying only in the flag for these incorrectly marked flag(?) depending on how many cores were used. Finally, I used version
0.7.17-r1188
and0.7.17-r1198-dirty
could you please deal with this issue?
Regards, Sandro test_1.fq.gz test_2.fq.gz lchv1.zip