statgen / bamUtil

http://genome.sph.umich.edu/wiki/BamUtil
89 stars 30 forks source link

`bam diff` seems to return the reads that look identical between A and B #59

Open hisplan opened 4 years ago

hisplan commented 4 years ago

Hi,

I have two BAM files that I'd like to compare. Each is about 5.6GB. I expect them to be identical (I'm sort of doing a reproducibility test).

When I ran with the following command:

bam diff --in1 a.bam --in2 b.bam --all --onlyDiffs --recPoolSize -1 --out c.bam

It generated three files:

-rw-r--r-- 1    2373574 Jun 13 15:12 c.bam
-rw-r--r-- 1    1803478 Jun 13 15:12 c_only1_a.bam
-rw-r--r-- 1    1803105 Jun 13 15:12 c_only2_b.bam

I tried to see what actually differs between the two, but I think they look identical. My suspicion is maybe something to do with the muti-mapped reads. Do you have any idea how to resolve this?

samtools view c.bam | head -n1
:TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792  0   1   629088  1   46M1I44M    *   0   0   GTCCGAACTAGTATCAGGCTTCAAAATCGAATACGCCGCAGGCCCCCTTCGCCCTATTCTTCATAGCAGAATACACAAACATTATTATAAT ,:FF,FFFFFF:,,FF:FFFFFFF,F::FFF,FF,FF:FFFF:FFF,:F:F,FFFFFF,F,FF:FFF,FFF:FFFFFFFFFFFFF:FFFFF AS:i:78 HI:i:4  NH:i:4  nM:i:3  ZC:Z:42M1I48M   ZT:Z:AS:i:78;HI:i:3;NH:i:4;nM:i:3
$ samtools view a.bam | grep -F ":TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792"
:TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792  0   1   629088  1   42M1I48M    *   0   0   GTCCGAACTAGTATCAGGCTTCAAAATCGAATACGCCGCAGGCCCCCTTCGCCCTATTCTTCATAGCAGAATACACAAACATTATTATAAT ,:FF,FFFFFF:,,FF:FFFFFFF,F::FFF,FF,FF:FFFF:FFF,:F:F,FFFFFF,F,FF:FFF,FFF:FFFFFFFFFFFFF:FFFFF NH:i:4  HI:i:3  AS:i:78 nM:i:3
:TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792  0   1   629088  1   46M1I44M    *   0   0   GTCCGAACTAGTATCAGGCTTCAAAATCGAATACGCCGCAGGCCCCCTTCGCCCTATTCTTCATAGCAGAATACACAAACATTATTATAAT ,:FF,FFFFFF:,,FF:FFFFFFF,F::FFF,FF,FF:FFFF:FFF,:F:F,FFFFFF,F,FF:FFF,FFF:FFFFFFFFFFFFF:FFFFF NH:i:4  HI:i:4  AS:i:78 nM:i:3
$ samtools view b.bam | grep -F ":TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792"
:TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792  0   1   629088  1   42M1I48M    *   0   0   GTCCGAACTAGTATCAGGCTTCAAAATCGAATACGCCGCAGGCCCCCTTCGCCCTATTCTTCATAGCAGAATACACAAACATTATTATAAT ,:FF,FFFFFF:,,FF:FFFFFFF,F::FFF,FF,FF:FFFF:FFF,:F:F,FFFFFF,F,FF:FFF,FFF:FFFFFFFFFFFFF:FFFFF NH:i:4  HI:i:3  AS:i:78 nM:i:3
:TAGATCGCAAATGGTA:CTTATCAAGCGC:;A00228:279:HFWFVDMXX:1:1229:8477:28792  0   1   629088  1   46M1I44M    *   0   0   GTCCGAACTAGTATCAGGCTTCAAAATCGAATACGCCGCAGGCCCCCTTCGCCCTATTCTTCATAGCAGAATACACAAACATTATTATAAT ,:FF,FFFFFF:,,FF:FFFFFFF,F::FFF,FF,FF:FFFF:FFF,:F:F,FFFFFF,F,FF:FFF,FFF:FFFFFFFFFFFFF:FFFFF NH:i:4  HI:i:4  AS:i:78 nM:i:3
juanjocss commented 5 months ago

I don't know if this was resolved at any point, but I was looking at a similar question on my own.

I think this is a little bug on the diff tool on the multi-mapped reads as you guessed. Looking at the code https://github.com/statgen/bamUtil/blob/017721cc07948558395e4934ec10d0f91407c5eb/src/Diff.cpp#L577 you get that the "read from file 1" is tagged with the cigar flag from the mismatching "read from file 2" in the output (plus other annotations from the remaining columns).

i.e. following your example c.bam is telling you that the read with cigar 46M1I44M from a.bam is mismatched to read with cigar 42M1I48M from b.bam. However, after inspecting a.bam and b.bam you can see that both files have the pair of reads with the 42M1I48M and 46M1I44M cigars.

If c.bam is small enough, you could brutishly grep on the extracted sam files to clean it out. e.g. First, cut the left most 15 columns in c.sam and grep it on b.sam

$ samtools view b.bam > b.sam
$ samtools view c.bam | cut -d $'\t' -f1-15 > c-query.sam
$ grep -F -v -f c-query.sam b.sam > c.clean.sam