isovic / graphmap

GraphMap - A highly sensitive and accurate mapper for long, error-prone reads http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/graphmap2
MIT License
178 stars 44 forks source link

MD:Z SAM file tag error: Reference sequence cross-contamination #90

Open meyermj opened 6 years ago

meyermj commented 6 years ago

This bug is in VN:v0.5.2 compiled on Nov 15 2017 at 11:37:33.

The MD:Z tag in the SAM output can be a very helpful supplement to the CIGAR string field because it details which bases have been inserted, deleted, and mismatched, whereas the CIGAR string only details the positions of indels. Because of the extra information supplied in the MD:Z field, one can re-create full alignments of reads to reference from the SAM file only (without having to re-read sequences from the reference genome). I however, I have noticed some irregularities in the MD:Z field produced by graphmap. Below I will detail how to reproduce the error I'm seeing.

I created a reference file of two unrelated, random sequences.

>ref0
TACGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTACGATCGATCGATCGATCGTTATTCGAGCGCGCGCGAAACGTATCGATCGTGCGCATCGTAGCTAGCTAGCTACGTACGTAC
>ref1
CGATGCTAGCTGATCGATCAAATTCACGTAGCTAGCTAGCTAGCTACTCGTAGCTACTGACGACTGACGTACTGACTGATCGATCGATCGAGCATCGATCGATCGATCGATCGATCGAC

I then created a single simulated read from ref1, with a single deletion of the 5th base (G).

>myRead
CGATCTAGCTGATCGATCAAATTCACGTAGCTAGCTAGCTAGCTACTCGTAGCTACTGACGACTGACGTACTGACTGATCGATCGATCGAGCATCGATCGATCGATCGATCGATCGAC

I used the following command to align myRead to the reference file containing both ref1 and ref2.

graphmap align -r refs.fasta -d read.fasta -o out.sam

Here is the SAM output:

myRead  0   ref1    1   40  4M1D114M    *   0   0   CGATCTAGCTGATCGATCAAATTCACGTAGCTAGCTAGCTAGCTACTCGTAGCTACTGACGACTGACGTACTGACTGATCGATCGATCGAGCATCGATCGATCGATCGATCGATCGAC  *   MD:Z:4^T114 NM:i:1  AS:i:582    H0:i:1  ZE:f:1.20556e-38    ZF:f:0.827464   ZQ:i:118    ZR:i:119

Of importance, you'll notice that the CIGAR string is correct, and the MD:Z field has the correct base counts (they both indicate 4 matches, 1 deletion, 114 matches), however, the base indicated as deleted in the MD:Z field is T, when it should be G. Coincidentally, T happens to be the 5th base of ref0, which made me suspect that the MD:Z is being constructed using the wrong reference. And in-fact, when I change the 5th base of ref0 to A, the MD:Z string changes to MD:Z:4^A114 to match ref0 (even though the read is mapped to ref1). Also, if I completely eliminate ref0 from my library, the correct MD:Z string is produced. Again, the composition or presence of ref0 in my library should have no bearing on the MD:Z field for my read, as it mapped to ref1. So, it appears that graphmap is using the wrong reference sequences to determine the bases that have been deleted for the MD:Z field.