vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

Generating 'MD' tag for vg BAM files #2018

Open prithikasritharan opened 5 years ago

prithikasritharan commented 5 years ago

I'm currently looking at the differences in the read alignments produced by vg and bwa. As the CIGAR strings don’t discriminate between matches and mismatches, I’ve written a Python tool which takes a CIGAR string and the mismatched bases from the MD tag and merges them into a string that precisely reflects the alignment, I call this a CIGAR2 string.

I can produce CIGAR2 strings for all bwa mapped reads but I can’t do the same for all of my vg mapped reads as the vg BAM files (produced with the vg map --surject-to bam parameter) don't contain the MD tag information. I'm able to generate MD tags for reads mapped against “flat” variation graphs that have just a single reference genome (using the samtools calmd function) but not for more complex, pan-genome variation graphs as calling samtools calmd would generate matches and mismatches for the MD tag against the reference sequence, rather than the path through the variation graph that each read aligned to. Would it be possible to implement a function in vg, similar to the calmd function in samtools, that can generate the MD tag from alignments against a variation graph to be included in the output BAM file?

adamnovak commented 5 years ago

I'm not sure this would quite make sense to do the way you want it done. When you produce a BAM file from vg, I'm pretty sure the CIGAR strings it generates are against the linear reference, and not against the path that the Alignment took through the graph. So while we could add support for generating the MD tag to vg, we would want to be generating the MD tag against the linear reference, and not against the path in the graph to which the read is aligned.

It sounds like what you really need is a way to produce non-surjected BAM from vg, which contains the CIGAR against the possible path in the graph that the read most closely matches, and the corresponding MD tag. Is that what you're looking for? That would be useful for counting matches and mismatches, but not for anything like variant calling, because the reads wouldn't be aligned against the linear reference.

An alternate approach would be to dump the GAM to JSON (vg view -aj mapped.gam), and look at the .path.mapping[].edit arrays for each alignment, which is more or less vg's CIGAR string equivalent. By looking at the edits, you can tell matches from mismatches. Matches have equal from_length (length in the reference graph) and to_length (length in the read) and no sequence, while mismatches have equal from_length and to_length with a sequence set. Indels will have different from_length and to_length values, and inserts specifically will have the inserted sequence.

prithikasritharan commented 5 years ago

Thank you very much @adamnovak for clearing that up. I just needed to construct an exact CIGAR string for read alignments so parsing out the information regarding matches, mismatches and indels from the JSON file means I don't need the MD tag. However, is there a way to distinguish any clipped bases in the alignments from the JSON file?

Also, how do I go about generating a non-surjected BAM from a variation graph?

adamnovak commented 5 years ago

Soft-clipped bases aren't distinguished specially; they'll just look like insertions that abut the end of the read. Since we can't actually have real insertions at the end of the read, any bases that look like that are soft-clipped.

We don't have a way to generate a non-surjected BAM at the moment. To implement it I guess we would need a new option to something (I guess to vg surject? Or maybe vg view?) and some logic to produce a BAM with those semantics.

On 12/11/18, psritharan notifications@github.com wrote:

Thank you very much @adamnovak for clearing that up. I just needed to construct an exact CIGAR string for read alignments so parsing out the information regarding matches, mismatches and indels from the JSON file means I don't need the MD tag. However, is there a way to distinguish any clipped bases in the alignments from the JSON file?

Also, how do I go about generating a non-surjected BAM from a variation graph?

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/vgteam/vg/issues/2018#issuecomment-446237101