vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

Split-reads: the default behavior of `vg map`, and general GAM question #2193

Open evanbiederstedt opened 5 years ago

evanbiederstedt commented 5 years ago

Hi there

I've been discussing split-reads with @adamnovak and @glennhickey; it may be best to raise this question in a github issue for consensus. Perhaps this will motivate more constructive discussion.

(1) what is the current behavior of vg map when it comes to split reads? Do reads get split by default?

As @adamnovak pointed out, there is an is_secondary flag but no is_supplementary flag.

https://github.com/vgteam/vg/blob/master/src/vg.proto#L124

(2) How would split-reads by represented in a GAM? I think the rough consensus is (to paraphrase)

A split read on a graph would come up as an alignment in the GAM that spans an edge that's not in the graph. So two nodes appear consecutively in the Path object that aren't connected in the underlying graph. One would represent split reads by just allowing the Path in the read's Alignment object to make jumps not following edges in the graph.

(3) How would vg stats currently deal with such objects? Are there any other tools/recommended methods to look at or quantify these?

Thanks for the help!

edawson commented 5 years ago

In the (now pretty distant) past, I would mostly see soft-clipping if a read covered a big structural variant that would be indicated by a split read, especially if the variant in question was larger than a single node. When I did see them in the GAM I vaguely remember them having big edits in their paths.

The vg sift command should handle pulling these out of a GAM file. If you've got test data I'm happy to peek at it, and if you need some I have some as well.

ekg commented 5 years ago

Reads are not split by default, but this may be done manually using the -w parameter to vg map. If this is set below the read length, then the read is aligned in chunks and the resulting alignments are threadings through the multimappings of each chunk. The precise boundaries of each chunk are determined dynamically at alignment time through "patching" of the partial alignment against the graph. Although this sounds complicated, it works quite well in practice for longer reads.

We should probably move to producing an output akin to PAF, wherein "split" reads are the default setting. The mapper itself would need to change for this to occur. I don't understand the difference between secondary and supplementary mappings. I would prefer not to go down that road, and rather to invest in a data format (like PAF) that makes this distinction irrelevant.

evanbiederstedt commented 5 years ago

Hi @edawson

if you need some I have some as well.

This could be useful, yes! I'm curious what vg sift will give. Thanks!

astatham commented 5 years ago

I don't understand the difference between secondary and supplementary mappings.

Hi @ekg Secondary are where the (whole, or partial) read can map to multiple locations (eg repeats in the genome) Supplementary are where the read can be split and the two parts map to separate locations (eg structural variation not captured in the reference)