The Variant Calling Megathread

It's time to talk about variant calling (again). While working on a bit of a side project we started imaging certain things we needed for really great variant calling on graphs. There's already some exploratory work going on. This can be a place to dump ideas on the subject.

Just an update from our end:

What variant calling that there is in the main vg repo isn't very useful. But in our (UCSC) forks, we've polished it up a bit and added a VCF exporter thanks to Adam. Once it's done and the graph bakeoff paper is submitted, we'll have to do a big merge to get it (and a few of Adam's other fixes) back into the main repo. I don't think any of this will overlap too much with what you're doing on structural variants, but it should be a useful baseline for comparison. Especially since we'll have it pretty well-evaluated for the paper.

On the mapping end, one thing we've learned in our evaluation is that it'd be really nice to have map qualities (I think there's an old issue or two on this). And better taking into account of insert sizes between paired reads in particular. Not having these often seems to put us at a disadvantage when comparing vg calls to those of existing pipelines..

On Fri, Feb 5, 2016 at 7:06 AM, Eric T. Dawson notifications@github.com wrote:

It's time to talk about variant calling. While working on a bit of a side project https://github.com/edawson/hpv_minION_analysis we started imaging certain things we needed for really great variant calling on graphs. There's already some exploratory work going on https://github.com/edawson/ggsv.git. This can be a place to dump ideas on the subject.

— Reply to this email directly or view it on GitHub https://github.com/ekg/vg/issues/193.

I heard from @edawson that he's on his way to superbubble detector integration.

The input to that will be a bit messy. I'm not sure the best way to go about things. From a conceptual level, it's nice to think about a graph which we integrate the reads into as paths, but that might be difficult to implement in reasonable memory even for small regions we work on in variant calling.

I agree that the current superbubble addition is awesome for small things but not feasible at scale. Ideally, I'd like to work off just the alignments and the XG index. I expect I'll have to walk the alignments to find mismatches to the graph, then use the XG (and some auxiliary structure) to match positions in the read to those in the graph and keep track of depth information. I can't imagine I could use much more than that on a large-ish region.

What sort of pairing information remains after alignment? If I remember right the Fragment proto provides a sort of array containing the related Alignments of the mates.

I've opened a pull request for supbubs but I've got a lot of manual merging left to do to bring it inline with vgteam/master.

I think we need to focus on the "small" case now. For any superbubble that is local to a given neighborhood, we only need to pick up that neighborhood to detect the superbubble. So, we can always take the XG index, extract a bit of the graph, and then detect superbubbles in it provided we have superbubble detection that works on VG. However, if we base things around XG then we will always need to construct the XG index for a given graph (also this isn't that bad but it seems more intensive a transformation than just using the VG class object).

My dream process would take a small part of the graph (small because I'm not dreaming big yet), edit it with the alignments and record the translation from each position in it back into the joint graph/alignment space of the input, then translate this into the SUPBUB input format (also recording another translation) and then detect superbubbles. The superbubbles we detect should then point back through the translation to the edited and/or original graph. If we're using the original graph, we can find the superbubbles through the alignments+graph in terms of graph positions via the translation map. Then, we cut the alignments (using functions in alignment.cpp, which need to be extended to handle quality btw) in order to make haplotype observations in a format suitable for freebayes. With a bit of hacking then we can put in all the other additional information about the read placement and so on that freebayes uses in its model, run the model search and find an optimal genotyping across as many samples as we want to examine simultaneously. This won't have any limitation as to ploidy or the number of alleles at the site, but it will require some surgery.

For larger stuff we probably do need to work from the XG index... there we could do the same step of converting into the SUPBUB format first. Then the rest looks remarkably similar. Actually, maybe we can take either an XG or VG without any real pain provided this translation.

So we've got several methods in development. Let's document them.

alignment filtering (remove low-frequency observations of alleles, consider what remains true)
pileup -> call (@glennhickey)
graph2vcf ? (@glennhickey and @adamnovak --- I don't know if this really exists, but you have something equivalent to deconstruct, no?)
deconstruct (@edawson --- basically this "calls" what's in the graph based on superbubbles and a given path)
GGSV (@edawson, what's up with this?)
haplotypes-through-superbubbles (my proposed approach)

This is a hard problem but solutions can be written quickly, so we should try everything we can.

For me, the ideal method for variant calling in graphs is:

scale invariant (SVs are called in the same way as SNPs)
generic for DAGs and cyclic graphs
efficient :smile:

@edawson and I had a long conversation in chat about this. I'll try to summarize here.

Erik Garrison, [08.04.16 15:05]
There is some generic pattern here that is common between the large and small cases
We should aim for that rather than a custom solution for every signal type
That's what I'm thinking toward

Eric Dawson, [08.04.16 15:05]
Right. I think it might work.

Erik Garrison, [08.04.16 15:06]
To handle the large cases
Maybe we can compress the graph

Erik Garrison, [08.04.16 15:08]
So we can do this in a weird way
The compression
We find a superbubble
Now we can imagine an operation
Which replaces all nested superbubbles with single nodes
Then we call paths through it
So that means we keep a positional transform mapping from the nodes in the nested superbubbles to nodes in the new graph
We can collect evidence for the amounts of reads traversing parts of the graph
And then use that to derive p(data) estimates
For the bits of the graph
And we can layer the genotype likelihoods on top of this
Then we can also go smaller
We now go to the next smallest nested superbubble and do the same operation

Eric Dawson, [08.04.16 15:14]
That would compress the graph down I guess
Essentially eliminates short variants and maintains long ones.

Erik Garrison, [08.04.16 15:14]
Well we don't want to think about every possible path
There are way too many
This nested superbubble compression will let us think about the varieties of large structures without the space explosion
Unless the graph is crazy

Eric Dawson, [08.04.16 15:16]
It's interesting certainly, it's like a reverse-contruction that decreases your small variant signal so that it's easier to see the large variant one.
I kind of want to try it.

Erik Garrison, [08.04.16 15:16]
That's exactly it
Some coordination of node removal and normalization will give the right result
It will be kindve complex
To handle the read data precisely

Erik Garrison, [08.04.16 15:17]
No sorry
What I said before about the projection
From the old to new id apace
Space
Is all we need
Then we stream back over the alignments and record their contribution to the right nodes in the final graph

Eric Dawson, [08.04.16 15:19]
Oooohh that's actually slick.
We'd have to do a bunch of simplification but it could work as an SV genotyper sort of thing.

Erik Garrison, [08.04.16 15:20]
Actually it will work at every level

Eric Dawson, [08.04.16 15:20]
Plus depth filtering works. It's a known variant caller.
Because of the translations?

Erik Garrison, [08.04.16 15:20]
It will be exactly the same to genotype a SNP, a small indel, a big indel, or a translocation

Eric Dawson, [08.04.16 15:22]
Ok, just going to check I understand. So you'd map from the nodes the reads hit (in original id space) to the new nodes defining a single superbubble in that space.

Erik Garrison, [08.04.16 15:24]
I'll try to draw

photo_2016-04-17_11-18-31

Erik Garrison, [08.04.16 15:27]
If so the idea is you would merge all of yellow into one node
And record a mapping from original -> new node id
We need to make sure the node id space doesn't conflict
But that's the idea
Now calling the big superbubble looks like calling the small one
We want to compress it as much as possible
In the limit this would even be helpful for debugging a big graph

Eric Dawson, [08.04.16 15:29]
Right. This seems like an easy transformation since the superbubbles only have indegree/outdegree one.
The ability to grab the nodes inside one is just a BFS (already in deconstruct).
So and we just need a map original node -> compressed node

Erik Garrison, [08.04.16 15:30]
There is some awkwardness with the translation because we may need to describe inversions
But otherwise yes that's it

Eric Dawson, [08.04.16 15:31]
Those are weird anyway since we DAGify
and unroll
And I am still torn how to handle that

Erik Garrison, [08.04.16 15:31]
Well you can keep layering the translation
There is a function that does it for you
Look at how it works in VG::align

Eric Dawson, [08.04.16 15:33]
In vg.cpp? Can't seem to find it

Erik Garrison, [08.04.16 15:33]
It is called from there
First dagify unfold
Then overlay node translation

Erik Garrison, [08.04.16 15:34]
I think this is the way
Insofar as
It provides the natural compression for variant calling at a particular scale

To follow onto the last comment, the idea is a recursive algorithm that compresses nested superbubbles into single nodes, then calls paths through the compressed graph.

This makes the genotyping algorithm look exactly the same regardless of which variant type we're working with. This may be something that we can exploit to write less code to do the genotyping.

I'll explain the two I've been working on.

Deconstruct calls variants present in the graph, first inserting Paths implied by Alignments using mod, then enumerating the superbubbles present in the graph, and finally using a breadth-first search (BFS) through each superbubble to transform to VCF. It works for small cases but is probably relatively inefficient (definitely in space, as the graph is copied to a supbub Graph before superbubbles are enumerated).

I've added the scrub command which gives a bunch of different filters for Alignments that are of interest in variant calling. Currently depth, average quality, soft-clip (pulled from Glenn's overhang filter), reversing path, percent identity to graph, and average quality (for Sanger PHREDs) are implemented. There is also a -v flag which inverts any filter (much like grep -v). I'm working on implementing average depth (calling it coverage, almost done), individual quality (almost done) and path divergence (e.g. for Alignments that span portions of multiple chromosomes). There is an extension of the path divergence filter that can catch cycles that @ekg proposed this morning; I'll write that up in another post. I also need to add an indel filter (where an insertion or deletion makes up a given portion of the Alignment). I'm hoping to finish the last few filters early this week.

So, to call variants (both SNV and SV), one can do a map | filter | mod -i | deconstruct pipeline. The filter and mod -i step(s) may be performed multiple times to select for specific Alignments that imply one type of variation (e.g. just split reads for translocations or just reads that imply no splits, reversions or divergences to capture only SNPs). You can then deconstruct everything out, although I think there are some limitations (I'm not 100% sure superbubbles will catch complex translocations if there is other variation around, and my superbubble->BFS->VCF code is still really rough).

ggsv does what the map | filter | mod -i | deconstruct pipeline does but without inserting the Alignments back into the graph, and it only seeks to call structural variants. It is directly analogous to the simplest current algorithms for structural variant calling:

Find some reads (here paired or banded) that don't map the way we expect
Label each discordant, split, reversing, cyclic or path-divergent read (or read pair).
Keep tabs of all of these and build up evidence supporting an SV at recurrent points.
Transform to VCF and yield a score (often just a T-test or something simple) at each putative SV.

The actual object should be just a container for a Filter and an xg index, with functionality to output some pseudo-vcf based on the coordinates of breakpoints in the Alignments and a map for positions to evidence. If we ever get a distance oracle for graph positions I'd add that as well so we could look for discordant reads pairs/bands. It might be nice to do some local realignment at putative breakpoints, too, but I'm getting way ahead of myself there. Since ggsv is mostly graph-less post alignment it should be compute-heavy but space-cheap.

To do the superbubble compression we need to rip out a lot of the supbub code or rewrite it to work only in vg-space. I'm working on this too but it hasn't been at the top of my list. I'm also concerned we will miss some types of SVs with this but I'll probably be proved wrong once it's written and running.

Our "graph2vcf" is here https://github.com/adamnovak/glenn2vcf. It takes as input a graph and copy number annotation file (both output by vg call), and makes a vcf.

It should get folded into vg at some point, and maybe replaced by or merged with, or at the very least compared to the super bubble stuff.

On Sun, Apr 17, 2016 at 5:16 AM, Erik Garrison notifications@github.com wrote:

So we've got several methods in development. Let's document them.

alignment filtering (remove low-frequency observations of alleles, consider what remains true)

pileup -> call (@glennhickey https://github.com/glennhickey)

graph2vcf ? (@glennhickey https://github.com/glennhickey and @adamnovak https://github.com/adamnovak --- I don't know if this really exists, but you have something equivalent to deconstruct, no?)

deconstruct (@edawson https://github.com/edawson --- basically this "calls" what's in the graph based on superbubbles and a given path)

GGSV (@edawson https://github.com/edawson, what's up with this?)

haplotypes-through-superbubbles (my proposed approach)

This is a hard problem but solutions can be written quickly, so we should try everything we can.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/193#issuecomment-210983897

This sounds cool. Can you give me a command line that I can use to try out your deconstruct based pipeline? I'm curious to run it on the pilot data, which is 50X Illumina paired reads (101 bases per end, I think) if that's relevant.

As for an update about our end: We're still benchmarking our vg variant calls with respect to Platinum Genomes calls, as well as GATK, Platypus, FreeBayes etc. The biggest factor separating vg from these callers remains the alignments.

Adam's fix for the gssw dynamic programming was big step forward, but we're still seeing many missed / spurious calls that look linked to paired ends getting mapped too far apart. Adam's working on this now and has already hacked in something using the current id-space heuristics that's helped some, but the path distance support is obviously where we need to go.

On Sun, Apr 17, 2016 at 8:14 AM, Eric T. Dawson notifications@github.com wrote:

I'll explain the two I've been working on.

Deconstruct calls variants present in the graph, first inserting Paths implied by Alignments using mod, then enumerating the superbubbles present in the graph, and finally using a breadth-first search (BFS) through each superbubble to transform to VCF. It works for small cases but is probably relatively inefficient (definitely in space, as the graph is copied to a supbub Graph before superbubbles are enumerated).

I've added the scrub command which gives a bunch of different filters for Alignments that are of interest in variant calling. Currently depth, average quality, soft-clip (pulled from Glenn's overhang filter), reversing path, percent identity to graph, and average quality (for Sanger PHREDs) are implemented. There is also a -v flag which inverts any filter (much like grep -v). I'm working on implementing average depth (calling it coverage, almost done), individual quality (almost done) and path divergence (e.g. for Alignments that span portions of multiple chromosomes). There is an extension of the path divergence filter that can catch cycles that @ekg https://github.com/ekg proposed this morning; I'll write that up in another post. I also need to add an indel filter (where an insertion or deletion makes up a given portion of the Alignment). I'm hoping to finish the last few filters early this week.

So, to call variants (both SNV and SV), one can do a map | filter | mod -i | deconstruct pipeline. The filter and mod -i step(s) may be performed multiple times to select for specific Alignments that imply one type of variation (e.g. just split reads for translocations or just reads that imply no splits, reversions or divergences to capture only SNPs). You can then deconstruct everything out, although I think there are some limitations (I'm not 100% sure superbubbles will catch complex translocations if there is other variation around, and my superbubble->BFS->VCF code is still really rough).

ggsv does what the map | filter | mod -i | deconstruct pipeline does but without inserting the Alignments back into the graph, and it only seeks to call structural variants. It is directly analogous to the simplest current algorithms for structural variant calling:

Find some reads (here paired or banded) that don't map the way we expect

Label each discordant, split, reversing, cyclic or path-divergent read (or read pair).

Keep tabs of all of these and build up evidence supporting an SV at recurrent points.

Transform to VCF and yield a score (often just a T-test or something simple) at each putative SV.

The actual object should be just a container for a Filter and an xg index, with functionality to output some pseudo-vcf based on the coordinates of breakpoints in the Alignments and a map for positions to evidence. If we ever get a distance oracle for graph positions I'd add that as well so we could look for discordant reads pairs/bands. It might be nice to do some local realignment at putative breakpoints, too, but I'm getting way ahead of myself there.

To do the superbubble compression we need to rip out a lot of the supbub code or rewrite it to work only in vg-space. I'm working on this too but it hasn't been at the top of my list. I'm also concerned we will miss some types of SVs with this but I'll probably be proved wrong once it's written and running.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/193#issuecomment-211008194

Are you collecting examples of these? I would love to see some.

If you map un-paired with bwa does freebayes, samtools or platypus make the same errors as vg call?

Mapping quality, which is used by the caller, is also essential. We don't have anything equivalent now.

Also, we don't use haplotype calling, which will tend to help a lot with misalignments.

I'm just pointing out that things are not apples to apples.

On Mon, Apr 18, 2016, 16:54 Glenn Hickey notifications@github.com wrote:

This sounds cool. Can you give me a command line that I can use to try out your deconstruct based pipeline? I'm curious to run it on the pilot data, which is 50X Illumina paired reads (101 bases per end, I think) if that's relevant.

As for an update about our end: We're still benchmarking our vg variant calls with respect to Platinum Genomes calls, as well as GATK, Platypus, FreeBayes etc. The biggest factor separating vg from these callers remains the alignments.

Adam's fix for the gssw dynamic programming was big step forward, but we're still seeing many missed / spurious calls that look linked to paired ends getting mapped too far apart. Adam's working on this now and has already hacked in something using the current id-space heuristics that's helped some, but the path distance support is obviously where we need to go.

On Sun, Apr 17, 2016 at 8:14 AM, Eric T. Dawson notifications@github.com wrote:

I'll explain the two I've been working on.

Deconstruct calls variants present in the graph, first inserting Paths implied by Alignments using mod, then enumerating the superbubbles present in the graph, and finally using a breadth-first search (BFS) through each superbubble to transform to VCF. It works for small cases but is probably relatively inefficient (definitely in space, as the graph is copied to a supbub Graph before superbubbles are enumerated).

I've added the scrub command which gives a bunch of different filters for Alignments that are of interest in variant calling. Currently depth, average quality, soft-clip (pulled from Glenn's overhang filter), reversing path, percent identity to graph, and average quality (for Sanger PHREDs) are implemented. There is also a -v flag which inverts any filter (much like grep -v). I'm working on implementing average depth (calling it coverage, almost done), individual quality (almost done) and path divergence (e.g. for Alignments that span portions of multiple chromosomes). There is an extension of the path divergence filter that can catch cycles that @ekg https://github.com/ekg proposed this morning; I'll write that up in another post. I also need to add an indel filter (where an insertion or deletion makes up a given portion of the Alignment). I'm hoping to finish the last few filters early this week.

So, to call variants (both SNV and SV), one can do a map | filter | mod -i | deconstruct pipeline. The filter and mod -i step(s) may be performed multiple times to select for specific Alignments that imply one type of variation (e.g. just split reads for translocations or just reads that imply no splits, reversions or divergences to capture only SNPs). You can then deconstruct everything out, although I think there are some limitations (I'm not 100% sure superbubbles will catch complex translocations if there is other variation around, and my superbubble->BFS->VCF code is still really rough).

ggsv does what the map | filter | mod -i | deconstruct pipeline does but without inserting the Alignments back into the graph, and it only seeks to call structural variants. It is directly analogous to the simplest current algorithms for structural variant calling:

Find some reads (here paired or banded) that don't map the way we expect

Label each discordant, split, reversing, cyclic or path-divergent read (or read pair).

Keep tabs of all of these and build up evidence supporting an SV at recurrent points.

Transform to VCF and yield a score (often just a T-test or something simple) at each putative SV.

The actual object should be just a container for a Filter and an xg index, with functionality to output some pseudo-vcf based on the coordinates of breakpoints in the Alignments and a map for positions to evidence. If we ever get a distance oracle for graph positions I'd add that as well so we could look for discordant reads pairs/bands. It might be nice to do some local realignment at putative breakpoints, too, but I'm getting way ahead of myself there.

To do the superbubble compression we need to rip out a lot of the supbub code or rewrite it to work only in vg-space. I'm working on this too but it hasn't been at the top of my list. I'm also concerned we will miss some types of SVs with this but I'll probably be proved wrong once it's written and running.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/193#issuecomment-211008194

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/193#issuecomment-211413676

Yeah, treating the reads as unpaired in bwa then running FreeBayes gives a fairly comparable accuracy to vg (ie much lower than FreeBayes with normal bwa). This holds for Platypus. On our examples, stripping out the mapping qualities from the BAMs doesn't affect FreeBayes's performance any (but it does for Platypus).

Examples abound, in particular, in the LRC_KIR region (chr19:54025633-55084318) which has some paralogous sequence scattered around. vg happily maps read ends miles apart (without ever running into its rescue code), whereas bwa is much more reluctant to do so because it takes pairing into account. I think this is very much an "apples to apples" comparison.

Adam's patch to always use some of the rescue heuristics helps significantly, epsecially on the primary graph. But the other graphs still have a lot of precision issues. Here's a single, but I think very typical example, where vg aligns ends 34kb apart to save on a mismatch aligning them 150bp apart.

BWA-MEM ERR194147.679991710 163 chr19 54860908 25 101M = 54861056 249 CAAGATGACTGTGGAGAGACGGAGAGCACACTGGGTACACAGGAAACTAAGGAGGAACAAGGAGTGTGTGTTTGACACTCACAGCCATTGGATTCACCTCG BB<BBBBBBBBBBBBBBBB<7BBBBBBBBB<BBBBBB<B<BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBFBBBBFBFBBBBBFBBFBBBBB< NM:i:3 MD:Z:12A41C9C36 AS:i:86 XS:i:91 XA:Z:ref,+801220,101M,2;ref,+819061,101M,3;ref,+754804,101M,3;ref,+723527,101M,3; ERR194147.679991710 83 chr19 54861056 25 101M = 54860908 -249 AGGGAGGCTCAGTTGCATAACTGGAATCTAGGAGACCGTGGAAAAGGCAATTGCCGCCCCACTGGTGAAATGTGGTGCTGATTTAGACACTAAATGAATGA BBBBFFBFBFFBBFBBBBBBFBFBBBBFBBBBBBBB<BBBBBBBBBBBBBBBBB<BBBBBBBBBBBBBBBBBBB<BBBBBBBBBBBBBBBBBBBBBBB<BB NM:i:0 MD:Z:101 AS:i:101 XS:i:96 XA:Z:ref,-723675,101M,1;ref,-754952,101M,2;ref,-819210,101M,4;

VG ERR194147.679991710/2 2 chr19 54826853 60 101M = 54826853 0CAAGATGACTGTGGAGAGACGGAGAGCACACTGGGTACACAGGAAACTAAGGAGGAACAAGGAGTGTGTGTTTGACACTCACAGCCATTGGATTCACCTCG BB<BBBBBBBBBBBBBBBB<7BBBBBBBBB<BBBBBB<B<BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBFBBBBFBFBBBBBFBBFBBBBB< ERR194147.679991710/1 18 chr19 54861056 60 101M = 54861056 0AGGGAGGCTCAGTTGCATAACTGGAATCTAGGAGACCGTGGAAAAGGCAATTGCCGCCCCACTGGTGAAATGTGGTGCTGATTTAGACACTAAATGAATGA BB<BBBBBBBBBBBBBBBBBBBBBBB<BBBBBBBBBBBBBBBBBBB<BBBBBBBBBBBBBBBBB<BBBBBBBBFBBBBFBFBBBBBBFBBFFBFBFFBBBB

Looking in the browser, for BWA-MEM, we see

Alignment of ERR194147.679991710 to chr19:54860908-54861008:

00000001 CAAGATGACTGTGGAGAGACGGAGAGCACACTGGGTACACAGGAAACTAA 00000050

|||||||||||| ||||||||||||||||||||||||||||||||||||| >>>>>>>> 54860908 caagatgactgtagagagacggagagcacactgggtacacaggaaactaa 54860957

00000051 GGAGGAACAAGGAGTGTGTGTTTGACACTCACAGCCATTGGATTCACCTC 00000100

|||| ||||||||| ||||||||||||||||||||||||||||||||||| >>>>>>>> 54860958 ggagcaacaaggagcgtgtgtttgacactcacagccattggattcacctc 54861007

00000101 G 00000101

| >>>>>>>> 54861008 g 54861008

and for VG

Alignment of ERR194147.679991710/2 to chr19:54826853-54826953: 00000001 CAAGATGACTGTGGAGAGACGGAGAGCACACTGGGTACACAGGAAACTAA 00000050

||||| |||||||||||||||||||||||||||||||||||||||||||| >>>>>>>> 54826853 caagacgactgtggagagacggagagcacactgggtacacaggaaactaa 54826902

00000051 GGAGGAACAAGGAGTGTGTGTTTGACACTCACAGCCATTGGATTCACCTC 00000100

|||| ||||||||||||||||||||||||||||||||||||||||||||| >>>>>>>> 54826903 ggagcaacaaggagtgtgtgtttgacactcacagccattggattcacctc 54826952

00000101 G 00000101

| >>>>>>>> 54826953 g 54826953

On Mon, Apr 18, 2016 at 12:41 PM, Erik Garrison notifications@github.com wrote:

Are you collecting examples of these? I would love to see some.

If you map un-paired with bwa does freebayes, samtools or platypus make the same errors as vg call?

Mapping quality, which is used by the caller, is also essential. We don't have anything equivalent now.

Also, we don't use haplotype calling, which will tend to help a lot with misalignments.

I'm just pointing out that things are not apples to apples.

On Mon, Apr 18, 2016, 16:54 Glenn Hickey notifications@github.com wrote:

This sounds cool. Can you give me a command line that I can use to try out your deconstruct based pipeline? I'm curious to run it on the pilot data, which is 50X Illumina paired reads (101 bases per end, I think) if that's relevant.

As for an update about our end: We're still benchmarking our vg variant calls with respect to Platinum Genomes calls, as well as GATK, Platypus, FreeBayes etc. The biggest factor separating vg from these callers remains the alignments.

Adam's fix for the gssw dynamic programming was big step forward, but we're still seeing many missed / spurious calls that look linked to paired ends getting mapped too far apart. Adam's working on this now and has already hacked in something using the current id-space heuristics that's helped some, but the path distance support is obviously where we need to go.

On Sun, Apr 17, 2016 at 8:14 AM, Eric T. Dawson < notifications@github.com> wrote:

I'll explain the two I've been working on.

Deconstruct calls variants present in the graph, first inserting Paths implied by Alignments using mod, then enumerating the superbubbles present in the graph, and finally using a breadth-first search (BFS) through each superbubble to transform to VCF. It works for small cases but is probably relatively inefficient (definitely in space, as the graph is copied to a supbub Graph before superbubbles are enumerated).

I've added the scrub command which gives a bunch of different filters for Alignments that are of interest in variant calling. Currently depth, average quality, soft-clip (pulled from Glenn's overhang filter), reversing path, percent identity to graph, and average quality (for Sanger PHREDs) are implemented. There is also a -v flag which inverts any filter (much like grep -v). I'm working on implementing average depth (calling it coverage, almost done), individual quality (almost done) and path divergence (e.g. for Alignments that span portions of multiple chromosomes). There is an extension of the path divergence filter that can catch cycles that @ekg https://github.com/ekg proposed this morning;

I'll write that up in another post. I also need to add an indel filter (where an insertion or deletion makes up a given portion of the Alignment). I'm hoping to finish the last few filters early this week.

So, to call variants (both SNV and SV), one can do a map | filter | mod -i | deconstruct pipeline. The filter and mod -i step(s) may be performed multiple times to select for specific Alignments that imply one type of variation (e.g. just split reads for translocations or just reads that imply no splits, reversions or divergences to capture only SNPs). You can then deconstruct everything out, although I think there are some limitations (I'm not 100% sure superbubbles will catch complex translocations if there is other variation around, and my superbubble->BFS->VCF code is still really rough).

ggsv does what the map | filter | mod -i | deconstruct pipeline does but without inserting the Alignments back into the graph, and it only seeks to call structural variants. It is directly analogous to the simplest current algorithms for structural variant calling:

Find some reads (here paired or banded) that don't map the way we expect

Label each discordant, split, reversing, cyclic or path-divergent read (or read pair).

Keep tabs of all of these and build up evidence supporting an SV at recurrent points.

Transform to VCF and yield a score (often just a T-test or something simple) at each putative SV.

The actual object should be just a container for a Filter and an xg index, with functionality to output some pseudo-vcf based on the coordinates of breakpoints in the Alignments and a map for positions to evidence. If we ever get a distance oracle for graph positions I'd add that as well so we could look for discordant reads pairs/bands. It might be nice to do some local realignment at putative breakpoints, too, but I'm getting way ahead of myself there.

To do the superbubble compression we need to rip out a lot of the supbub code or rewrite it to work only in vg-space. I'm working on this too but it hasn't been at the top of my list. I'm also concerned we will miss some types of SVs with this but I'll probably be proved wrong once it's written and running.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/193#issuecomment-211008194

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/193#issuecomment-211413676

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/193#issuecomment-211464913

@glennhickey ok, awesome, that clarifies things for me. @adamnovak did you manage a fix and could we pull it into the repo (even if it's a hack)?

We have much better end pairing now, as well as vg genotype and VCF output from vg call. We might be done with this, with future work focusing on improvements.

Agreed. This was meant to be a catchall for SV, INDEL, and SNP calling and the requirements/issues/interactions between all of them. A concise, central description of the various methods would also be useful. This information probably belongs in the wiki at some point rather than hanging out in an issue.

On Tuesday, October 11, 2016, adamnovak notifications@github.com wrote:

We have much better end pairing now, as well as vg genotype and VCF output from vg call. We might be done with this, with future work focusing on improvements.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/193#issuecomment-252997223, or mute the thread https://github.com/notifications/unsubscribe-auth/ACeZUdAPGOH26_CxwA26o0idqIsN4VzEks5qy9B_gaJpZM4HUOTA .

Sent from my iPhone

vgteam / vg

The Variant Calling Megathread #193