Open indraniel opened 6 years ago
Hey @indraniel ,
The documentation on variant calling has gotten pretty fragmented as we've experimented with new patterns. In brief, there are two calling pipelines:
vg call
, which does samtools mpileup style calling. This pipeline has been testing extensively and is actually producing good variant calls. vg genotype
, which is an experimental caller that uses bubbles in the graph. The genotype pipeline is not nearly so developed as vg call
is.My recommendation for now is to try vg call
, rather than genotype. The chunk_call
script should be able to give you nice SNV/indel calls and may call small SVs correctly. It's just a wrapper around the chunk
and call
commands of vg, and if your genomes aren't that big you could even try just running call directly.
In the future we plan to refactor genotype to do all-in-one SNV/indel/SV calling, but that's quite a ways down the road.
If you try the chunk / call pipeline, please let me know how it goes here so I can update the wiki.
I have also had a hard time running whole genome calling. I don't think it should require anything more complicated than giving vg call the .gam and .xg index. We should rewrite these interfaces so that the complexity of chunking and sorting is hidden from the user, as it is really irrelevant for most applications. The chunk_call pipeline has its uses, but putting the glue logic in the python layer is only going to cause confusion.
Interesting. As a neophyte with vg
, it appears from the wiki pages that vg genotype
is more preferred over vg call
. I saw very little documentation about vg call
.
This is true, and confusing! We should fix this and decide on a preferred algorithm for the documentation.
Hello ,
@ekg @edawson I'm currently having the same question/confusion about the genotyping step. Since this is a bit old, I was wondering if you have any new recommendations about this process?
@indraniel , please if you tried them both or got a preference on your data, your feedback will be much appreciated.
There's now a wiki page that describes our most mature variant calling workflow: https://github.com/vgteam/vg/wiki/Whole-genome-calling-and-genotyping
I'm trying to loosely follow the NA24385 work as described in the biorxiv paper, and working with a whole genome variation graph wiki page and I'm now having trouble figuring out how to best do variant calling.
I've finally generated a variant graph and alignment/GAM file with our internally generated NA24385 sequence data based on following the "working with a whole genome variation graph" page. Now I am unclear as how to best proceed with variant calling.
The README states that
I find no example of directly using the
vg chunk
command on the wiki pages. I do see ascript/chunked_call
script that seems to be runningvg chunk
and doing additional things, but I am not clear if I should be doing any operations before or after when running the script.Based on the README, it appears that I should create a GAM index (via
vg index
) and filter secondary and ambiguous read mappings viavg filter
on the alignment GAM generated from "working with a whole genome variation graph" before I can runvg genotype
. I don't see that explicitly done in thechunked_call
script. When are those indexing and filtering operations needed? Is there an explicit example of using thechunked_call
script?On a high level, I feel that I should be doing things something similar to this:
but it is not clear to me if it is the right approach. Any advice or recommendations on how to proceed further with variant calling would be much appreciated.
Thanks in advance!