how to perform variant calling with a whole genome graph (NA24385 example)

indraniel commented 6 years ago

I'm trying to loosely follow the NA24385 work as described in the biorxiv paper, and working with a whole genome variation graph wiki page and I'm now having trouble figuring out how to best do variant calling.

I've finally generated a variant graph and alignment/GAM file with our internally generated NA24385 sequence data based on following the "working with a whole genome variation graph" page. Now I am unclear as how to best proceed with variant calling.

The README states that

Input must be split into chunks (see vg chunk) in order to run on whole genome.
...[snip]... To produce a VCF file for a whole chromosome, the graph must be cut up along the reference genome and called in chunks. scripts/chunked_call wraps this functionality to produce chromosome-sized VCFs in a single command line (from a GAM file and XG index)

I find no example of directly using the vg chunk command on the wiki pages. I do see a script/chunked_call script that seems to be running vg chunk and doing additional things, but I am not clear if I should be doing any operations before or after when running the script.

Based on the README, it appears that I should create a GAM index (via vg index) and filter secondary and ambiguous read mappings via vg filter on the alignment GAM generated from "working with a whole genome variation graph" before I can run vg genotype. I don't see that explicitly done in the chunked_call script. When are those indexing and filtering operations needed? Is there an explicit example of using the chunked_call script?

On a high level, I feel that I should be doing things something similar to this:

# Step 1 - filter secondary and ambiguous read mappings out of the gam
vg filter alignment.gam -r 0.90 -fu -s 2 -o 0 -D 999 -x graph.xg > filtered.gam

# Step 2 - create a GAM index
vg index -d mapped.gam.index -N filtered.gam

# Step 3a - run `vg chunk` or `chunk_call`???
???

# Step 3b - or directly make variant calls ???
vg genotype -v graph.vg mapped.gam.index/ > calls.vcf

but it is not clear to me if it is the right approach. Any advice or recommendations on how to proceed further with variant calling would be much appreciated.

Thanks in advance!

edawson commented 6 years ago

Hey @indraniel ,

The documentation on variant calling has gotten pretty fragmented as we've experimented with new patterns. In brief, there are two calling pipelines:

vg call, which does samtools mpileup style calling. This pipeline has been testing extensively and is actually producing good variant calls.
vg genotype, which is an experimental caller that uses bubbles in the graph. The genotype pipeline is not nearly so developed as vg call is.

My recommendation for now is to try vg call, rather than genotype. The chunk_call script should be able to give you nice SNV/indel calls and may call small SVs correctly. It's just a wrapper around the chunk and call commands of vg, and if your genomes aren't that big you could even try just running call directly.

In the future we plan to refactor genotype to do all-in-one SNV/indel/SV calling, but that's quite a ways down the road.

If you try the chunk / call pipeline, please let me know how it goes here so I can update the wiki.

ekg commented 6 years ago

I have also had a hard time running whole genome calling. I don't think it should require anything more complicated than giving vg call the .gam and .xg index. We should rewrite these interfaces so that the complexity of chunking and sorting is hidden from the user, as it is really irrelevant for most applications. The chunk_call pipeline has its uses, but putting the glue logic in the python layer is only going to cause confusion.

indraniel commented 6 years ago

Interesting. As a neophyte with vg, it appears from the wiki pages that vg genotype is more preferred over vg call. I saw very little documentation about vg call.

ekg commented 6 years ago

This is true, and confusing! We should fix this and decide on a preferred algorithm for the documentation.

Chaima-Bouchenak commented 2 years ago

Hello ,

@ekg @edawson I'm currently having the same question/confusion about the genotyping step. Since this is a bit old, I was wondering if you have any new recommendations about this process?

@indraniel , please if you tried them both or got a preference on your data, your feedback will be much appreciated.

jeizenga commented 2 years ago

There's now a wiki page that describes our most mature variant calling workflow: https://github.com/vgteam/vg/wiki/Whole-genome-calling-and-genotyping

vgteam / vg

how to perform variant calling with a whole genome graph (NA24385 example) #1613