vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.12k stars 194 forks source link

Variant calling from genome graph directly #1864

Open SAMtoBAM opened 6 years ago

SAMtoBAM commented 6 years ago

I feel there must be an easy answer to this query but I am unable to see it at present.

I am attempting to create a graph (currently with just two genomes) and then use the graph to create a vcf file, the idea being that a vg file can be created, updated incrementally with new genomes and have a new vcf file produced as necessary.

However due to what has been noticed in previous threads, the limited documentation of 'vg call' means I can not understand how it is possible. In other examples they have generally used a gam index after aligning reads...

So currently:


vg construct --reference genome_ref.fa > graph_v0.vg
vg index \
-x graph_v0.xg \
-g graph_v0.gsca -k 11 \
graph_v0.vg
vg msga -g graph_v0.vg -f new_genome.fa > genome_graph_v1.v

```g

Thanks
edawson commented 6 years ago

Hey @PhDstuff ,

Getting variation out of the graph directly is more complicated than it seems at first, though it looks like you're using fully assembled genomes so it might make more sense. There used to be a command for this (deconstruct), but I don't think it can be trusted at this point. It's on my todo list of things to refresh, but for now it resides on the hidden CLI.

Is this a use case you see as common in a particular field, and could you provide some test data?

ChriKub commented 6 years ago

What you could try is to export your vg graph as a gfa file using vg view and thread it through reveal variants. Its not ideal but a possible workaround for now.

SAMtoBAM commented 6 years ago

Hello @edawson,

Oh right, so the vg call isn't appropriate for this? and deconstruct works directly (or aims to) on the vg file?

Well we envision using this as a new reference, which would therefore require incremental additions (as new genomes become available) and the ability to look at the variation within the graph.

Currently, as we are in the process of genome assembly and polishing, I am trying to rework the example performed in the vg publication using the Yue et al. yeast genomes, using in particular the cerevisiae strains. Would this dataset still be useful for this test?

SAMtoBAM commented 5 years ago

Hello @edawson Have you seen any progress in this area? It would be interesting to me to use vg as a means to call variation between two genome alignments and currently compiling a comparison of available tools. Thank you

glennhickey commented 5 years ago

vg call makes a VCF from a graph and GAM, assuming a diploid sample. You may be able to abuse it to make a graph from a pairwise alignment with something like (please forgive typos, I haven't run this):

# Extract your paths as a GAM
vg paths genome_graph_v1.vg -X > paths.gam

# call everything
vg augment genome_graph_v1.vg paths.gam -a pileup -Z translation -S support -g 0 -q 0 -M > aug.vg
vg call aug.vg -z translation -s support -r <reference path> -n 0 > output.vcf

Another alternative would be simulating ~30X error-free reads for your graph with vg sim -a and calling on that. As alluded to by others, VCF isn't a great format for representing general graphs. But I think we have the machinery in vg to have something closer to what you want via the snarls API.