vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

What is the best way to generate VCF for vg-toolkit given raw sequences #3697

Open sumit-walia opened 2 years ago

sumit-walia commented 2 years ago

I am trying to generate a variation graph (vg) from raw SARS-CoV-2 sequences (~16k sequences). What would be the best way to generate a VCF from these raw sequences? And does the method still work as the data scales up?

adamnovak commented 2 years ago

We used to support vg msga for making graphs from sequences, but it doesn't scale that high in sequence count.

You could try making a graph with Minigraph, or Minigraph and Cactus together, though I don't know if that would scale enough either. @glennhickey might be able to guess.

You could also throw @ekg's PGGB tool at the problem.

If you want to go sequences -> VCF -> graph... I don't know what tool you would use there. You might want your own tool based on individual pairwise alignments against whatever you are using for your reference.
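As a rough illustration of the "own tool based on pairwise alignments" idea, here is a minimal Python sketch that turns one gapped pairwise alignment (reference row plus sample row) into VCF-style (POS, REF, ALT) tuples. The function name and the input convention (equal-length strings with `-` for gaps, produced by whatever aligner you choose) are assumptions for illustration, not anything vg provides. Indels are anchored on the preceding reference base, and anything before the first reference base is skipped.

```python
def variants_from_alignment(ref_aln, smp_aln):
    """Turn one gapped pairwise alignment into VCF-style (POS, REF, ALT) tuples.

    Hypothetical helper: both inputs are alignment rows of equal length that
    use '-' for gaps. POS is 1-based on the ungapped reference.
    """
    if len(ref_aln) != len(smp_aln):
        raise ValueError("alignment rows must have the same length")

    variants = []
    ref_pos = 0    # number of reference bases consumed so far
    anchor = None  # reference base at ref_pos, used to anchor indels
    i, n = 0, len(ref_aln)
    while i < n:
        r, s = ref_aln[i], smp_aln[i]
        if r != "-" and s != "-":
            # Aligned column: advance on the reference, record substitutions.
            ref_pos += 1
            anchor = r
            if r != s:
                variants.append((ref_pos, r, s))
            i += 1
            continue

        # Gap run: collect every consecutive column with a gap in either row.
        ref_seg, smp_seg = "", ""
        while i < n and (ref_aln[i] == "-" or smp_aln[i] == "-"):
            if ref_aln[i] != "-":
                ref_seg += ref_aln[i]
            if smp_aln[i] != "-":
                smp_seg += smp_aln[i]
            i += 1

        # Emit one indel record anchored on the preceding reference base.
        # Runs before the first reference base have no anchor and are skipped.
        # Simplification: a substitution directly before an indel stays a
        # separate record at the same POS.
        if anchor is not None and ref_seg != smp_seg:
            variants.append((ref_pos, anchor + ref_seg, anchor + smp_seg))
        ref_pos += len(ref_seg)
    return variants


if __name__ == "__main__":
    print(variants_from_alignment("ACGT-ACGT", "ACCTTAC-T"))
```

Running this once per sample against the same reference gives a per-sample variant list. It does not normalize or left-align indels, so a tool such as `bcftools norm` may still be needed afterwards.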

I've never seen a VCF with genotypes for 16,000 samples in it, let alone tried to run vg on it. But the GBWT should store haplotypes in sub-linear space, so it might actually work.

If you aren't working with the genotypes but just the variable sites, vg probably ought to handle the VCF just fine as input.
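To make the "just the variable sites" case concrete, here is a small Python sketch that merges per-sample variant lists into a sites-only VCF, meaning a VCF with no FORMAT or sample columns. The function name, the (POS, REF, ALT) tuple layout, and the contig name `ref` are illustrative assumptions. The resulting file would probably still need to be bgzip-compressed and tabix-indexed before being handed to vg.

```python
from collections import defaultdict


def sites_only_vcf(contig, per_sample_variants):
    """Merge per-sample (POS, REF, ALT) tuples into sites-only VCF text.

    Hypothetical helper: alleles that share the same POS and REF are
    collapsed into one record with comma-separated ALT alleles.
    """
    alts = defaultdict(set)
    for variants in per_sample_variants:
        for pos, ref, alt in variants:
            alts[(pos, ref)].add(alt)

    lines = [
        "##fileformat=VCFv4.2",
        f"##contig=<ID={contig}>",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Records are written in increasing POS order, as tabix indexing expects.
    for pos, ref in sorted(alts):
        alt_field = ",".join(sorted(alts[(pos, ref)]))
        lines.append("\t".join([contig, str(pos), ".", ref, alt_field, ".", ".", "."]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    samples = [
        [(3, "G", "C")],
        [(3, "G", "T"), (6, "CG", "C")],
    ]
    print(sites_only_vcf("ref", samples), end="")
```

Because each site is stored once no matter how many samples carry it, the size of this file grows with the number of distinct variants rather than with the 16k sample count.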

jeizenga commented 2 years ago

If the sequences are mostly related by simple mutations (e.g. small indels, substitutions), you could also generate a multiple sequence alignment with an external tool and then use it to construct the graph with vg construct -M.
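A minimal Python sketch of that MSA route, assuming the alignment is in aligned-FASTA format: it checks that every row has the same length (a basic sanity check on any MSA) and then builds the `vg construct -M` command line. The helper names and the file name `aln.fa` are hypothetical, and only the `-M` option comes from this thread, so check `vg construct --help` for any format or output options your version needs.

```python
def read_aligned_fasta(text):
    """Parse aligned-FASTA text into {header: gapped_sequence}."""
    rows = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            rows[name] = []
        elif name is not None:
            rows[name].append(line)
    return {key: "".join(parts) for key, parts in rows.items()}


def check_msa(rows):
    """Return the alignment width, or raise if rows differ in length."""
    lengths = {len(seq) for seq in rows.values()}
    if len(lengths) != 1:
        raise ValueError("MSA rows must be non-empty and all the same length")
    return lengths.pop()


def vg_construct_msa_command(msa_path):
    """Build the argv for constructing a graph from an MSA file."""
    return ["vg", "construct", "-M", msa_path]


if __name__ == "__main__":
    msa_text = ">ref\nAC-GT\n>sample1\nACTGT\n"
    print(check_msa(read_aligned_fasta(msa_text)))
    print(" ".join(vg_construct_msa_command("aln.fa")))
    # To actually run it (requires vg on PATH), something like:
    #   import subprocess
    #   with open("graph.vg", "wb") as out:
    #       subprocess.run(vg_construct_msa_command("aln.fa"), stdout=out, check=True)
```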

glennhickey commented 2 years ago

> You could try making a graph with Minigraph, or Minigraph and Cactus together, though I don't know if that would scale enough either. @glennhickey might be able to guess.

minigraph-cactus will not scale to 16k sequences. If you have a guide tree, "regular" cactus should work fine, but then you wouldn't get a nice VCF.