Open sumit-walia opened 2 years ago
We used to support `vg msga` for making graphs from sequences, but it doesn't scale to that many sequences.
You could try making a graph with Minigraph, or Minigraph and Cactus together, though I don't know if that would scale enough either. @glennhickey might be able to guess.
You could also throw @ekg's PGGB tool at the problem.
If you want to go sequences -> VCF -> graph... I don't know what tool you would use there. You might want your own tool based on individual pairwise alignments against whatever you are using for your reference.
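For the "own tool" route, here is a rough sketch of turning one pairwise alignment against the reference into VCF-style records. It assumes you already have the two gapped alignment strings from whatever aligner you use. The names `variants_from_alignment` and `vcf_line` are made up for illustration and are not part of vg. It skips indels before the first reference base and ignores `N` in the query, which are simplifications you may not want.

```python
def variants_from_alignment(ref_aln, qry_aln, gap="-"):
    """Return (pos, ref, alt) tuples (1-based, VCF-style) from one gapped
    pairwise alignment of a query against the reference."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned strings must have equal length")
    ref_aln, qry_aln = ref_aln.upper(), qry_aln.upper()
    variants = []
    ref_pos = 0        # number of reference bases consumed so far
    anchor = None      # last reference base seen, used to anchor indels
    i, n = 0, len(ref_aln)
    while i < n:
        r, q = ref_aln[i], qry_aln[i]
        if r != gap and q != gap:
            ref_pos += 1
            anchor = r
            # Substitution; an N in the query is treated as "no call" (assumption).
            if r != q and q != "N":
                variants.append((ref_pos, r, q))
            i += 1
            continue
        # Collect the whole run of columns where either side has a gap.
        ref_seg, qry_seg = [], []
        while i < n and (ref_aln[i] == gap or qry_aln[i] == gap):
            if ref_aln[i] != gap:
                ref_seg.append(ref_aln[i])
            if qry_aln[i] != gap:
                qry_seg.append(qry_aln[i])
            i += 1
        # Simplification: indels before the first reference base are skipped.
        if anchor is not None and (ref_seg or qry_seg):
            variants.append(
                (ref_pos, anchor + "".join(ref_seg), anchor + "".join(qry_seg))
            )
        ref_pos += len(ref_seg)
        if ref_seg:
            anchor = ref_seg[-1]
    return variants


def vcf_line(chrom, pos, ref, alt):
    """Format one sites-only VCF data line (no genotype columns)."""
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."])


if __name__ == "__main__":
    # Toy alignment: one substitution, one insertion, one deletion.
    for pos, ref, alt in variants_from_alignment("ACGT-ACGT", "ACTTGA--T"):
        print(vcf_line("NC_045512.2", pos, ref, alt))
```

You would run this once per sequence, then merge and deduplicate the records across all sequences before writing a single sorted VCF. A real tool would also need to normalize the indels.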
I've never seen a VCF with genotypes for 16,000 samples in it, let alone tried to run vg on it. But the GBWT should store haplotypes in sub-linear space, so it might actually work.
If you aren't working with the genotypes but just the variable sites, vg probably ought to handle the VCF just fine as input.
If the sequences are mostly related by simple mutations (e.g. small indels, substitutions), you could also generate a multiple sequence alignment with an external tool and then use it to construct the graph with `vg construct -M`.
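Before handing an MSA to vg, it may be worth checking that it really is dominated by simple mutations. A minimal sketch, assuming the alignment is in aligned-FASTA format with `-` as the gap character. `read_fasta` and `summarize_msa` are hypothetical helper names, not part of any tool mentioned here.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, upper-casing the sequences."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {k: "".join(v) for k, v in records.items()}


def summarize_msa(records, gap="-"):
    """Count alignment columns that contain gaps and columns that contain
    more than one distinct base (gaps and N are ignored for the latter)."""
    rows = list(records.values())
    if not rows:
        raise ValueError("no sequences in alignment")
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all aligned sequences must have the same length")
    variable = gapped = 0
    for col in zip(*rows):
        symbols = set(col)
        if gap in symbols:
            gapped += 1
        if len(symbols - {gap, "N"}) > 1:
            variable += 1
    return {
        "sequences": len(rows),
        "columns": width,
        "variable_columns": variable,
        "gapped_columns": gapped,
    }


if __name__ == "__main__":
    toy = ">ref\nACGT-ACGT\n>s1\nACTTGA--T\n>s2\nACGT-ACGT\n"
    print(summarize_msa(read_fasta(toy)))
```

If a large share of columns are gapped, the sequences are probably not related by simple mutations alone, and the MSA route may be a poor fit.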
> You could try making a graph with Minigraph, or Minigraph and Cactus together, though I don't know if that would scale enough either. @glennhickey might be able to guess.
Minigraph-Cactus will not scale to 16k sequences. If you have a guide tree, "regular" Cactus should work fine, but then you wouldn't get a nice VCF.
I am trying to generate a variation graph (vg) from raw SARS-CoV-2 sequences (~16k sequences). What would be the best way to generate a VCF from these raw sequences? And does the method still work as the data scales up?