vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

How to combine individual sample VCFs #3565

Open Aannaw opened 2 years ago

Aannaw commented 2 years ago

Hello, I intend to use vg giraffe to align many Illumina paired-end short reads and then use vg call to determine the variants. However, I cannot find a way to combine the individual per-sample VCFs into a single VCF of all samples. Could you give me any suggestions?

glennhickey commented 2 years ago

Use bcftools merge. It is best to run vg call with -a before doing this, so that each per-sample VCF also contains reference genotypes.
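
A minimal sketch of that workflow, written as a Python helper that only builds the command lines and does not run them. The graph name `graph.xg`, the per-sample `NAME.pack` read-support files and the output names are assumptions for illustration, not part of the original answer. The `bgzip` and `tabix` steps are included because bcftools merge expects compressed, indexed inputs.

```python
import shlex


def build_merge_commands(samples, graph="graph.xg", merged="merged.vcf.gz"):
    """Return shell command strings for per-sample calling and merging.

    All file names are hypothetical placeholders: each sample NAME is
    assumed to already have a read-support file NAME.pack.
    """
    q = shlex.quote
    commands = []
    compressed = []
    for name in samples:
        vcf = f"{name}.vcf"
        # -a genotypes every snarl, so sites where this sample matches
        # the reference are still written to its VCF.
        commands.append(
            f"vg call {q(graph)} -k {q(name + '.pack')} -a -s {q(name)} > {q(vcf)}"
        )
        # bcftools merge needs bgzip-compressed, indexed inputs.
        commands.append(f"bgzip -f {q(vcf)}")
        commands.append(f"tabix -p vcf {q(vcf + '.gz')}")
        compressed.append(vcf + ".gz")
    # One merge over all per-sample files (at least two are expected).
    commands.append(
        f"bcftools merge -O z -o {q(merged)} " + " ".join(q(c) for c in compressed)
    )
    return commands


if __name__ == "__main__":
    for command in build_merge_commands(["sampleA", "sampleB"]):
        print(command)
```

Because every sample is called against the same graph with -a, the per-sample VCFs should describe the same sites, which is what lets the merge fill in reference genotypes rather than missing ones.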

Aannaw commented 2 years ago

Hello, thanks for your prompt reply! I have another question. I am confused about which VCF file to use as input to vg construct. I now have three: a VCF from aligning Illumina short reads to reference.fa, a VCF from aligning nanopore long reads to reference.fa, and a VCF from aligning several genome assemblies with mummer. I am not sure which one should be given to vg construct. I have read the description " In [vg construct](https://github.com/vgteam/vg/wiki/vg-construct), we take a VCF file and the reference sequence it was based on and generate a variation graph suitable for use by other tools in vg." from https://github.com/vgteam/vg/wiki/Construction, but it seems ambiguous and I am still not clear. Could you give me any suggestions? Thanks!

adamnovak commented 2 years ago

Giraffe will work best on a graph made from a VCF that has phasing information, and where no part of the graph is too tangled or extremely dense in overlapping variation.
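
One rough way to see whether a VCF carries phasing information is to look at the genotype separators: in the VCF format, `|` marks a phased genotype and `/` an unphased one. The helper below is a sketch under that assumption only; the function name is hypothetical, and it ignores missing and haploid calls.

```python
def phased_fraction(lines):
    """Return the fraction of diploid, non-missing GT calls that are phased.

    `lines` is any iterable of VCF text lines, for example an open file.
    """
    phased = 0
    total = 0
    for line in lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample columns on this record
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt_index = keys.index("GT")
        for sample in fields[9:]:
            parts = sample.split(":")
            if gt_index >= len(parts):
                continue
            genotype = parts[gt_index]
            if "." in genotype:
                continue  # missing call
            if "|" in genotype and "/" not in genotype:
                phased += 1
                total += 1
            elif "/" in genotype:
                total += 1
            # calls with no separator (haploid) are not counted
    return phased / total if total else 0.0
```

For a compressed file, something like `phased_fraction(gzip.open("calls.vcf.gz", "rt"))` would work after `import gzip`; the file name here is a placeholder.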

Exactly which graph to make depends on what kind of variation you are interested in. Giraffe assumes that most variation it is going to encounter is in the graph, and doesn't itself support finding split alignments. So if you want to be able to call larger-scale variation from Giraffe alignments, it needs to be in the graph. But, if it is in the graph, Giraffe can align short reads to it, the net result being that you ought to be able to genotype known structural variants in new samples using short read data that could not otherwise detect them.

You might actually be best off turning your mummer alignments into a GFA instead of a VCF, and using Giraffe with that. Giraffe works pretty well on assembly-based graphs, especially if you have a primary reference as a backbone to provide a known linear coordinate space.

If your nanopore-based VCF is actually full of called genotypes for samples, you could use that. If it's just a VCF describing how each individual read differs from the linear reference, it might not work very well.

Giraffe definitely works on VCFs called from Illumina data, especially if you can phase them somehow. But if the Illumina-based calling pipeline didn't call structural variants, Giraffe won't be able to see those structural variants either.
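
The two routes discussed above (a VCF on a linear reference, or a GFA built from assemblies) can be sketched as command lines for `vg autoindex` followed by `vg giraffe`. This is an illustrative helper that only builds the strings; the file names and the `index` prefix are assumptions, and the name of the `.giraffe.gbz` file that autoindex writes may differ between vg versions.

```python
import shlex


def build_giraffe_commands(reads, prefix="index", ref=None, vcf=None, gfa=None):
    """Return command strings to index a graph for Giraffe and map reads.

    Give either `gfa`, or both `ref` and `vcf`. All paths are
    hypothetical placeholders; nothing is executed here.
    """
    q = shlex.quote
    if gfa is not None:
        source = f"-g {q(gfa)}"
    elif ref is not None and vcf is not None:
        source = f"-r {q(ref)} -v {q(vcf)}"
    else:
        raise ValueError("provide gfa, or both ref and vcf")
    index_cmd = f"vg autoindex --workflow giraffe {source} -p {q(prefix)}"
    # Assumed output name of the autoindex step; check your vg version.
    gbz = f"{prefix}.giraffe.gbz"
    read_args = " ".join(f"-f {q(r)}" for r in reads)
    map_cmd = f"vg giraffe -Z {q(gbz)} {read_args} > sample.gam"
    return [index_cmd, map_cmd]


if __name__ == "__main__":
    for command in build_giraffe_commands(
        ["R1.fq.gz", "R2.fq.gz"], ref="reference.fa", vcf="phased.vcf.gz"
    ):
        print(command)
```

Passing `gfa="assemblies.gfa"` instead of `ref` and `vcf` gives the assembly-graph variant of the same two steps.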