vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Output VCF, variant information (SNPs, indels, SVs, etc.) #3679

Open alinehugo opened 2 years ago

alinehugo commented 2 years ago

1. What were you trying to do? Understand the output VCF of vg call

2. What did you want to happen? Analyse the VCF file: first, retrieve which kind of variant is present at each position from the INFO field, as in a "usual" VCF (for instance SVTYPE in this example).

3. What actually happened? There is no such information in the output VCF. Example of output:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  T004
Chr01       223     >15770416>8     AT      A       20.482  PASS    AT=>15770416>15770417>8,>15770416>8;DP=28       GT:DP:AD:GL:GQ:GP:XD:MAD        0/1:28:24,4:-7.56791,-6.00864,-53.1783:22:-1.10412:19.5098:4

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here: NONE

5. What data and command can the vg dev team use to make the problem happen?

I used the usual vg command pipeline: construct > giraffe > augment > snarls-pack > call

6. What does running vg version say?

vg version v1.40.0 "Suardi"
Compiled with g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 on Linux

JD12138 commented 2 years ago

"Analyse VCF file, first retrieve which kind of variant is present in each position from INFO field as ''usual'' VCF as in SVTYPE in this example"

I have the same question. How can I get the SVTYPE from the output VCF file of vg?

glennhickey commented 2 years ago

From the VCF spec:

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">

Value should be one of DEL, INS, DUP, INV, CNV, BND. This key can be derived from the REF/ALT fields but is
useful for filtering

The two issues for vg are:

But that said, I think you raise a fair point: we should at the very least provide scripts or suggested best practices for cleaning the VCFs and categorizing the SV calls, as we end up doing this ourselves too when analyzing them.
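As a rough illustration of the spec's remark that the type can be derived from the REF/ALT fields, here is a minimal Python sketch. It is not a vg script: the function name `classify_allele`, the returned labels, and the 50 bp cutoff for calling something an SV are all assumptions made for the example. It only handles alleles where one sequence is a prefix of the other; anything else is reported as `COMPLEX`.

```python
def classify_allele(ref, alt, sv_min_len=50):
    """Guess a variant type from one REF/ALT pair.

    Returns (label, length_change, is_sv). Labels and the 50 bp
    SV cutoff are illustrative assumptions, not vg conventions.
    """
    # Symbolic alleles (<DEL>), breakends and '*' carry no sequence to compare.
    if alt.startswith("<") or "[" in alt or "]" in alt or alt == "*":
        return ("UNK", 0, False)
    if ref == alt:
        return ("REF", 0, False)

    delta = len(alt) - len(ref)
    is_sv = abs(delta) >= sv_min_len

    if delta == 0:
        # Same length: single-base or multi-base substitution.
        return ("SNP" if len(ref) == 1 else "MNP", 0, False)
    if delta > 0 and alt.startswith(ref):
        # ALT is REF plus extra bases: a plain insertion.
        return ("INS", delta, is_sv)
    if delta < 0 and ref.startswith(alt):
        # REF is ALT plus extra bases: a plain deletion.
        return ("DEL", delta, is_sv)
    # Length changes but the alleles are not simply nested.
    return ("COMPLEX", delta, is_sv)


# The record from the original post: REF=AT, ALT=A
print(classify_allele("AT", "A"))
```

For the `Chr01:223` record above this reports a 1 bp deletion that falls below the SV cutoff, so it would count as a small indel rather than a structural variant.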

evcurran commented 1 year ago

Has there been any progress regarding best practices when it comes to populating the SVTYPE and possibly the SVLEN field in a VCF generated from a pangenome graph? It would be useful to be able to make comparisons to VCFs produced by sniffles, etc. Thanks!

sen1019san commented 1 year ago

Hi, @glennhickey. It is hard to understand the 'INFO' field in the vg call output. Users care most about variant information such as SV position, SV type, and SV ID, as found in the input VCF file for the autoindex-giraffe-pack-call workflow, so it would be helpful for vg call to output the raw SV information. I sincerely hope the vg team can address this problem.

evcurran commented 1 year ago

For anyone coming across this issue with the same problem, I have found the 'truvari' tool useful for populating the INFO field of pangenome-derived VCFs. Running 'truvari anno' (https://github.com/acenglish/truvari/wiki/anno) allows you to add the SVLEN and SVTYPE tags. However, it can only accurately label straightforward insertions and deletions; everything else is tagged as 'UNK', so this isn't a perfect solution. It would be great to be able to compare the output with a VCF derived from a tool such as sniffles.
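To make concrete what this kind of annotation does, here is a small Python sketch that adds SVTYPE and SVLEN to the INFO column of plain-text VCF lines. It is not truvari's implementation: the function names, the 50 bp minimum size, and the choice to leave multi-allelic and symbolic records untouched are assumptions for illustration. Like the behaviour described above, it labels only simple insertions and deletions and falls back to `UNK` for everything else.

```python
# Header lines for the two INFO keys (SVTYPE wording as quoted from the VCF spec above).
SV_HEADER = [
    '##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">',
    '##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">',
]


def annotate_line(line, min_len=50):
    """Return one VCF data line with SVTYPE/SVLEN appended to INFO.

    Header lines, multi-allelic records, symbolic alleles and variants
    shorter than min_len (an assumed cutoff) are returned unchanged.
    """
    line = line.rstrip("\n")
    if line.startswith("#"):
        return line
    fields = line.split("\t")
    ref, alt, info = fields[3], fields[4], fields[7]
    if "," in alt or alt.startswith("<"):
        return line
    delta = len(alt) - len(ref)
    if abs(delta) < min_len:
        return line
    if delta > 0 and alt.startswith(ref):
        svtype = "INS"
    elif delta < 0 and ref.startswith(alt):
        svtype = "DEL"
    else:
        svtype = "UNK"  # not a simple nested insertion or deletion
    tags = f"SVTYPE={svtype};SVLEN={delta}"
    fields[7] = tags if info in (".", "") else f"{info};{tags}"
    return "\t".join(fields)


def annotate_vcf(lines, min_len=50):
    """Annotate all lines, adding the INFO definitions before #CHROM."""
    out = []
    for line in lines:
        if line.startswith("#CHROM"):
            out.extend(SV_HEADER)
        out.append(annotate_line(line, min_len))
    return out
```

Multi-allelic sites are common in graph-derived VCFs, so in practice the records would probably need to be split into one ALT per line before a per-record SVTYPE makes sense.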

sen1019san commented 1 year ago

Hi, @evcurran. I solved the problem in a similar way. But the key problem is comparing the VCF generated by vg call with the original input VCF: I find it hard to compare these two VCF files, since the variant coordinates are different.
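One common reason for differing coordinates is that the same indel is written with different amounts of shared flanking sequence. The Python sketch below shows one possible way to reduce that effect: trim bases shared by REF and ALT, then compare calls approximately rather than by exact position. The function names, the 50 bp position window, and the 0.7 size-ratio threshold are arbitrary assumptions. It does not left-align variants against the reference and it does not compare allele sequences, so it is only a first-pass filter.

```python
def trim_alleles(pos, ref, alt):
    """Remove bases shared by REF and ALT, keeping at least one base in each."""
    # Trim the shared suffix first; this does not move POS.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim the shared prefix, advancing POS by one per trimmed base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def size_similarity(d1, d2):
    """Ratio of the smaller to the larger length change (0.0 if directions differ)."""
    if d1 == 0 and d2 == 0:
        return 1.0
    if d1 * d2 <= 0:
        return 0.0
    return min(abs(d1), abs(d2)) / max(abs(d1), abs(d2))


def roughly_same(a, b, max_dist=50, min_size_ratio=0.7):
    """Approximate match of two (chrom, pos, ref, alt) calls.

    Thresholds are illustrative assumptions, not recommended values.
    """
    chrom_a, pos_a, ref_a, alt_a = a
    chrom_b, pos_b, ref_b, alt_b = b
    if chrom_a != chrom_b:
        return False
    pos_a, ref_a, alt_a = trim_alleles(pos_a, ref_a, alt_a)
    pos_b, ref_b, alt_b = trim_alleles(pos_b, ref_b, alt_b)
    if abs(pos_a - pos_b) > max_dist:
        return False
    delta_a = len(alt_a) - len(ref_a)
    delta_b = len(alt_b) - len(ref_b)
    return size_similarity(delta_a, delta_b) >= min_size_ratio


# Two spellings of the same 2 bp deletion
call_from_graph = ("Chr01", 100, "GATTT", "GAT")
call_from_input = ("Chr01", 101, "ATT", "A")
print(roughly_same(call_from_graph, call_from_input))
```

Trimming alone will not reconcile calls that sit at different positions within a repeat; that requires left-aligning against the reference sequence, which this sketch leaves out.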

Donandrade commented 1 week ago

Hello everyone!

Has anyone here found a good solution to the issue?