vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.12k stars 194 forks source link

Higher vg call AD values after using vg filter #4443

Open danielwood1992 opened 1 week ago

danielwood1992 commented 1 week ago

1. What were you trying to do?

vg giraffe --output-format gam --gbz-name $vg.gbz --minimiser-name $autoindex.min --dist-name $autoindex.dist -f $R1 -f $R2 -t $threads $reference.fa $vcf.gz > $gam
vg filter $gam -r 0.9 -fu -m 1 -q 20 -D 999 -x $vg.gbz > $filter.gam
vg pack -t $threads -x $vg.xg -g $filter.gam -Q 5 -o $filter.gam.pack
vg call $vg.xg -t $threads -k $filter.gam.pack -v $vcf > $filter.gam.pack.vcf

vg pack -t $threads -x $vg.xg -g $gam -Q 5 -o $gam.pack
vg call $vg.xg -t $threads -k $gam.pack -v $vcf > $gam.pack.vcf

2. What did you want to happen?

Identify coverage of ref/alt alleles from the vcf call AD statistic - and looking at the impact of the vg filter/call step.

3. What actually happened?

Running vg filter on the gam reduced the size of the gam substantially. In most cases the AD for each site in $gam.pack.vcf vs. $filter.gam.pack.vcf made sense: the AD was higher for the calls from the unfiltered gam. However there a minority of sites (~2.5%) of cases where the AD for either the reference or alt allele is higher in the filtered gam than the same site in the unfiltered gam (which substantially alters the estimated ref/alt allele coverage - see example graph AF_Graph , blue points are where the ref or alt allele coverage is higher from the filtered vcf, red are where it is lower) - I don't understand how this can be the case, surely the filtering should only be removing reads? The graph is made from a biallelic vcf so it shouldn't be mutli-allele mapping that is the problem.

5. What data and command can the vg dev team use to make the problem happen? I can send the data and exact script I used if needed. Any thoughts on this much appreciated!

6. What does running vg version say?

vg version v1.48.0 "Gallipoli" Compiled with g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 on Linux Linked against libstd++ 20210601```

glennhickey commented 2 days ago

Yeah, that doesn't seem to make much sense. Are you able to share the gam and gbz, along with a site that's identical in the vcf but has higher AD in the filtered one? Thanks.