vgteam / vg

tools for working with genome variation graphs

https://biostars.org/tag/vg/

Other

1.1k stars 194 forks source link

vg call result question #3974

Open shimiao123 opened 1 year ago

shimiao123 commented 1 year ago

hi vg team:

 I have a pangenome graph，and use vg to call variants。command:

vg autoindex --workflow giraffe -r ref.fasta -v test.vcf -t 30 -p Ah -T ./tmp vg giraffe -t 60 -Z Ah.giraffe.gbz -m Ah.min -d Ah.dist -f test_R1.fq.gz -f test_R2.fq.gz -p > ICG10468.gam vg pack -t 60 -x Ah.giraffe.gbz -g ICG10468.gam -Q 5 -s 5 -o ICG10468.pack vg call Ah.giraffe.gbz -a -t 60 -k ICG10468.pack > ICG10468.vcf

I confused about the output vcf file. Some variant positions same between input test.vcf and output ICG10468.vcf, but some positions difference. Why? What happends? example: Some variant positions in input test.vcf file: chr01 2488160 ... chr01 2502400 ... chr01 2504303 ... Some variant positions in output ICG10468.vcf file: chr01 2488160 ... chr01 2502400 ... chr01 2504304 ... hope for your reply.

Bests

glennhickey commented 1 year ago

Yeah, some variants can move around during graph construction. Also, while calling, variants that are side-by-side may be called together, leading to sites with slightly different coordinates.

There is a way to genotype your input vcf with vg call, but I'm not sure it works with autoindex (@jeizenga would be nice to add this recipe).

You'd need to do

vg construct -a -r ref.fasta -v test.vcf > test.vg
vg index test.vg -L -x test.xg
vg index test.vg -v test.vcf -G test.gbwt
vg gbwt -x test.xg test.gbwt -g test.gbz --gbz-format
vg index test.gbz -j test.dist
vg minimizer test.gbz -d test.dist -o test.min

After that you'd run giraffe on the .gbz / .dist / .min files, and then pack and call using the .xg. When you run call, you can then pass in test.vcf with call -v and it will give you genotypes exactly on the input vcf.

shimiao123 commented 1 year ago

Thanks for your advise. Follow your advise, I ran by the command: step1：vg construct -a -t 60 -r ref.fasta -v test.vcf > test.vg step2：vg index test.vg -t 60 -L -x test.xg step3：vg index test.vg -t 60 -v test.vcf -G test.gbwt step 1 and 2 can normally run without any error，but step 3 with error like this: error: [HaplotypeIndexer::parse_vcf] variant file 'test.vcf' does not contain phasings.

My vcf file like this;

fileformat=VCFv4.2

contig=

......

contig=

FILTER=

INFO=

reference=STQ_genome.fasta

CHROM POS ID REF ALT QUAL FILTER INFO Sample

chr01 13093 . T TGGTGGCTACTCCAATGAAGATTTAATGATTATCATCATGTGAAGATACATCATTTTGACCATTGGATGATAGATTGTAGGGCTTGATTTGATTTGAAGTAAAAGTGTTACTTTTATTTAAAATGTTGCTAAATAAATAAGTTACACTTTTTGAACAAGGATCATCATGAGAAGATGTTTTTGCCATCTTCATGGGAGTAGTTGCCATAAAAAAT . . SVTYPE=PAV L542 L542,90632,90846,215 Please tell me what happends and what can I do for this?

glennhickey commented 1 year ago

Yeah, that pipeline requires the input variants be phased.

shimiao123 commented 1 year ago

So,for my data,how can I get the same variant positions between my input test.vcf and output ICG10468.vcf? I have 5 samples for calling variants, I need know which are the common shared variants, how can I get the result?