vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

need to repeat index before vg call #4138

Closed carolynzy closed 9 months ago

carolynzy commented 10 months ago

1. What were you trying to do? mapping short reads to and call SVs from reference + vcf graph.

2. What did you want to happen? vg call produce vcf file with SVs called.

3. What actually happened? give me warning message like :

[VCFTraversalFinder] Warning: No alt path (prefix=_alt_17ba68d909040790ad006577e49c117f93c46c0c_) found in graph for variant.  It will be ignored:

The output vcf file is empty with only headers. But if I repeat the index step again as:

vg index del1000.vg -L -x del1000.xg

vg call will work normally.

So I guess some step between vg index and vg call made some changes to the xg file.? I repeated each index step one by one and did vg call after each repeat. Then I found that after this step:

vg index -x del1000.xg -g del1000.gcsa -k 16 del1000.vg

the warnings appeared. And if I repeat the index step as stated above, the warnings dissappeared and vg call produced vcf file with SVs called.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

No. I didn't get any other error or warning message.

5. What data and command can the vg dev team use to make the problem happen?

SURVIVOR bedtovcf del1000.bed DEL del1000.vcf
bgzip del1000.vcf
tabix -p vcf del1000.vcf.gz
vg construct -r GCF_000195955.2_ASM19595v2_genomic.fna -v del1000.vcf.gz -S -f -p -i -a > del1000.vg
vg index del1000.vg -L -x del1000.xg
vg gbwt -x del1000.xg -P -g del1000.gbz --gbz-format
vg index del1000.xg -j del1000.dist
vg minimizer del1000.gbz -d del1000.dist -o del1000.min
vg snarls -T del1000.gbz > del1000.snarls 
vg index -x del1000.xg -g del1000.gcsa -k 16 del1000.vg
SAMPLE=2
vg giraffe -Z del1000.gbz -d del1000.dist -m del1000.min -f  420fastq/$SAMPLE'_1.fq.gz' -f 420fastq/$SAMPLE'_2.fq.gz' -o gam > $SAMPLE.del1000.gam
vg pack -x del1000.xg -g $SAMPLE.del1000.gam -o $SAMPLE.del1000.pack -Q 5
vg index del1000.vg -L -x del1000.xg # repeat 
vg call del1000.xg -v del1000.vcf.gz -r del1000.snarls -k $SAMPLE.del1000.pack -s $SAMPLE -d 1 -f GCF_000195955.2_ASM19595v2_genomic.fna > $SAMPLE.del1000.vcf

6. What does running vg version say?

vg version a97a233
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by carolynzy@beast

I'm not sure if I did something wrong or this is normal? Besides, each time when I use vg call, there is a lot of information, basically the variants in the graph printed out to the screen, no matter whether the command works or not. How could I stop this? It's a bit annoying but not very serious.

Thank you for your response!

Update: I just realized that if I use giraffe for mapping, I don't need to produce gcsa file so basically this issue shouldn't arise. However, I'm still curious what could trigger this problem. Thanks!

adamnovak commented 10 months ago

When you run:

vg index -x del1000.xg -g del1000.gcsa -k 16 del1000.vg

it will read del1000.vg and write out del1000.gcsa and del1000.xg.

When you run:

vg index del1000.vg -L -x del1000.xg

it will read del1000.vg and write out del1000.xg including any "alt paths" (that have names starting with _alt_). We exclude those from the .xg file unless you pass the -L option, because usually you don't need them and they can make the file large.

But for your workflow it seems you do need them, because vg call is going to need them when passed the -v option and a VCF. These "alt paths" are how vg call finds what parts of the graph came from what parts of the VCF, so it can restrict itself to genotyping the variants described in the VCF and ignore other ones. So you should ask for the "alt paths" to be included in your XG when you make the GCSA and XG files together, by adding -L to that command:

vg index -x del1000.xg -g del1000.gcsa -k 16 del1000.vg -L

Where did you get the commands you are running? Is there some documentation or example we should update that is missing the -L option but still shows running vg call with the -v option? Or did you add the -v option yourself and not get any good guidance that you would need an xg made with -L in order for that to work?

carolynzy commented 9 months ago

Thank you @adamnovak ! I didn't realize it would reproduce the xg file at this step. Now it makes sense to me.

I don't remember where exactly I got these commands because I have tried different ways and may have mixed commads from different sources. I'm still trying to understand the whole pipeline and basic logic of it, so I probably made some mistakes while trying so. If I found out later I will update here.