vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

error [vg call]: No reference paths found #4194

Closed YibinQiu closed 9 months ago

YibinQiu commented 9 months ago

1. What were you trying to do? I wish to manually construct and index the graph using vg construct -a and perform SV genotyping of an individual with second-generation sequencing data using vg call -v. I followed the commands as per the official vg documentation (https://github.com/vgteam/vg/wiki/SV-Genotyping-and-variant-calling):

#vg construct -a 
vg construct -f -a -S -r $ref -v convert_merge.vcf.gz -p > out.vg
vg index -t 64 out.vg -L -x out.xg -p --temp-dir ${LARGE_DISK_TMP}
vg index -t 64 out.vg -v convert_merge.vcf.gz -G out.gbwt -p --temp-dir ${LARGE_DISK_TMP}
vg gbwt -x out.xg out.gbwt -g out.gbz --gbz-format -p --temp-dir ${LARGE_DISK_TMP}
vg index -t 64 out.gbz -j out.dist -p --temp-dir ${LARGE_DISK_TMP}
vg minimizer -t 16 out.gbz -d out.dist -o out.min -p
vg snarls -t 64 out.xg >out.snarls

#vg giraffe pack and call -v
cp out.gbz ${sample}_out.gbz
cp out.dist ${sample}_out.dist
cp out.min ${sample}_out.min
cp out.gbwt ${sample}_out.gbwt
cp out.xg ${sample}_out.xg
cp out.snarls ${sample}_out.snarls
cp out.vg ${sample}_out.vg
vg giraffe -p -t 24 -Z ${sample}_out.gbz -d ${sample}_out.dist -m ${sample}_out.min -f ${sample}_1.fq.gz -f ${sample}_2.fq.gz > ${sample}.gam
vg pack -t 24 -x ${sample}_out.gbz -g ${sample}.gam -o ${sample}.pack -Q 5
vg call -t 24 ${sample}_out.gbz -r ${sample}_out.snarls -k ${sample}.pack -s ${sample} -v convert_merge.vcf.gz -f $ref > ${sample}.vcf

2. What did you want to happen? The SV information contained in the VCF file generated by the vg call step corresponds to the VCF input

3. What actually happened? When running vg call out.gbz, the following error is reported: error [vg call]: No reference paths found

When I replaced ${sample}_out.gbz with ${sample}_out.xg, vg call was able to run normally, but it displayed:

[VCFTraversalFinder] Warning: Site {'directed_acyclic_net_graph': true, 'end': {'node_id': '8705753'}, 'end_self_reachable': true, 'start': {'node_id': '8645904'}, 'start_end_reachable': true, 'start': {'node_id': '8645904'}, 'start_end_reachable': true, 'start_self_reachable': true} with 304 variants contains too many traversals (>50000) to enumerate so it will be skipped.

Following the suggestion #3950 , I also used :

vg call -t 24 ${sample}_out.gbz -r ${sample}_out.snarls -k ${sample}.pack -s ${sample} -a -z > ${sample}.vcf

but the result was the same, showing: error [vg call]: No reference paths found. So, I'm not sure which step I did wrong?

Additionally, I previously tried using vg autoindex to construct the graph, and performing vg call on out.gbz with 15X second-generation sequencing data only took about half an hour. Now, using vg construct -a and vg call out.xg, I find that the runtime of vg call is too long (~5 hours).

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

Place stacktrace here.

5. What data and command can the vg dev team use to make the problem happen?

vg call -t 24 out.gbz -r out.snarls -k ${sample}.pack -s ${sample} -v convert_merge.vcf.gz -f $ref > ${sample}.vcf

6. What does running vg version say?

vg version v1.53.0 "Valmontone"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by anovak@courtyard.gi.ucsc.edu
glennhickey commented 9 months ago

Can you please provide the output of vg paths -Mx out.gbz? It looks like your graph has now reference paths, which is strange since it comes from vg construct

YibinQiu commented 9 months ago

Can you please provide the output of vg paths -Mx out.gbz? It looks like your graph has now reference paths, which is strange since it comes from vg construct

@glennhickey Yes, the path of the reference (like chr1, chr2, chr3, etc.) seems to have been lost, leaving only the path information of the individuals, which is very strange. The output looks like this (I also uploaded the file):

image

test.txt

I also test vg paths -Mx out.vg. The output looks like this: image

Perhaps I should follow these steps to build gbz? https://github.com/vgteam/vg/wiki/VG-GBWT-Subcommand (I'm trying it, and the process is still ongoing): image

Now, I use out.xg to vg call, When I ran a single instance, it took 5 hours, but when I submitted 50 tasks simultaneously on Slurm, it has been running for 14 hours and the tasks are still not finished.

vg call -t 24 ${sample}_out.xg -r ${sample}_out.snarls -k ${sample}.pack -s ${sample} -a > ${sample}.vcf

If possible, I would prefer to use out.gbz instead of out.xg, as it should save a lot of time, right?

YibinQiu commented 9 months ago

I follow the steps to build gbz( https://github.com/vgteam/vg/wiki/VG-GBWT-Subcommand):

vg construct -f -a -S -r $ref -v convert_merge.vcf.gz -p > out.vg
vg gbwt -x out.vg -v convert_merge.vcf.gz -o out.gbwt -p --temp-dir ${LARGE_DISK_TMP}vg gbwt -x out.vg -E -o out.paths.gbwt -p --temp-dir ${LARGE_DISK_TMP}
vg gbwt -m out.gbwt out.paths.gbwt -o out.combined.gbwt -p --temp-dir ${LARGE_DISK_TMP}
vg gbwt -x out.vg out.combined.gbwt --gbz-format -g out.gbz -p --temp-dir ${LARGE_DISK_TMP}

and then try vg paths -Mx out.gbz again, the head of output looks like: image

but when I grep 'GENERIC', the output seems having reference paths: image

@glennhickey I would like to ask for your advices. Is it correct to do so for the above results? I run vg call -t 24 out.gbz -r ${sample}_out.snarls -k ${sample}.pack -s ${sample} -a -z > ${sample}.vcf, and it worked fine (~30 mins for one sample).

But when I use this out.gbz to vg call -v :

vg call -t 24 out.gbz -r ${sample}_out.snarls -k ${sample}.pack -s ${sample} -v convert_merge.vcf.gz -f $ref > ${sample}.vcf`

It output the empty vcf, the log file displayed:

[VCFTraversalFinder] Warning: No alt path (prefix=_alt_659226977429afe25da2e2b94abbb1a7516628ab_) found in graph for variant.  It will be ignored:
chr1    20973   515:INS:chr1_20974  TA  TGCCCTGGGCCCCGGCTGGCCCGTGCCCTGGCCCTG    60  PASS    SVTYPE=INS;SVMETHOD=SURVIVOR1.0.7;CHR2=chr1;END=20974;CIPOS=0,0;CIEND=20974,0;STRANDS=+-
[VCFTraversalFinder] Warning: No alt path (prefix=_alt_b0b0a41960b57218cbc66182eaeaad643ad8e6b9_) found in graph for variant.  It will be ignored:
chr1    21049   704:INS:chr1_21050  CT  CAAAGAATGTAGATTTTAACTCTTTAACTTCAGATTTCTAATTT    60  PASS    SVTYPE=INS;SVMETHOD=SURVIVOR1.0.7;CHR2=chr1;END=21050;CIPOS=0,0;CIEND=21050,0;STRANDS=+-
....
....
....

I have looked at several similar questions posed by others (#3963 #3974 #3950) and saw your answer; it seems that 'vg call -v' can only be used with xg.

glennhickey commented 9 months ago

A few things. If you want to use -v with vg call, you need to follow the instructions here.

It looks like you are doing this for the index, but this part here is important

You can now use test.gbz, test.dist and test.min as input for vg giraffe and test.xg as input for vg call.

Ie, you need to run vg call on the xg and vcf in order for it to work (even though you ran vg giraffe on the gbz).

If you do not want to pass in your vcf with -v then you can don't need to bother with this, and can just run call on the .gbz you used for giraffe (that you made here). To speed that up you can try vg call -z

YibinQiu commented 9 months ago

Thank you for the help with this