vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

VG autoindex not working with PGGB GFA #4302

Open KopalliV opened 1 month ago

KopalliV commented 1 month ago

I am trying to generate Giraffe indexes for a GFA file produced by PGGB, but I keep getting the following error.

Command used : vg autoindex -w giraffe --prefix pggb_real --gfa pggb.gfa

Error:

[vg autoindex] Executing command: vg autoindex -w giraffe --prefix pggb_real --gfa pggb.gfa
[IndexRegistry]: Checking for haplotype lines in GFA.
[IndexRegistry]: Constructing VG graph from GFA input.
[IndexRegistry]: Constructing XG graph from VG graph.
error[VPKG::load_one]: Correct input type not found while loading handlegraph::PathHandleGraph

Can somebody tell me whether I am doing something wrong here?

Vg version: v1.57.0-21-gdb574a520 "Franchini"

adamnovak commented 3 weeks ago

Hmm, that doesn't look like it ought to happen, nor does it look like it is happening at a stage where PGGB graphs are known to be difficult. Was a .vg graph produced? Or do you get one in a manually specified temp directory if you run vg autoindex --tmp-dir ./wherever ...? What does vg stats --format whatever.vg say? Or xxd whatever.vg | head -n10?

KopalliV commented 3 weeks ago

I was just about to post an update: it runs further when a --tmp-dir is specified, but it still does not generate all the indexes. It produces a GBZ file but gets killed while generating the distance index.

vg autoindex -w giraffe --prefix pggb_real --gfa pggb.gfa -T ../../temp/
[IndexRegistry]: Checking for haplotype lines in GFA.
[IndexRegistry]: Constructing VG graph from GFA input.
[IndexRegistry]: Constructing XG graph from VG graph.
[IndexRegistry]: Constructing a greedy path cover GBWT
[IndexRegistry]: Constructing GBZ using NamedNodeBackTranslation.
[IndexRegistry]: Constructing distance index for Giraffe.
Killed

adamnovak commented 3 weeks ago

That sounds like maybe it had run out of disk space the first time, and now it's running out of memory. You can try giving it more memory; 1 or 2 terabytes might be able to do it, if it can be done at all.
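
If disk space in the temporary directory is the suspect, it can be checked before launching the run. The following is a minimal Python sketch, not part of vg: the helper names free_gib and autoindex_command are hypothetical, and the only vg options it uses are the ones already shown in this thread.

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def autoindex_command(gfa, prefix, tmp_dir):
    """Build the argument list for the vg autoindex call used in this thread.

    Only options that appear in the thread are used; nothing else is assumed
    about the vg command line.
    """
    return [
        "vg", "autoindex",
        "-w", "giraffe",
        "--prefix", prefix,
        "--gfa", gfa,
        "--tmp-dir", tmp_dir,
    ]


if __name__ == "__main__":
    tmp_dir = "."  # hypothetical location; point this at the intended temp directory
    print(f"{free_gib(tmp_dir):.1f} GiB free in {tmp_dir}")
    print(" ".join(autoindex_command("pggb.gfa", "pggb_real", tmp_dir)))
```

This only reports free disk space. It says nothing about memory, which is the more likely limit once the run reaches the distance index.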

This is a known issue with PGGB graphs: they have very large individual "snarls" without a lot of internal structure, so the distance index needs to hold some quadratically large all-to-all distance matrices. We have a parameter (maybe only in the manual indexing pipeline?) that lets you control how big the biggest matrix we store is. Sometimes you can set it low enough to build the index, but then you get terrible runtime performance: when the distances aren't in the index, it has to traverse the graph at runtime to try to find them.
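
To get a feel for why quadratic growth is a problem, here is a rough back-of-the-envelope sketch in Python. It assumes a dense n-by-n matrix with 4 bytes per entry. Both the dense layout and the entry size are assumptions made for illustration, not a description of how vg actually stores its distance index, and matrix_bytes is a hypothetical helper.

```python
def matrix_bytes(n, bytes_per_entry=4):
    """Memory for a dense all-to-all distance matrix over n nodes.

    The 4-byte entry size is an illustrative assumption, not vg's real layout.
    """
    return n * n * bytes_per_entry


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000, 1_000_000):
        gib = matrix_bytes(n) / 2**30
        print(f"{n:>9,} nodes -> {gib:,.2f} GiB")
```

Under these assumptions, a tenfold increase in snarl size costs a hundredfold more memory: about 37 GiB for 100,000 nodes, and several terabytes for a million.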

The other solution is to aggressively prune the PGGB graphs with vg prune until enough complex regions have been flattened out that it can be indexed. But then you're not really working with the graph you want to work with.

Unfortunately, getting good performance on PGGB graphs will need some kind of new distance indexing technology that we have not yet invented.