pangenome / smoothxg

linearize and simplify variation graphs using blocked partial order alignment

GCSA on smoothxg GFAs #5

Open fbemm opened 4 years ago

fbemm commented 4 years ago

Dear Erik,

I am trying to GCSA index a graph from edyeet->seqwish->smoothxg->vg view->vg prune -r.

Input are 3 small genomes (<100Mb), each with around 40 "contigs". The graph has 190k segments and 240k edges.

Running vg index to generate the GCSA index produces very large tmp files (>2 TB) and, in practice, never finishes.

I am not sure where to start digging at the moment. I am trying to index the seqwish output now directly.

Bests, F

ekg commented 4 years ago

Yes, that makes sense. I think you're probably running across a lot of bubbles during the kmer generation. This is the basic flaw of the GCSA2 indexing strategy, at least as it's currently implemented. (We might simplify things for ourselves by just indexing the actual paths directly rather than the graph and its implied recombinations.)

It's worth trying to get this to work though. By decreasing the graph complexity (with pruning) and/or reducing the GCSA2 index kmer size, you can usually build the index.

I think you may need to use vg prune -u to "unfold" the reference paths in bubbles and decrease the overall complexity of the graph. @jltsiren would know.

I would also just try to index with a much smaller kmer size for the GCSA2 index. For instance:

vg index -x g.xg -g g.gcsa -k 11 -X 2

This would result in an 11 * 2^2 = 44-mer index. This should be faster to make. It'll be slightly worse for mapping, but not too much worse. Remember that these are just seeds for the mapping.
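
To make the arithmetic explicit: the final kmer length is the initial kmer size (-k) multiplied by 2 for each doubling step (-X). A minimal shell sketch of that calculation follows. The function name gcsa_kmer_length is made up for illustration and is not part of vg, and the 16/4 values are assumed to be the vg index defaults, so check vg index --help before relying on them.

```shell
# Hypothetical helper (not part of vg): effective GCSA2 kmer length.
# Each doubling step doubles the indexed kmer length, so the result is k * 2^X.
gcsa_kmer_length() {
    local k="$1" doublings="$2"
    echo $(( k * (1 << doublings) ))
}

gcsa_kmer_length 11 2   # the settings suggested above; prints 44
gcsa_kmer_length 16 4   # assumed vg index defaults; prints 256
```

Going from 256 down to 44 shortens the paths that have to be enumerated through each bubble, which is why the smaller setting should be cheaper to build.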

jltsiren commented 4 years ago

This looks like the "small graph with many paths" scenario in the wiki. vg prune -u should work here.

The vg prune -r approach is only appropriate for reference+VCF graphs. It first removes complex graph regions and then restores all nodes and edges used by paths. If the graph is based on multiple sequence alignment, vg prune -r probably won't do anything.
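
For reference, a rough sketch of what the vg prune -u route could look like. This is written from memory of the vg wiki rather than taken from this thread: the -m (node mapping output) option of vg prune and the -f (mapping input) option of vg index are assumptions to verify against vg prune --help and vg index --help for your vg version, and the function name and file names are placeholders.

```shell
# Hypothetical wrapper; nothing runs until the function is called.
# Usage: build_gcsa_unfolded <graph.vg> <output prefix>
build_gcsa_unfolded() {
    local graph="$1" prefix="$2"

    # Prune complex regions, then unfold the embedded paths back in (-u).
    # Unfolding duplicates nodes, so the duplicate-to-original node id mapping
    # is written to a file (-m, assumed flag).
    vg prune -u -m "${prefix}.mapping" "$graph" > "${prefix}.pruned.vg" || return 1

    # Build the GCSA index from the pruned graph, passing the mapping (-f,
    # assumed flag) so index hits refer to node ids in the original graph.
    vg index -g "${prefix}.gcsa" -f "${prefix}.mapping" "${prefix}.pruned.vg"
}
```

With a graph file called g.vg, this would be invoked as build_gcsa_unfolded g.vg g. The smaller -k and -X values suggested earlier in the thread could be added to the vg index call if the build is still too heavy.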