vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

GCSA indexing with GBWT-backed pruning is very slow on GRCh37 #1999

Open adamnovak opened 5 years ago

adamnovak commented 5 years ago

@cmarkello has some graphs that have taken over 9 days to GCSA index and are still going.

They were built with the GRCh37 1000 Genomes data, and he's using the GBWT haplotypes to prune the graph before GCSA indexing. He's running the whole operation with toil-vg, and he's only building the one ("pangenome") graph.

The graph shouldn't take this long to GCSA index.

jltsiren commented 5 years ago

GCSA indexing should not take that long. The expected failure modes are crashing due to corrupted data and running out of memory or disk space when the graph is too complex. Is the kmer generation still running, or has the indexing proceeded to actual GCSA construction?