vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

PathGraphBuilder::write(): Size limit exceeded, construction aborted #3339

Open brianabernathy opened 3 years ago

brianabernathy commented 3 years ago

I am trying to index a GFA constructed from 3 closely related melon sequences (for use with vg map) using vg autoindex. The GFA is ~2G, so not particularly large or complex. I'm running the job on a machine with 10+ TB of free disk and 1.5 TB of memory, and no other jobs are running on it. The relevant autoindex options are...

The command eventually fails with the following output written to stderr.

[IndexRegistry]: Constructing VG graph from GFA input.
[IndexRegistry]: Constructing XG graph from VG graph.
[IndexRegistry]: Pruning complex regions of VG to prepare for GCSA indexing.
Restored graph: 26465168 nodes
[IndexRegistry]: Constructing GCSA/LCP indexes.
InputGraph::InputGraph(): 1340047470 kmers in 1 file(s)
InputGraph::readKeys(): 451494100 unique keys
InputGraph::readFrom(): 1133715333 unique start nodes
GCSA::GCSA(): Prefix-doubling from path length 16
GCSA::GCSA(): Step 1 (path length 16 -> 32)
GCSA::GCSA(): Step 2 (path length 32 -> 64)
PathGraphBuilder::write(): Size limit exceeded, construction aborted

I've used benchmarking tools to check memory usage and found that ~768G is consumed before the failure, despite --target-mem 1500G being specified. I noticed the doc states:

-M, --target-mem MEM target max memory usage (not exact, formatted INT[kMG]) (default: 1/2 of available)

So I also tried running the command without the --target-mem option and with --target-mem 3000G. All attempts failed at the same processing step, with similar run time and memory usage.

$ vg version
vg version v1.33.0 "Moscona"
Compiled with g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 on Linux
Linked against libstd++ 20200808
Built by anovak@octagon

Are there other autoindex options or any manual indexing steps you'd recommend trying?

jltsiren commented 3 years ago

The error message indicates that the total size of temporary files exceeded the specified limit (default 2 TB). This is a safety feature, as GCSA construction could plausibly use petabytes of disk space before crashing. The limit can be adjusted in vg index but apparently not in vg autoindex.

You have a graph based on aligned sequences, but vg autoindex seems to prune it with the "restore paths" option. That option is only intended for VCF-based graphs, as it will undo any pruning in alignment-based graphs if the aligned sequences are present as paths. As a result, the number of kmers grows far too quickly with k and you will eventually run out of disk space and/or memory.

Manual index construction should work here: https://github.com/vgteam/vg/wiki/Index-Construction. You first convert the GFA to another graph type (HashGraph?) with vg convert and then follow the "small graph" / "complex graph" / "with many paths" branch.
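As a rough sketch of that manual route (not a tested recipe), the steps could look like the script below. The file names, the `run` and `manual_index` helpers, and the 4096 GB limit are placeholders made up for illustration. The flags (`vg convert -g -a`, `vg index -x`, `vg index -g`, `vg index -Z`) are written from memory of the v1.33-era help text and should be checked against `vg help` for your version. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: manual indexing of an alignment-based GFA for vg map.
# All file names and the size limit are hypothetical placeholders.
set -eu

GFA="${GFA:-graph.gfa}"                  # input GFA (placeholder name)
PREFIX="${PREFIX:-graph}"                # prefix for output files (placeholder)
SIZE_LIMIT_GB="${SIZE_LIMIT_GB:-4096}"   # temp-file cap for GCSA, assumed to be in GB
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands only, 0 = execute them

# Print a command line; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        eval "$*"
    fi
}

manual_index() {
    # 1. Convert the GFA to a HashGraph-backed graph file.
    run "vg convert -g '$GFA' -a > '$PREFIX.vg'"
    # 2. Build the XG index from the full, unpruned graph.
    run "vg index -x '$PREFIX.xg' '$PREFIX.vg'"
    # 3. Prune complex regions WITHOUT restoring paths, so the aligned
    #    sequences stored as paths do not undo the pruning.
    run "vg prune '$PREFIX.vg' > '$PREFIX.pruned.vg'"
    # 4. Build GCSA/LCP from the pruned graph, raising the temp-file limit.
    run "vg index -g '$PREFIX.gcsa' -Z '$SIZE_LIMIT_GB' '$PREFIX.pruned.vg'"
}

manual_index
```

The key difference from what vg autoindex appears to do is step 3: pruning is run without the "restore paths" option. Raising the `-Z` limit alone is unlikely to help if the kmer count keeps growing, so treat it as a secondary knob.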

brianabernathy commented 3 years ago

Thanks for the detailed diagnosis. I was able to index the graph manually. It seems that vg autoindex could benefit from an option to select sequence-based or VCF-based graph input. A temp disk limit similar to vg index -Z would be useful too.

Thanks again for your work on this project.