vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

vg view to convert to GFA uses a huge amount of memory #2718

Open ekg opened 4 years ago

ekg commented 4 years ago

I'm trying to get GFA output from a .vg file made by vg chunk.

It's not possible due to memory exhaustion (I'm on a puny system).

I can convert to PackedGraph format with vg convert:

vg convert -p SRR11267570.9kb.k16/chunk_0.vg >SRR11267570.9kb.k16/chunk_0.pg

This uses "only" 1G for a 162MB .vg file.

But, when I try to convert this to GFA I run out of memory:

vg view SRR11267570.9kb.k16/chunk_0.pg >SRR11267570.9kb.k16/chunk_0.gfa

How is this possible when the PackedGraph model is only 83MB? Something seems off with vg view. Maybe we should deprecate it and handle all conversion in vg convert.

jeizenga commented 4 years ago

It looks to me like we are converting it to an in-memory VG in order to use the graph_to_gfa algorithm, which is not handlified: https://github.com/vgteam/vg/blob/master/src/subcommand/view_main.cpp#L842

In addition, graph_to_gfa builds the entire text representation of the GFA in gfakluge before serializing it: https://github.com/vgteam/vg/blob/59336df5d686f727ada03e90b26cba2eaa9bc04f/src/gfa.cpp#L1077

The GFA format is pretty simple, so I don't see why we shouldn't be streaming it out from the HandleGraph.
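
To illustrate the streaming idea, here is a minimal sketch, not vg's actual code. ToyGraph, its Side struct, and stream_gfa are hypothetical names made up for this example; ToyGraph only stands in for a handle-based graph and is not the libhandlegraph API. The point is that each S and L record is written to the output stream as soon as it is visited, so no whole-file text representation is ever held in memory.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <ostream>
#include <sstream>   // std::ostringstream, handy for capturing output in tests
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a handle-based graph; NOT the libhandlegraph API.
struct ToyGraph {
    // One oriented end of an edge: a node id plus a strand flag.
    struct Side { std::int64_t id; bool reverse; };

    std::map<std::int64_t, std::string> nodes;   // node id -> sequence
    std::vector<std::pair<Side, Side>> edges;    // (from side, to side)

    // Visit every node without copying the graph.
    void for_each_node(
        const std::function<void(std::int64_t, const std::string&)>& fn) const {
        for (const auto& kv : nodes) fn(kv.first, kv.second);
    }

    // Visit every edge without copying the graph.
    void for_each_edge(
        const std::function<void(const Side&, const Side&)>& fn) const {
        for (const auto& e : edges) fn(e.first, e.second);
    }
};

// Write GFA 1.0 records one at a time as the graph is traversed.
// Paths (P lines) are left out to keep the sketch short.
void stream_gfa(const ToyGraph& g, std::ostream& out) {
    out << "H\tVN:Z:1.0\n";
    g.for_each_node([&](std::int64_t id, const std::string& seq) {
        out << "S\t" << id << '\t' << seq << '\n';
    });
    g.for_each_edge([&](const ToyGraph::Side& from, const ToyGraph::Side& to) {
        // Both endpoints of an edge must be nodes of the graph.
        assert(g.nodes.count(from.id) && g.nodes.count(to.id));
        out << "L\t" << from.id << '\t' << (from.reverse ? '-' : '+') << '\t'
            << to.id << '\t' << (to.reverse ? '-' : '+') << "\t0M\n";
    });
}
```

Calling stream_gfa(g, std::cout) would send the records straight to standard output. In vg itself, the same loop would presumably be written against the HandleGraph interface (iterating handles and following their edges) rather than against a toy container like this one.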