treangenlab / komb

KOMB is a tool for fast identification of unitigs of interest in metagenomes. KOMB introduces the concept of a Hybrid Unitig Graph (an extension of compacted de Bruijn graphs) and relies on k-core and k-truss decomposition algorithms.
GNU General Public License v2.0

Replacing igraph with own graph implementation #11

Open nsapoval opened 1 year ago

nsapoval commented 1 year ago

Currently, KOMB relies on igraph for k-core decomposition. Hence, although all the adjacency information needed for graph construction is already computed in the functions Kgraph::readSAM, Kgraph::getEdgeInfo, and Kgraph::generateGraph, we still need to create and initialize an igraph graph object. This step takes a significant amount of time, and the resulting graph data structure requires a large amount of RAM. This is limiting in the following ways:

  1. The large RAM usage prevents KOMB from being run on extremely large datasets (e.g. whole human genome data, such as the HG002 300x Illumina data from the GIAB consortium).
  2. Converting information that is already stored into a different object is redundant and comes at a large runtime cost. A current KOMB run on HG002 chr11 spends half of its total processing time initializing the igraph object.
  3. igraph currently implements only serial graph construction and serial k-core decomposition. Recent work has shown promising parallelization options for k-core decomposition, so it may be useful to implement these algorithms as an alternative for very large graphs.

Thus, it would be a good idea to implement an alternative graph data structure that is optimized for KOMB and supports parallel construction and parallel k-core decomposition.
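For concreteness, here is a minimal sketch (not KOMB code; the names CsrGraph, build_csr, and core_numbers are illustrative) of the kind of structure this could take: a CSR adjacency array built from the edge list, plus a bucket-based k-core peeling in the style of Batagelj and Zaveršnik. The two per-edge passes of the CSR build are independent and are the natural place to parallelize; a parallel k-core algorithm would replace the serial peeling loop.

```cpp
// Sketch only: CSR adjacency built from an edge list, followed by
// bucket-based k-core peeling. Names and layout are illustrative.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

struct CsrGraph {
    std::vector<std::uint64_t> offsets; // size V + 1
    std::vector<std::uint32_t> edges;   // size 2 * E (both directions stored)
};

// Build CSR from an undirected edge list. The counting and filling passes
// touch each edge independently, so they are natural candidates for
// parallelization (e.g. OpenMP) on very large inputs.
CsrGraph build_csr(std::uint32_t n_vertices,
                   const std::vector<std::pair<std::uint32_t, std::uint32_t>>& edge_list) {
    CsrGraph g;
    g.offsets.assign(n_vertices + 1, 0);
    for (auto [u, v] : edge_list) { ++g.offsets[u + 1]; ++g.offsets[v + 1]; }
    for (std::uint32_t i = 0; i < n_vertices; ++i) g.offsets[i + 1] += g.offsets[i];
    g.edges.resize(g.offsets[n_vertices]);
    std::vector<std::uint64_t> cursor(g.offsets.begin(), g.offsets.end() - 1);
    for (auto [u, v] : edge_list) {
        g.edges[cursor[u]++] = v;
        g.edges[cursor[v]++] = u;
    }
    return g;
}

// Serial k-core decomposition (Batagelj & Zaversnik): repeatedly peel a
// minimum-degree vertex. Returns the core number of every vertex.
std::vector<std::uint32_t> core_numbers(const CsrGraph& g) {
    const std::uint32_t n = static_cast<std::uint32_t>(g.offsets.size()) - 1;
    std::vector<std::uint32_t> deg(n), core(n);
    std::uint32_t max_deg = 0;
    for (std::uint32_t v = 0; v < n; ++v) {
        deg[v] = static_cast<std::uint32_t>(g.offsets[v + 1] - g.offsets[v]);
        max_deg = std::max(max_deg, deg[v]);
    }
    // Bucket-sort vertices by degree; bin[d] is the start of bucket d in `order`.
    std::vector<std::uint32_t> bin(max_deg + 2, 0), pos(n), order(n);
    for (std::uint32_t v = 0; v < n; ++v) ++bin[deg[v] + 1];
    for (std::uint32_t d = 0; d <= max_deg; ++d) bin[d + 1] += bin[d];
    {
        std::vector<std::uint32_t> cur(bin.begin(), bin.end() - 1);
        for (std::uint32_t v = 0; v < n; ++v) { pos[v] = cur[deg[v]]++; order[pos[v]] = v; }
    }
    // Peel in nondecreasing degree order, decrementing unpeeled neighbours.
    for (std::uint32_t i = 0; i < n; ++i) {
        const std::uint32_t v = order[i];
        core[v] = deg[v];
        for (std::uint64_t e = g.offsets[v]; e < g.offsets[v + 1]; ++e) {
            const std::uint32_t u = g.edges[e];
            if (deg[u] > deg[v]) {
                // Move u to the front of its degree bucket, then shrink its degree.
                const std::uint32_t du = deg[u], pu = pos[u];
                const std::uint32_t pw = bin[du], w = order[pw];
                std::swap(order[pu], order[pw]);
                std::swap(pos[u], pos[w]);
                ++bin[du];
                --deg[u];
            }
        }
    }
    return core;
}

int main() {
    // Tiny example: the complete graph K4; every vertex has core number 3.
    std::vector<std::pair<std::uint32_t, std::uint32_t>> edges =
        {{0, 1}, {0, 2}, {0, 3}, {1, 2}, {1, 3}, {2, 3}};
    const CsrGraph g = build_csr(4, edges);
    for (std::uint32_t c : core_numbers(g)) std::cout << c << ' ';
    std::cout << '\n';  // prints: 3 3 3 3
}
```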

szhorvat commented 1 year ago

I thought I'd add some clarifications so you know what memory usage you can expect from igraph.

Small additional performance-related comments:

> Recent work showed promising parallelization options for k-core decomposition

Performance-related contributions to igraph are of course always welcome.

nsapoval commented 1 year ago

@szhorvat thanks for these pointers and for catching some memory leaks. We'll aim to address those soon, and hopefully that will bring additional performance improvements.

I have a quick follow-up question on the VECTOR macro in the context of graph construction. Is there a way to initialize the graph data structure by allocating the appropriate memory for the vertex and edge arrays, and then fill in the values in parallel? Specifically, say we have a graph with V = {0, 1, 2, 3} and E = {(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)}. I'd like to be able to allocate space for E and then write the {(0, 1), (0, 2), (0, 3)} and {(1, 2), (1, 3), (2, 3)} blocks in parallel. It appears to me that the VECTOR macro should allow for such an operation, but I am not certain how to split graph initialization in igraph into a pure memory allocation step and a separate parallel fill step.
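Concretely, the most I can see how to do right now is something like the sketch below (assuming igraph 0.10's integer-vector C API; the OpenMP pragma is purely illustrative): the flat edge vector is allocated once and filled in independent blocks through VECTOR, but the igraph_create call that actually builds the graph is still a single serial step. What I am asking is whether that last call can itself be split into an allocation step and a parallel fill step.

```cpp
// Sketch only: parallel fill of the edge vector via VECTOR(), followed by a
// serial igraph_create(). Assumes igraph 0.10's integer-vector C API.
#include <igraph.h>

int main() {
    // V = {0,1,2,3}, E = {(0,1), (0,2), (0,3), (1,2), (1,3), (2,3)}
    const igraph_integer_t n_vertices = 4;
    const igraph_integer_t n_edges = 6;
    const igraph_integer_t from[] = {0, 0, 0, 1, 1, 2};
    const igraph_integer_t to[]   = {1, 2, 3, 2, 3, 3};

    igraph_vector_int_t edges;
    igraph_vector_int_init(&edges, 2 * n_edges);   // one allocation up front

    // Filling a flat vector is trivially parallel: each iteration writes a
    // disjoint pair of slots.
    #pragma omp parallel for
    for (igraph_integer_t i = 0; i < n_edges; i++) {
        VECTOR(edges)[2 * i]     = from[i];
        VECTOR(edges)[2 * i + 1] = to[i];
    }

    // The graph construction itself remains a single serial call.
    igraph_t graph;
    igraph_create(&graph, &edges, n_vertices, IGRAPH_UNDIRECTED);

    igraph_vector_int_destroy(&edges);
    igraph_destroy(&graph);
    return 0;
}
```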

szhorvat commented 1 year ago

No, this is not possible, and I doubt that parallel copying would save much time. The bottleneck is not copying, but constructing the lookup tables that make efficient retrieval of adjacent vertices and incident edges possible. This includes sorting indices for efficient lookup (binary search).

I am surprised that you say that graph construction takes noticeable time compared to k-core decomposition, but I don't think we have benchmarks for graph construction at the moment. It would be worth adding some.

nsapoval commented 1 year ago

Ah, I see, that makes more sense; I was quite surprised to discover that the construction takes a while. If there is more post-processing being done, that could at least partially explain it.

I also agree that the construction being slow compared to the analysis is rather peculiar.

At the moment I am traveling and my bandwidth for implementation and testing work is limited, but I'd be more than happy to reconnect on this in August. In particular, I think some additional benchmarking on our side would help pinpoint the bottlenecks in our code and their driving causes. In the meantime, the ballpark sizes of the graphs where I noticed degraded performance are on the order of 1 to 5 million vertices with around 300 to 600 million edges. Core-structure-wise, the graphs tend to have the bulk of their vertices in low k-shells (k <= 5).

szhorvat commented 1 year ago

> I also agree that the construction being slow compared to the analysis is rather peculiar.

If this is really true (i.e. you benchmarked it), it is possible that what's actually slow is setting the string attribute, not the actual construction of the graph data structure. Each string, i.e. each vertex name, is allocated with a malloc call, and malloc is slow.

This is just a guess. It'll be good to look into this. Thanks for pointing out this potential problem.
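To make the suspected pattern concrete, here is a small sketch (assuming igraph 0.10's C attribute interface; the "unitig_%lld" name format is made up): the per-vertex SETVAS loop performs one small heap allocation per name, and for millions of vertices that loop, rather than igraph_create, can end up dominating. Keeping the names outside igraph, e.g. in a plain array indexed by vertex ID, avoids those allocations entirely.

```cpp
// Sketch only: the per-vertex string attribute pattern that may be the real
// cost. Assumes igraph 0.10's C attribute interface; the name format is made up.
#include <igraph.h>
#include <cstdio>

int main() {
    igraph_set_attribute_table(&igraph_cattribute_table);  // enable C attributes

    igraph_vector_int_t edges;
    igraph_vector_int_init(&edges, 0);
    igraph_t graph;
    igraph_create(&graph, &edges, /* n = */ 4, IGRAPH_UNDIRECTED);

    // One small allocation per vertex name: for millions of vertices this
    // loop, not igraph_create(), may be what takes the time.
    char name[32];
    for (igraph_integer_t v = 0; v < igraph_vcount(&graph); v++) {
        std::snprintf(name, sizeof name, "unitig_%lld", (long long) v);
        SETVAS(&graph, "name", v, name);  // igraph stores its own copy of the string
    }

    igraph_vector_int_destroy(&edges);
    igraph_destroy(&graph);
    return 0;
}
```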

I'll also be mostly unavailable until September.