pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License

Exploring potential opportunities to accelerate 2D Layout step #415

Closed tonyjie closed 1 year ago

tonyjie commented 2 years ago

Hi, I'm a Cornell ECE Ph.D. student participating in a collaboration project with Erik. As far as I know, odgi layout is the most time-consuming step among all the tools provided in odgi. I tried it on this Human chromosome 8 graph, and the estimated time is more than 10 hours.

Currently I'm looking into the code to see if there are opportunities to accelerate it. The current implementation is Path-Guided SGD running on multi-threaded CPUs, a modified version of the algorithm in the original paper. The method is based on multidimensional scaling, which needs to compare the difference between the geometric distance and the graph-theoretic shortest path distance (in the path-guided setting, the positional difference within one path).
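For reference, here is a minimal sketch of how I understand a single update term in SGD-based MDS (following the graph drawing by SGD paper; the names and the capping of the learning rate are my assumptions, not the actual odgi code):

```cpp
#include <algorithm>
#include <cmath>

struct Coords { double x, y; };

// One SGD-MDS update term: nudge two nodes that co-occur on a path so
// that their distance in the layout approaches their distance along
// the path (the path-guided analogue of the shortest path distance).
void sgd_update(Coords& a, Coords& b, double path_dist, double eta) {
    double dx = a.x - b.x;
    double dy = a.y - b.y;
    double mag = std::sqrt(dx * dx + dy * dy) + 1e-9; // avoid div by zero
    double w = 1.0 / (path_dist * path_dist);  // MDS weight w_ij = d^-2
    double mu = std::min(w * eta, 1.0);        // cap so one term can't overshoot
    double delta = mu * (mag - path_dist) / 2.0;
    double rx = delta * dx / mag, ry = delta * dy / mag;
    a.x -= rx; a.y -= ry;  // move the pair closer together or further apart
    b.x += rx; b.y += ry;
}
```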

After surveying the literature on graph drawing, I found there are some other popular methods. I'm wondering if you have tried them. If you have, how are the speed and the quality? If not, are there major difficulties in implementing them (or were you simply not aware of them), or do they not suit this problem?

The method I have in mind is force-directed layout, which, from my understanding, is a different family of methods for the graph drawing problem than multidimensional scaling (the current method's category). There is a lot of work on its variants and on parallel implementations. Gephi is a well-known visualization platform that uses the ForceAtlas2 algorithm; it seems general and easy to use. Have you tried visualizing pangenome graphs with it? How is the speed?
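For contrast, a toy spring-electrical iteration in the force-directed family looks roughly like this (a generic Fruchterman-Reingold-style sketch I wrote for illustration, not ForceAtlas2's exact force model):

```cpp
#include <cmath>
#include <utility>
#include <vector>

struct P { double x, y; };

// One naive force-directed iteration: O(n^2) repulsion between all node
// pairs plus attraction along edges. The quadratic all-pairs term is what
// makes this family hard to scale to multi-million-node graphs without
// approximations such as Barnes-Hut.
void force_step(std::vector<P>& pos,
                const std::vector<std::pair<int, int>>& edges,
                double k, double step) {
    std::vector<P> disp(pos.size(), P{0.0, 0.0});
    for (size_t i = 0; i < pos.size(); ++i) {        // repulsion: all pairs
        for (size_t j = i + 1; j < pos.size(); ++j) {
            double dx = pos[i].x - pos[j].x, dy = pos[i].y - pos[j].y;
            double d2 = dx * dx + dy * dy + 1e-9;
            double f = k * k / d2;                   // force ~ k^2 / d
            disp[i].x += f * dx; disp[i].y += f * dy;
            disp[j].x -= f * dx; disp[j].y -= f * dy;
        }
    }
    for (const auto& e : edges) {                    // attraction: edges only
        double dx = pos[e.first].x - pos[e.second].x;
        double dy = pos[e.first].y - pos[e.second].y;
        double f = 1.0 / k;                          // linear spring along edge
        disp[e.first].x  -= f * dx; disp[e.first].y  -= f * dy;
        disp[e.second].x += f * dx; disp[e.second].y += f * dy;
    }
    for (size_t i = 0; i < pos.size(); ++i) {        // apply one cooled step
        pos[i].x += step * disp[i].x;
        pos[i].y += step * disp[i].y;
    }
}
```

The all-pairs repulsion loop is exactly the part that GPU and Barnes-Hut variants try to tame.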

I found there is a lot of parallelism that might be exploited in the current algorithm. But I think it's better to approach the problem starting from the algorithm level, so I'm asking whether you have tried other kinds of graph drawing methods here. I would be very pleased if you could answer! :)

Another question is about the metric for graph layout: how do we evaluate the quality of a visualization? The graph drawing by SGD paper uses stress, and the odgi layout code seems to use similar metrics. However, is it possible that the layout is already good enough at an early stage, far before we reach the stress threshold we set? Also, the number of iterations is set proportional to the number of nodes (steps) in the graph, which becomes huge when scaling up and leads to a huge time cost. Have you checked whether it really needs this many iterations?
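To be concrete about the metric: by stress I mean the usual MDS objective, the weighted sum over node pairs of the squared gap between layout distance and target distance. A direct sketch of evaluating it over sampled pairs (my code, not whatever odgi stats computes, which I understand is a related path-based divergence):

```cpp
#include <cmath>
#include <tuple>
#include <vector>

// Textbook stress over a sample of node pairs: the weighted squared
// difference between the layout distance and the target (path) distance.
// Watching this value plateau could justify stopping far earlier than a
// node-count-proportional iteration schedule.
double stress(const std::vector<double>& X, const std::vector<double>& Y,
              const std::vector<std::tuple<int, int, double>>& pairs) {
    double s = 0.0;
    for (const auto& [i, j, d] : pairs) {
        double dx = X[i] - X[j], dy = Y[i] - Y[j];
        double layout_d = std::sqrt(dx * dx + dy * dy);
        double w = 1.0 / (d * d);  // standard MDS weighting w_ij = d^-2
        s += w * (layout_d - d) * (layout_d - d);
    }
    return s;
}
```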

As I'm kind of new to genomics and this kind of visualization task, these questions concern me a lot. It would help a lot if you could answer them!

ekg commented 2 years ago

I tried every easily-adaptable graph layout method available prior to implementing the path guided SGD algorithm. I found them universally unworkable for pangenome graphs. Either the runtime to convergence was unacceptable, or the scaling was quadratic (or worse) in the number of graph nodes, or the results were poor (did not clearly reflect graph topology or path structures).

We are working up to a paper on this method where we will need to make comparisons that will clarify these issues. It would be great to attempt to run some of these methods on current chromosome scale graphs made by pggb.

The current settings of the path guided SGD layout algorithm are designed to optimize for several features. The layout should reflect the paths in the graph, preserving long range (million node order) correlations and arrangements at a structural level. It should also provide accurate descriptions of single base pair (character) nodes and their relationships. In other words we need to be able to load a meganode graph into gfaestus and zoom across seven or eight orders of magnitude while seeing reasonable structures at every scale. SNPs and small indels should be clear from the layout, as well as megabase order inversions and segmental duplications (loops).

While these objectives are satisfied, the current algorithm parameters are probably overly conservative, leading to the long runtime. I also suspect that the convergence condition could be improved. Moreover, the sampling pattern that drives the model update steps is probably not sampling nodes uniformly enough. We are sampling relative to paths, which leads to rare nodes being sampled infrequently. To lay these out correctly we then have to run many iterations. This has some advantages for visualization and analysis, because the layout reflects a mean model of the genomes in the graph, but it causes slower runtime because we do a lot of redundant work on common nodes. So there are easy ways to improve.
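To make the sampling issue concrete, a sketch (an illustration of the idea, not the exact odgi code): if the nodes in each update term are drawn by picking path steps uniformly at random, then a node's chance of being updated is proportional to the number of path steps covering it, so nodes private to one haplotype are touched rarely:

```cpp
#include <random>
#include <vector>

// Sampling sketch: draw one path step uniformly over all steps in all
// paths and map it to its node. A node traversed by many paths ("common")
// is selected proportionally often; a node on a single rare path is
// selected rarely, so it needs many more total iterations to settle.
int sample_node(const std::vector<int>& step_to_node, std::mt19937& rng) {
    std::uniform_int_distribution<size_t> pick(0, step_to_node.size() - 1);
    return step_to_node[pick(rng)];  // P(node) proportional to covering steps
}
```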

To be certain that this approach makes sense, a few tests with other layout algorithms would suffice. We have a metric similar to stress that measures the divergence between path positions and the graph layout (in odgi stats). This is something easy to use and already implemented.

A key feature that is not present in other methods is the guidance of the layout using paths in the graph. To my knowledge, all other layout algorithms are based solely on graph topology. The path information gives us a boost by simplifying what we need to consider. It is also the critical element of a pangenome graph.

This is a lot, but hopefully it can get the conversation started.

subwaystation commented 2 years ago

Hi @tonyjie,

The method presented in the original paper was not able to satisfy our requirements. Therefore, we made a number of modifications.

In general, we need a layout algorithm that works in 1D and 2D, and that respects and uses the biological information given in the paths. Apart from the Path-Guided Stochastic Gradient Descent algorithm implemented here, I am not aware of alternatives. If you are, please share! Bandage is able to visualize outputs from e.g. minigraph, because a minigraph graph's nodes are at least 50 base pairs long. However, our variation graphs are precise down to the base pair level; Bandage can't scale to that.

Taking a very brief look at the literature, the default ForceAtlas2 algorithm in Gephi does not seem to scale to our requirements. Most graphs used in comparisons of force-directed layout algorithms are tiny compared to what we have. For example, from the ForceAtlas2 paper: "Even if some of the networks are large (23,133 nodes and 186,936 edges) while others are very small (5 nodes and 5 edges)". https://www.hindawi.com/journals/abi/2017/1278932/ claims that Gephi "can easily render networks up to 300,000 nodes and 1,000,000 edges. Compared to other tools, it comes with a very efficient multithreading scheme". That is still a small graph by our standards.

A GPU-based implementation of ForceAtlas2 shows "overall speedup factors between 40x and 123x compared to existing CPU implementations. In practice, this means that a network with 4 million nodes and 120 million edges can be visualized in 14 minutes rather than 9 hours." In that light, the ~10h for your chr8 pangenome graph with 4,643,780 nodes and 6,468,759 edges doesn't look slow compared with CPU-based ForceAtlas2 implementations. As I just discovered https://github.com/govertb/GPUGraphLayout, I will find out whether we can plug in our graphs and produce a reasonable layout within a reasonable time. However, as Erik mentioned before, such an algorithm will just ignore our paths.

So, for a CPU-based implementation, the PG-SGD doesn't seem slow given the (very limited) literature I listed above. However, a GPU implementation or a modified Apache Spark GraphX version could be a scaling solution here. This would open doors to other dimensions.

If you have tips or ideas to improve the algorithm, please share!

tonyjie commented 2 years ago

Thanks for your comment and detailed explanation! I see that you have already tried a lot of graph layout algorithms and in the end implemented PG-SGD with a series of optimizations. The path structure is especially important, both for time and memory efficiency and for achieving the biology-specific goals of the layout results.

I'm also thinking about a GPU implementation of PG-SGD. The 40x and 123x speedups reported for the GPU-based implementation of ForceAtlas2 also make this plan look promising.

I think we can try:

What do you think about it?

Some other detailed questions here:

> Utilizing the Hogwild! approach, the algorithm scales by the given number of threads. This enables us to calculate the layouts of extremely large graphs.

I think the analogy to Hogwild! here is that multiple CPU threads compute the terms and update the coordinates (the analogue of a deep learning model's parameters) in parallel. But the implementation uses std::atomic to load / store the coordinates atomically. So is it still using locks rather than being lock-free? Could you explain the consideration here? Is there any "undefined behavior" happening if we remove the lock? Because, at the concept level, a lock-free implementation should allow multiple threads to load / store these coordinates at the same time.

Please correct me if I'm wrong!

[Edited] To correct myself about Hogwild!: the asynchronous coordinate update means the (X, Y) update step can overlap with the (dx, dy) calculations. Hogwild! allows computing (dx, dy) based on the previous (X0, Y0) and applying it as an update on top of the new (X1, Y1).

I guess std::atomic is required for multi-thread implementations to avoid undefined behavior?
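To spell out my mental model of the access pattern (a sketch with names I made up, not the actual odgi code): each coordinate is an atomic, threads use plain .load() / .store(), and nothing protects the pair as a whole, which is exactly the Hogwild! bet that overlapping updates are rare and harmless:

```cpp
#include <atomic>
#include <vector>

// Hogwild!-style update as I picture it: nothing protects the (x, y)
// pair, so another thread may store between our load and our store and
// an update can be lost. Each individual load/store is still atomic,
// so we can never observe a half-written (torn) double.
void hogwild_update(std::vector<std::atomic<double>>& X,
                    std::vector<std::atomic<double>>& Y,
                    int i, double rx, double ry) {
    double x = X[i].load();   // may be stale by the time we store below
    double y = Y[i].load();
    X[i].store(x - rx);       // racy read-modify-write; Hogwild! accepts
    Y[i].store(y - ry);       // the occasional lost update
}
```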

subwaystation commented 2 years ago

I think it's a cool idea, please feel free to try it out. Let's have a chat about this. You can contact me via simon.heumos@qbic.uni-tuebingen.de or @subwaystation:matrix.org.

Indeed, you are right about our Hogwild! implementation. Since we are using a std::atomic vector, we are not lock-free. We kind of missed this, but, in theory, if we have a large number of nodes, the probability that we update a node pair while another thread is reading from it is very, very small. So in practice we may not need an atomic vector whose .load() and .store() calls eat some computation time. Quoting the paper: "However, when the data access is sparse, meaning that individual SGD steps only modify a small part of the decision variable, we show that memory overwrites are rare and that they introduce barely any error into the computation when they do occur. We demonstrate both theoretically and experimentally a near linear speedup with the number of processors on commonly occurring sparse learning problems." Learning from this, we will exchange our std::atomic vector for a default vector. Hopefully, this will work, too.

According to https://stackoverflow.com/questions/12260946/c-access-to-vector-from-multiple-threads it's a good idea to use atomic vectors. But with default vectors, maybe the memory overwrites won't alter the expected layout significantly.

tonyjie commented 2 years ago

Yes I see the point. Thanks and we can keep in touch in this thread or in email :)

ekg commented 2 years ago

Note that we use a vector of atomic integers, which is lock-free globally, not an atomic vector of integers, which would lock.

Using a plain vector is likely to result in memory corruption. Worth a test if you're curious.
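To illustrate the distinction with a sketch (std::atomic's lock-freedom is formally platform-dependent, but it holds for 64-bit integers on the common 64-bit targets):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

int main() {
    // Vector of atomics: every element is independently atomic; on the
    // usual 64-bit targets each load/store is a single instruction and
    // no lock exists anywhere in the structure.
    std::vector<std::atomic<uint64_t>> coords(1000000);
    assert(coords[0].is_lock_free());  // true on x86-64 and aarch64

    // The locking alternative (std::atomic<std::vector<...>> is not even
    // valid C++, so in practice it means a mutex around a plain vector)
    // serializes all threads on one lock:
    std::vector<uint64_t> plain(1000000);
    std::mutex m;
    {
        std::lock_guard<std::mutex> guard(m);
        plain[0] = 1;
    }
    return 0;
}
```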


AndreaGuarracino commented 2 years ago

I tried to replace the vector of atomic integers with a vector of integers (https://github.com/pangenome/odgi/commit/298182452b36a2e35529b2158f2c0dcdccef812d) and execute odgi sort 10 times in both cases (atomic and not atomic) on a human chromosome 20 pangenome graph from 90 haplotypes:

#length    nodes    edges    paths
187070843  2820570  3969999  824

I've used 48 threads each time. The differences are subtle:

[image: runtime and mean link length comparison between the atomic and non-atomic runs]

The runtime seems a little bit unstable with the vector of integers, but the mean link length tends to be slightly shorter.

I tried a single run in 2D with odgi layout too (https://github.com/pangenome/odgi/commit/c516b893a3e75091e3ea43e516f0744fbec7698c), but with the vector of integers I got a worse runtime.

tonyjie commented 2 years ago


Thanks for the results. It seems they have similar performance in the 1D sort, and the plain vector didn't produce a memory corruption error.

ekg commented 2 years ago

https://stackoverflow.com/questions/54188/are-c-reads-and-writes-of-an-int-atomic#54242

It doesn't seem there is a performance gain, and the total space should be the same. Given the apparent potential for issues on architectures other than those we've tested, I would be uncomfortable using a plain vector.
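For what it's worth, the practical upshot of that thread: with memory_order_relaxed an atomic load or store compiles to an ordinary mov on x86-64, so the atomic vector costs essentially nothing there while still ruling out torn reads on architectures where plain 64-bit accesses aren't guaranteed atomic. A sketch of the access pattern:

```cpp
#include <atomic>

// Relaxed atomics ask only for atomicity, not ordering. On x86-64 these
// compile to the same plain mov instructions as a non-atomic double, so
// keeping the atomic vector is essentially free there, while remaining
// correct on architectures where a plain 64-bit access may tear.
double read_coord(const std::atomic<double>& c) {
    return c.load(std::memory_order_relaxed);
}

void write_coord(std::atomic<double>& c, double v) {
    c.store(v, std::memory_order_relaxed);
}
```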


subwaystation commented 2 years ago

I agree with @ekg. Better to keep the atomic vector.