yunshiuan opened 3 years ago
I typically use graph-tool (Python) or LightGraphs (Julia), and both are fine memory-wise. LightGraphs even has a squash method that returns the smallest integer type required to represent the graph. NetworkX is less efficient, especially if you are using a multigraph, as it uses a dict structure instead of an adjacency list (if I recall correctly).
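To make the dict-vs-array point concrete, here is a toy sketch (pure stdlib, CPython-specific byte counts) comparing a networkx-style dict-of-dicts adjacency with flat typed arrays; the graph and all names are illustrative:

```python
# Sketch: why dict-of-dicts adjacency (networkx-style) costs more memory
# than flat typed arrays. Toy ring graph; sizes are CPython-specific.
import sys
from array import array

edges = [(i, (i + 1) % 1000) for i in range(1000)]  # 1000-node ring

# networkx-style dict-of-dicts: each edge stored in both directions,
# with a per-edge attribute dict as the value
adj = {}
for u, v in edges:
    adj.setdefault(u, {})[v] = {}
    adj.setdefault(v, {})[u] = {}

# flat 32-bit arrays, one (src, dst) pair per edge
# ("squash"-style downcasting to the smallest int type that fits)
src = array("i", (u for u, _ in edges))
dst = array("i", (v for _, v in edges))

dict_bytes = sys.getsizeof(adj) + sum(sys.getsizeof(d) for d in adj.values())
array_bytes = src.itemsize * len(src) + dst.itemsize * len(dst)
print(dict_bytes, array_bytes)  # the dict version is far larger
```

This undercounts the dict side (the per-edge attribute dicts aren't included), and it still comes out much larger than the arrays.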
Thank you! This is very helpful :) I'll check them out. By the way, have you considered including GraphX in the scope? Even on a single machine, the Spark-based package might do a good job of parallelizing the computation.
Nope, it was my bad experience using Spark for network algorithms that motivated this benchmark in the first place, to find a better alternative. My hunch is that it would be at NetworkX's level of performance, and even worse if distributed.
If you are mainly working with the graph structure itself, I find it's often beneficial to keep the graph in memory using one of these libraries and use external (possibly distributed) storage for the metadata.
Thank you for sharing your experiences!
Amazing work.
Between NetworKit and graph-tool, which one do you consider more efficient in terms of memory usage?
I have an undirected weighted graph with 50M nodes and 100M edges. I tried several Python libraries, and the only one that supported this workload was NetworKit. The graph takes about 6 GB of RAM (7 GB during creation). My main use case is shortest-path queries.
I haven't tried graph-tool, but if its memory footprint is worse it won't solve my problem, even if the shortest-path queries are faster, as your benchmark showed.
I'm using Databricks and a Spark cluster. I also tried GraphFrames (distributed) with a 5-node cluster, but for shortest paths and most other kinds of queries this lib is trash. All the other libraries I've tested run on the cluster's driver machine; since they support multi-threading, they use all of its cores (8 cores, 28 GB of RAM).
Considering a graph this size, in your opinion is graph-tool worth a try?
graph-tool is definitely worth trying. It uses the Boost C++ libraries internally and should be on par with NetworKit in terms of memory usage. It's also pretty easy to install the package and try it out!
Distributed graph computation is a hard problem, and I try to avoid Spark as much as possible.
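For intuition on the shortest-path use case above: libraries like graph-tool and NetworKit keep the graph in a compact CSR-style (compressed sparse row) layout, which a Dijkstra query can scan cheaply. Here is a minimal pure-stdlib sketch of that idea; the function names and the toy graph are illustrative, not any library's actual API:

```python
# Sketch: CSR-style adjacency (offsets + flat neighbor/weight arrays)
# plus a Dijkstra shortest-path query over it. Pure stdlib, toy data.
import heapq
from array import array

def build_csr(n, edges):
    """edges: list of (u, v, w); returns an undirected CSR adjacency."""
    deg = [0] * n
    for u, v, _ in edges:
        deg[u] += 1
        deg[v] += 1
    offsets = array("q", [0] * (n + 1))
    for i in range(n):
        offsets[i + 1] = offsets[i] + deg[i]
    nbr = array("i", [0]) * offsets[n]   # flat neighbor ids
    wgt = array("d", [0.0]) * offsets[n] # flat edge weights
    pos = list(offsets[:n])              # next free slot per node
    for u, v, w in edges:
        nbr[pos[u]], wgt[pos[u]] = v, w; pos[u] += 1
        nbr[pos[v]], wgt[pos[v]] = u, w; pos[v] += 1
    return offsets, nbr, wgt

def dijkstra(n, offsets, nbr, wgt, src):
    dist = [float("inf")] * n
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue  # stale queue entry
        for i in range(offsets[u], offsets[u + 1]):
            v, nd = nbr[i], d + wgt[i]
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0), (2, 3, 1.0)]
dist = dijkstra(4, *build_csr(4, edges), 0)
print(dist)  # → [0.0, 1.0, 3.0, 4.0]
```

The point is the storage layout: per node there is only an offset, and per edge only an int and a double, with no per-edge Python objects.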
I know you said that memory consumption is out of the scope of your study, but I'm curious about your intuition on this. I'm looking for the most memory-efficient package. My raw list of edges (in numpy) takes 16 GB, but creating a networkx instance from the edge list requires more than 64 GB of memory :(
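A rough back-of-envelope suggests that kind of blow-up is expected: networkx turns each endpoint into a boxed Python int keyed into a dict, and gives every edge its own attribute dict. The per-entry slot overhead below is an assumed ballpark figure, and the edge count is inferred from the 16 GB int64 edge list:

```python
# Back-of-envelope: why a 16 GB int64 edge list can need far more than
# 64 GB in networkx. CPython rough estimates, not exact figures.
import sys

n_edges = 1_000_000_000                    # 16 GB / (2 endpoints * 8 bytes)
raw_bytes = n_edges * 2 * 8                # numpy: two int64s per edge

boxed = sys.getsizeof(1_000_000)           # boxed int key (~28 B)
attrs = sys.getsizeof({})                  # per-edge attribute dict (~64 B)
slot = 100                                 # ASSUMED dict-entry overhead incl. load factor
nx_bytes = n_edges * 2 * (boxed + attrs + slot)  # stored once per direction

print(f"raw ~ {raw_bytes / 2**30:.0f} GiB, networkx ~ {nx_bytes / 2**30:.0f} GiB")
```

Even with generous assumptions about int sharing, the estimate lands well above the 64 GB observed, which is why array-backed libraries (graph-tool, NetworKit, igraph) fare so much better here.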