rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0
1.68k stars 300 forks source link

[FEA]: Markov Clustering for large protein similarity networks #3931

Open sunitj opened 11 months ago

sunitj commented 11 months ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Critical (currently preventing usage)

Please provide a clear description of problem this feature solves

I would like cuGraph to implement the Markov clustering algorithm, because I would like to perform such a clustering on my graphs where, nodes are proteins, edges denote similarity between the nodes and edge weights are similarities. MCL is a commonly used algorithm in analyzing protein similarity networks. A recent publication performed such a clustering on a graph that consisted of 570,198,677 nodes and 5,196,499,560 edges using a distributed version of the algorithm, HipMCL. They required 2,500 compute nodes (170,000 compute cores) on the NERSC super computer for 3h20m. I estimate my graph to be around the same order of nodes and edges (if not more), but I don't have access to a supercomputer. I do have access to AWS and GPU instances. I should also add that I am completely new to the RAPIDS universe of tools, but quite excited and eager to learn.

Describe your ideal solution

The new function takes an existing graph as described above. The function would perform the following steps (borrowed from this repo)

import markov_clustering as mc
import networkx as nx
import random
from itertools import pairwise

def get_edge_list(nodes: list) -> list:
    numnodes = len(nodes)
    assert numnodes >= 2, f"Nodes list isn't long enough: {nodes}"
    return [(u, v, random.uniform(0, 1)) for u, v in pairwise(nodes) if u != v]

G = nx.DiGraph()
nodes = [i for i in range(1,101)]
edges = get_edge_list(nodes)
G.add_weighted_edges_from(edges)

# then get the adjacency matrix (in sparse form)
matrix = nx.to_scipy_sparse_matrix(G)

# cluster
result = mc.run_mcl(matrix, inflation=2)
clusters = mc.get_clusters(result)

# optionally tie into the visualization
mc.draw_graph(matrix, clusters, pos=positions, node_size=50, with_labels=False, edge_color="silver")

Describe any alternatives you have considered

I have considered:

but none of them have GPU acceleration and I don't have the resources to rent the NERSC supercomputer for 4 hours.

Additional context

Code of Conduct

ChuckHastings commented 11 months ago

Thanks for your interest!

We will review the linked papers and consider how this could be done in cuGraph and update this issue with information on where it might fit in our road map.

vincentvandeuren commented 11 months ago

I am also very much interested in an easily accessible GPU-accelerated MCL implementation!