rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0
1.69k stars 300 forks source link

[QST]: MST cuGraph RuntimeError: RAFT failure, how to fix? #4458

Open stephwon opened 4 months ago

stephwon commented 4 months ago

What is your question?

I'm trying to conduct a Steiner Tree approximation with several terminal nodes (and subgraphs I need to analyze). However, whenever I try to iterate through, it fails at MST despite the graph being Undirected: RuntimeError: RAFT failure at file=/scratch/user/anaconda3/envs/rapids-24.04/include/raft/sparse/solver/detail/mst_solver_inl.cuh line=152:

The code works if I run through terminal nodes for each subgraph (one at a time). I already checked for self-loops, isolated nodes, or disconnected components in the big graph data, and they don't exist. The data types are to the Dask cugraph standard, so there is no issue there. Even if I change `DiG = cugraph.Graph(directed=False)' problem persists with the same error message. I'm not sure how to address this error... or am I missing something here? Graph data (df_graph) has 46.6M edges, directed and unweighted. My code below:

#Set working directory
import os
os.chdir('/scratch') 

# Install required packages
import pandas as pd
import matplotlib.pyplot as plt
import yaml
import sys 
import csv

# Install Networka
import networkx as nx

import cudf
cudf.__version__

import cugraph
cugraph.__version__

# Import Dask to utilize Muti-GPU
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

import cugraph.dask as dcg
import cugraph.dask.comms.comms as Comms
from cugraph.generators.rmat import rmat

# Initialize Dask CUDA cluster and client
cluster = LocalCUDACluster()
client = Client(cluster)

# Initialize cuGraph communication
Comms.initialize(p2p=True)

columns_to_read = ['subject', 'object', 'predicate']

# Read the TSV file using Dask cuDF with specific columns
df_graph = dask_cudf.read_csv('big_graph_edge.tsv', sep='\t', 
                        usecols=columns_to_read)

# Renaming columns to match cuGraph requirements
df_graph = df_graph.rename(columns={'subject': 'source', 'object': 'destination'})

print(df_graph.head())

# Create graph from input data
DiG = cugraph.Graph(directed=True)
DiG.from_dask_cudf_edgelist(df_graph, source = 'source', destination = 'destination' )

# Verify the graph has edges
print("Number of edges in the graph:", DiG.number_of_edges())

df = pd.read_csv('/scratch/subgraph_nodes.csv', sep = ",")
df

# Define a function to clean each c_id list
def get_terminals(c_ids):
    # Remove the leading and trailing square brackets
    c_ids = c_ids.strip("[]")
    # Split by comma and clean each element
    return [c_id.strip().strip("'") for c_id in c_ids.split(',')]

# Apply the cleaning function to the 'c_ids' column and convert to list of lists
terminals = df['c_ids'].head().apply(lambda x: get_terminals(x) if pd.notnull(x) else []).tolist()
terminals # output is list of lists

# Filter out terminals that are not in the graph
filt_subgraphs = [[t for t in terminal if DiG.has_node(t)] for terminal in terminals]

# Check if there are valid terminals
if not filt_subgraphs:
    raise ValueError("None of the terminal vertices are present in the graph")

 def bfs_with_cache(graph, start, cache):
    if start in cache:
        return cache[start]
    distances = dcg.bfs(graph, start=start)
    distances_df = distances.compute().to_pandas()
    cache[start] = distances_df
    return distances_df

 def get_edges(lst,DiG,cache):
    edges = []

    # Find the shortest paths between all pairs of terminal vertices using BFS
    for l in lst:
        distances_df = bfs_with_cache(DiG,l,cache)
        filtered_df = distances_df[distances_df['vertex'].isin(lst)]

        for _, row in filtered_df.iterrows():
            if row['vertex'] != l:
                edges.append((l, row['vertex'], row['distance']))

    # Create a new DataFrame to hold the edges of the shortest path graph
    edges_df = pd.DataFrame(edges, columns=['source', 'destination', 'weight'])
    return edges_df

def get_mst_results(edges_df):
    # Create a new graph for MST purpose
    mst_graph = cugraph.Graph(directed=False)
    mst_graph.from_pandas_edgelist(edges_df, source='source', destination='destination', edge_attr='weight')

    # Compute the MST
    mst = cugraph.minimum_spanning_tree(mst_graph)

    # Extract the edge list of the MST
    mst_edges = mst.view_edge_list()
    mst_df = mst_edges.to_pandas()

    # Get number of edges and nodes
    num_edges = mst_df.shape[0]
    num_nodes = len(pd.concat([mst_df['src'], mst_df['dst']]).unique())

    return num_edges, num_nodes

cache = {}

# need access to ids, associate edges to graph ids

#print("Graph ID:")

### Main code execution
for subgraph in filt_subgraphs:
    graph_edges = get_edges(subgraph,DiG,cache)
    n_edges,n_nodes = get_mst_results(graph_edges)

print(f"Number of edges in the MST: {n_edges}")
print(f"Number of nodes in the MST: {n_nodes}")
print("\n")

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[21], line 10
      8 for subgraph in filt_subgraphs:
      9     graph_edges = get_edges(subgraph,DiG,cache)
---> 10     n_edges,n_nodes = get_mst_results(graph_edges)
     12 print(f"Number of edges in the MST: {n_edges}")
     13 print(f"Number of nodes in the MST: {n_nodes}")

Cell In[16], line 7, in get_mst_results(edges_df)
      4 mst_graph.from_pandas_edgelist(edges_df, source='source', destination='destination', edge_attr='weight')
      6 # Compute the MST
----> 7 mst = cugraph.minimum_spanning_tree(mst_graph)
      9 # Extract the edge list of the MST
     10 mst_edges = mst.view_edge_list()

File /scratch/user/anaconda3/envs/rapids-24.04/lib/python3.11/site-packages/cugraph/tree/minimum_spanning_tree.py:105, in minimum_spanning_tree(G, weight, algorithm, ignore_nan)
    103     return cugraph_to_nx(mst)
    104 else:
--> 105     return _minimum_spanning_tree_subgraph(G)

File /scratch/user/anaconda3/envs/rapids-24.04/lib/python3.11/site-packages/cugraph/tree/minimum_spanning_tree.py:26, in _minimum_spanning_tree_subgraph(G)
     24 if G.is_directed():
     25     raise ValueError("input graph must be undirected")
---> 26 mst_df = minimum_spanning_tree_wrapper.minimum_spanning_tree(G)
     27 if G.renumbered:
     28     mst_df = G.unrenumber(mst_df, "src")

File minimum_spanning_tree_wrapper.pyx:71, in cugraph.tree.minimum_spanning_tree_wrapper.minimum_spanning_tree()

File minimum_spanning_tree_wrapper.pyx:52, in cugraph.tree.minimum_spanning_tree_wrapper.mst_double()

RuntimeError: RAFT failure at file=/scratch/user/anaconda3/envs/rapids-24.04/include/raft/sparse/solver/detail/mst_solver_inl.cuh line=152: 

Code of Conduct

ChuckHastings commented 3 months ago

Is this a dataset you can share? This is not an error I've seen with some of our existing data sets. The call is somehow getting more answers than should be possible. It's probably a bug that only occurs with a specific dataset.