rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0
1.77k stars 304 forks source link

Enable Multi-Source BFS for arbitrary sources #3202

Open GregorySchwing opened 1 year ago

GregorySchwing commented 1 year ago

Rewriting our objective here.

The current implementation of multi-source BFS only provides correct results if the list of sources are on separate connected components. The request here is to add support for multi-source support where the source vertices may be in the same connected component.

Original Request

Version

22.12.00+0.g8474cfcf.dirty

Which installation method(s) does this occur on?

Docker

Describe the bug.

From the BFS documentation

start Integer or list, optional (default=None)

The id of the graph vertex from which the traversal begins, or if a list, the vertex from which the traversal begins in each component of the graph. Only one vertex per connected component of the graph is allowed.

The requirement that "Only one vertex per connected component of the graph is allowed." was stated to be deprecated by Brad Rees here : https://stackoverflow.com/questions/70632337/how-to-accelerate-finding-all-pairs-shortest-path-with-gpu-using-rapids-cugraph

For cugraphs created from cudf, passing an array as the start argument works fine. However, graph types verify the validity of the start vertices differently, and the way a list is evaluated by the in operator on a graph type returns false when it should return true.

Reason lines 50-54 of cugraph/cugraph/traversal/bfs.py

"""

ensure start vertex is valid

    invalid_vertex_err = ValueError("A provided vertex was not valid")
    if is_nx_graph_type(G_type):
        **if start not in G:**
            raise invalid_vertex_err

""" Fix is to Replace if start not in G with any(v not in G for v in start)) example.zip

Minimum reproducible example

#!/usr/bin/env python
# coding: utf-8

# In[ ]:

# In[69]:

# define a print path function that take the dataframe and a vertex ID

def print_path(df, id):

    p = cugraph.utils.get_traversed_path_list(df, id)
    print(p)

# In[70]:

# Import a built-in dataset
from cugraph.experimental.datasets import karate
import cugraph

gdf = karate.get_edgelist(fetch=True)
# Look at the first few data records - the output should be two columns: 'src' and 'dst'
print(gdf.head())
# create a Graph 
G_cudf_type = cugraph.Graph()
G_cudf_type.from_cudf_edgelist(gdf, source='src', destination='dst', renumber=False)

# In[71]:

import networkx as nx
G_graphType = nx.karate_club_graph()
# cugraph from_cudf_edgelist isnt iterable!
#neighbors = [n for n in G.neighbors(0)]
neighbors = [n for n in G_graphType.neighbors(0)]
print(neighbors)

# In[72]:

srcs = [1,2]
# Call BFS on the graph starting from vertex 1
df = cugraph.bfs(G_cudf_type,srcs)
# Let's take a looks at the structure of the returned dataframe
df.dtypes

# In[73]:

# Path to a src
print_path(df, 1)
# Path to a src
print_path(df, 2)
# See who won the race to a shared neighbor 0
print_path(df, 0)

# In[74]:

# Call BFS on the graph starting from vertex 1
srcs = [1,2,3]
df2 = cugraph.bfs(G_graphType,srcs)
# Let's take a looks at the structure of the returned dataframe
df2.dtypes

# In[75]:

pdf = nx.to_pandas_edgelist(G_graphType, source='src', target='dst')
import cudf
import pandas as pd
df_from_graph_type = cudf.from_pandas(pdf)
print(df.head)

# In[76]:

# create a Graph 
G_graph_type_2_cudf_type = cugraph.Graph()
G_graph_type_2_cudf_type.from_cudf_edgelist(df_from_graph_type, source='src', destination='dst', renumber=False)

# In[77]:

# Call BFS on the graph starting from vertex 1
srcs = [1,2,3]
df2 = cugraph.bfs(G3,srcs)

# In[78]:

# Path to a src
print_path(df2, 1)
# Path to a src
print_path(df2, 2)
# See who won the race to a shared neighbor 0
print_path(df2, 0)

# In[60]:

#MSBFS only works on cugraphs created from cudf_edge lists

# In[89]:

# Reason lines 50-54 of cugraph/cugraph/traversal/bfs.py
"""
        # ensure start vertex is valid
        invalid_vertex_err = ValueError("A provided vertex was not valid")
        if is_nx_graph_type(G_type):
            if start not in G:
                raise invalid_vertex_err
"""
# This line
# if start not in G:
# always returns false for lists
srcs = [1,2,3]
print("Check if list in graph: ", srcs in G_graphType)
print("Check if each element of the list is in the graph: ", not(any(v not in G_graphType for v in srcs)))

# In[ ]:

Relevant log output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_306/950571499.py in <module>
      2 srcs = [1,5]
      3 print(srcs in G2)
----> 4 df = cugraph.bfs(G2,srcs)

/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/traversal/bfs.py in bfs(G, start, depth_limit, i_start, directed, return_predecessors)
    206 
    207     """
--> 208     (start, directed) = _ensure_args(G, start, i_start, directed)
    209 
    210     # FIXME: allow nx_weight_attr to be specified

/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/traversal/bfs.py in _ensure_args(G, start, i_start, directed)
     52         if is_nx_graph_type(G_type):
     53             if start not in G:
---> 54                 raise invalid_vertex_err
     55         else:
     56             if not isinstance(start, cudf.DataFrame):

ValueError: A provided vertex was not valid

Environment details

Docker rapids

Other/Misc.

example.zip

Code of Conduct

ChuckHastings commented 1 year ago

The specific problem mentioned here is an issue in our python code... reassigned to Rick.

@GregorySchwing - There is still an assumption that if you pass multiple seeds they are from different components of the graph. There is no error check (this is not forced on the caller), nor is there a warning if you do. So you may call with multiple seeds from the same component.

If you pass multiple seeds that are from the same component be aware of what you can expect. This is a frontier-based algorithm. You will traverse from all of the seeds as if they are in frontier 0. As each frontier is expanded in parallel, the vertices of that frontier will race to neighbors in the next frontier. If the same vertex that has not been visited in a previous frontier is encountered, one of the paths will win and the others will lose and not be included in the back pointers.

The most important thing to recognize is that the result of passing vertices u and v which are on the same component will not be the same as if you passed u to a call to bfs (and extracted the paths) and then passed v to a call to bfs (and extracted the paths) and merged the results. Each vertex t in the same component as u and v will have a path in the result set either from u OR from v. If the shortest path from v to t and the shortest path fromu to t are of different lengths, you'll get the shorter path. If they are the same length, you'll get whichever path won the race.

GregorySchwing commented 1 year ago

The specific problem mentioned here is an issue in our python code... reassigned to Rick.

@GregorySchwing - There is still an assumption that if you pass multiple seeds they are from different components of the graph. There is no error check (this is not forced on the caller), nor is there a warning if you do. So you may call with multiple seeds from the same component.

If you pass multiple seeds that are from the same component be aware of what you can expect. This is a frontier-based algorithm. You will traverse from all of the seeds as if they are in frontier 0. As each frontier is expanded in parallel, the vertices of that frontier will race to neighbors in the next frontier. If the same vertex that has not been visited in a previous frontier is encountered, one of the paths will win and the others will lose and not be included in the back pointers.

The most important thing to recognize is that the result of passing vertices u and v which are on the same component will not be the same as if you passed u to a call to bfs (and extracted the paths) and then passed v to a call to bfs (and extracted the paths) and merged the results. Each vertex t in the same component as u and v will have a path in the result set either from u OR from v. If the shortest path from v to t and the shortest path fromu to t are of different lengths, you'll get the shorter path. If they are the same length, you'll get whichever path won the race.

I understand, and I am interested in using MSBFS to find paths to vertices which are equidistant from two arbitrary seeds for augmenting path max matching algorithms.

kingmesal commented 1 year ago

Changing this to a feature request as this is not a bug.

ChuckHastings commented 12 months ago

@GregorySchwing - starting to look at this request to plan when we can execute this.

Do you have an idea of how many concurrent traversals you would want to do in a single call? Supporting multiple concurrent traversals will add memory pressure as we need to keep track of back pointers for all of the vertices we visit.

GregorySchwing commented 12 months ago

This is no longer an interest of mine. The reason I opened the issue was because the capacity to do this already exists depending on the format the graph is passed as (cuDF vs networkx).

"For cugraphs created from cudf, passing an array as the start argument works fine. However, graph types verify the validity of the start vertices differently, and the way a list is evaluated by the in operator on a graph type returns false when it should return true."