rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

[BUG] Illegal Memory Access when running neighborhood sampling #2446

Closed jnke2016 closed 2 years ago

jnke2016 commented 2 years ago

Describe the bug

A bug surfaced when running uniform neighbor sampling on a single GPU. Initial investigation suggests it is caused by the C++ renumbering: skipping it (legacy_renum_only=True) seems to resolve the issue (yet to be verified). However, skipping the C++ renumbering is not a viable solution.
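
For readers unfamiliar with renumbering, here is a minimal pure-Python sketch of the general idea: external vertex IDs are mapped to contiguous internal IDs starting at 0, and results are mapped back afterwards. This is an illustration only, not cuGraph's implementation; the function name `renumber` and its return values are hypothetical.

```python
def renumber(src, dst):
    """Map arbitrary vertex IDs to contiguous internal IDs 0..n-1.

    Hypothetical helper for illustration; not part of cuGraph.
    """
    # Collect every vertex that appears as a source or destination.
    vertices = sorted(set(src) | set(dst))
    # External ID -> internal ID lookup.
    to_internal = {v: i for i, v in enumerate(vertices)}
    internal_src = [to_internal[v] for v in src]
    internal_dst = [to_internal[v] for v in dst]
    # `vertices` doubles as the internal -> external lookup table.
    return internal_src, internal_dst, vertices


# Same edge list as the reproducer in this report.
src, dst = [2, 3, 1], [4, 5, 2]
internal_src, internal_dst, to_external = renumber(src, dst)

# The seed vertex must be translated the same way before sampling.
seed_internal = to_external.index(2)

# Neighbors of the seed, translated back to external IDs.
neighbors = [
    to_external[d]
    for s, d in zip(internal_src, internal_dst)
    if s == seed_internal
]
```

If seeds and edges are translated with inconsistent mappings or types, a lookup can land outside the valid ID range, which is one plausible way an out-of-bounds access could arise. That is speculation about this class of bug, not a diagnosis of this issue.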

Steps/Code to reproduce bug

import cudf
import cugraph
import numpy as np

df = cudf.DataFrame({'src':[2,3,1],
                       'dst':[4,5,2]}).astype(np.int32)
df['weight']=1.0
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(df, source='src',destination='dst',edge_attr='weight')

for _ in range(0,2):
    seed_list = cudf.Series([2]).astype(np.int32)
    sampled_g_cugraph = cugraph.uniform_neighbor_sample(G,
                                         start_list=seed_list,
                                         fanout_vals=[-1],
                                         with_replacement=False)
VibhuJawa commented 2 years ago

MRE with legacy_renum_only=True and the same datatype.


import cugraph 
import cudf
import numpy as np

src_ser = cudf.Series([2, 3, 4, 5, 6, 3, 4, 7]).astype(np.int64)
dst_ser = cudf.Series([1, 1, 1, 1, 1, 2, 2, 3]).astype(np.int64)

df = cudf.DataFrame({'src':src_ser, 'dst':dst_ser})
df['weight']=1.0

G = cugraph.Graph(directed=True)

G.from_cudf_edgelist(df, source='src',destination='dst',edge_attr='weight', legacy_renum_only=True)
seed_list = cudf.Series([1]).astype(np.int64)

for _ in range(0,10):
    g = cugraph.uniform_neighbor_sample(G,
                                         start_list=seed_list,
                                         fanout_vals=[-1],
                                         with_replacement=False)

    print(g)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [1], in <cell line: 16>()
     14 seed_list = cudf.Series([1]).astype(np.int64)
     16 for _ in range(0,10):
---> 17     g = cugraph.uniform_neighbor_sample(G,
     18                                          start_list=seed_list,
     19                                          fanout_vals=[-1],
     20                                          with_replacement=False)
     24     print(g)

File /datasets/vjawa/miniconda3/envs/cugraph_11_5_1_aug/lib/python3.9/site-packages/cugraph/sampling/uniform_neighbor_sample.py:92, in uniform_neighbor_sample(G, start_list, fanout_vals, with_replacement, is_edge_ids)
     88     else:
     89         start_list = G.lookup_internal_vertex_id(start_list)
     91 sources, destinations, indices = \
---> 92     pylibcugraph_uniform_neighbor_sample(
     93         resource_handle=ResourceHandle(),
     94         input_graph=G._plc_graph,
     95         start_list=start_list,
     96         h_fan_out=fanout_vals,
     97         with_replacement=with_replacement,
     98         do_expensive_check=False
     99     )
    101 df = cudf.DataFrame()
    102 df["sources"] = sources

File uniform_neighbor_sample.pyx:142, in pylibcugraph.uniform_neighbor_sample.uniform_neighbor_sample()

File utils.pyx:51, in pylibcugraph.utils.assert_success()

RuntimeError: non-success value returned from cugraph_uniform_neighbor_sample: CUGRAPH_UNKNOWN_ERROR
alexbarghi-nv commented 2 years ago

@VibhuJawa I cannot replicate this. I think you may have a CUDA version mismatch. The code you posted gives me the expected output (in this case no src/dst pairs) even when run many times. On my side, Chuck's PR appears to have fixed the intermittent errors.
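
One way to check for a CUDA version mismatch is to compare the runtime and driver versions visible to Python. The sketch below assumes CuPy is installed; the helper names `decode_cuda_version` and `report_cuda_versions` are hypothetical. CUDA reports versions as a single integer, 1000 * major + 10 * minor, so 11050 means 11.5.

```python
def decode_cuda_version(version):
    """Split CUDA's integer version (1000 * major + 10 * minor) into (major, minor)."""
    major = version // 1000
    minor = (version % 1000) // 10
    return major, minor


def report_cuda_versions():
    """Return ((runtime major, minor), (driver major, minor)).

    Assumes CuPy is installed and a GPU is available; imported lazily so
    the decoder above can be used without it.
    """
    import cupy

    runtime = decode_cuda_version(cupy.cuda.runtime.runtimeGetVersion())
    driver = decode_cuda_version(cupy.cuda.runtime.driverGetVersion())
    return runtime, driver
```

On a machine with a GPU, calling `report_cuda_versions()` and comparing the result with the CUDA version the conda environment was built for should show whether they disagree.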

VibhuJawa commented 2 years ago

> @VibhuJawa I cannot replicate this. I think you may have a CUDA version mismatch. The code you posted gives me the expected output (in this case no src/dst pairs) even when run many times. On my side, Chuck's PR appears to have fixed the intermittent errors.

Gotcha. Yeah, so this was just another reproducer with legacy_renum_only=True and the same datatype. There was some confusion about whether this error manifests with legacy_renum_only=True or with a different datatype, so I wanted to have an example for that.

Thanks for verifying that Chuck's PR fixes stuff.

alexbarghi-nv commented 2 years ago

Sure, I was able to run with legacy_renum_only=True and legacy_renum_only=False after Chuck's PR.