rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0
1.7k stars 301 forks source link

[BUG]: PLC inconsistent handling of isolated nodes; example using WCC #3974

Closed eriknw closed 11 months ago

eriknw commented 11 months ago

Version

23.12 dev

Which installation method(s) does this occur on?

Source

Describe the bug.

PLC sometimes handles isolated nodes inconsistently, and we are forced to handle the edge cases in Python (if possible).

I will illustrate this with plc.weakly_connected_components, but the same behavior has been seen elsewhere such as with Louvain.

Let's consider an undirected graph with edge indices [1, 2] and [2, 1] in COO format. There is currently no way to tell PLC how many nodes are in this graph, and it assumes the number of nodes is max(indices) + 1. Here's the first example:

import cupy as cp
import numpy as np
import pylibcugraph as plc

src_indices = cp.array([1, 2], np.int32)
dst_indices = cp.array([2, 1], np.int32)

G = plc.SGGraph(
    resource_handle=plc.ResourceHandle(),
    graph_properties=plc.GraphProperties(
        is_multigraph=False,
        is_symmetric=True,
    ),
    src_or_offset_array=src_indices,
    dst_or_index_array=dst_indices,
    input_array_format="COO",
    weight_array=None,
    store_transposed=False,
    renumber=False,
    do_expensive_check=False,
)
node_ids, labels = plc.weakly_connected_components(
    resource_handle=plc.ResourceHandle(),
    graph=G,
    offsets=None,
    indices=None,
    weights=None,
    labels=None,
    do_expensive_check=False,
)
print(node_ids)
print(labels)

which outputs

[0 1 2]
[0 2 2]

This is almost perfect, right? Because we gave renumber=False when creating G, it assumes nodes are 0, 1, and 2 (b/c of max indices). But, what if we had four (or more) nodes in the graph and node 3 (and above) have no edges? Then we have no results for them, and we have no way to tell PLC to create results for them.

What if instead of COO we used CSR? One might expect CSR to work since the number of nodes is implied by the length of the offsets array. With WCC, we have two ways to use CSR: when creating the graph, or when calling WCC.

    # Change creation of G to use this
    src_or_offset_array=cp.array([0, 0, 1, 2, 2], np.int32),
    dst_or_index_array=cp.array([2, 1], np.int32),
    input_array_format="CSR",
    # Or call plc.weakly_connected_components with this
    graph=None,
    offsets=cp.array([0, 0, 1, 2, 2], np.int32),
    indices=cp.array([2, 1], np.int32),

Both approaches give the same result as before (only nodes 0, 1, 2).

It seems the fundamental issue is that COO contains incomplete information about the graph. If we are not renumbering, then I want to tell PLC the number of nodes in the graph. If we are renumbering, then I want to give it an array of nodes. Then I want results for each node.

Is this something that should be handled in the PLC/C layers?

Minimum reproducible example

See above

Relevant log output

No response

Environment details

No response

Other/Misc.

No response

Code of Conduct

eriknw commented 11 months ago

This is also an issue with plc.sssp when the source index is greater than the indices found in COO. plc.bfs appears to behave okay in this case.

ChuckHastings commented 11 months ago

Will explore this in conjunction with #3947