rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

cuGraph Hungarian matching slower than CPU-based SciPy implementation #4695

Closed JamesMcCullochDickens closed 1 month ago

JamesMcCullochDickens commented 1 month ago

I have to compute a lot of Hungarian matchings between sets of points, using their pairwise distance as the matching criterion. So far I have tried this hybrid SciPy (CPU) and PyTorch method:

from typing import List, Tuple

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
import torch

def torch_cdist(p1: np.ndarray, p2: np.ndarray) -> np.ndarray:
    # Compute the pairwise Euclidean distance matrix on the GPU, then copy it back to host.
    p1 = torch.tensor(p1).to(0)
    p2 = torch.tensor(p2).to(0)
    dists = torch.cdist(p1, p2, p=2)
    return dists.cpu().numpy()

def hungarian_matching_pcds(points1: np.ndarray, points2: np.ndarray, to_gpu: bool = False) -> List[Tuple]:
    if to_gpu:
        cost_matrix = torch_cdist(points1, points2)
    else:
        cost_matrix = cdist(points1, points2)
    row_ind, col_ind = linear_sum_assignment(cost_matrix)
    matches = list(zip(row_ind, col_ind))
    return matches

As well as this approach using cuGraph and pylibraft:

import numpy as np
import cupy as cp
import cugraph
from pylibraft.distance import pairwise_distance
import cudf

def create_bipartite_graph(weights: cp.ndarray) -> cugraph.Graph:
    N = weights.shape[0]
    x = cp.arange(N)
    y = cp.arange(N)
    # Build a complete bipartite edge list: offset the destination ids by N
    # so the two point sets occupy disjoint vertex id ranges.
    destinations, sources = cp.meshgrid(x, y, indexing='ij')
    destinations = cp.ravel(destinations) + N
    sources = cp.ravel(sources)
    weights_r = cp.array(cp.ravel(weights), dtype=cp.float32)
    data = {"source": sources,
            "destination": destinations,
            "weights": weights_r}
    df = cudf.DataFrame(data)
    G = cugraph.Graph()
    G.from_cudf_edgelist(input_df=df, edge_attr="weights")
    return G

def get_hungarian_matches_cugraph(pcd1: np.ndarray, pcd2: np.ndarray) -> np.ndarray:
    num_points = pcd1.shape[0]
    pcd1_cp, pcd2_cp = cp.asarray(pcd1), cp.asarray(pcd2)
    distances_cp = cp.array(pairwise_distance(pcd1_cp, pcd2_cp))
    graph = create_bipartite_graph(distances_cp)
    workers = cudf.Series(cp.arange(num_points), dtype='int64')
    cost, df = cugraph.hungarian(graph, workers)
    # Undo the vertex id offset on the second partition to recover point indices.
    idx1 = np.expand_dims(cp.asnumpy(df.values[:, 0]), axis=-1)
    idx2 = np.expand_dims(cp.asnumpy(df.values[:, 1] - num_points), axis=-1)
    return np.concatenate([idx1, idx2], axis=-1)

I was expecting the cuGraph approach to be much faster, but it is actually significantly slower. For instance, for two sets of points of shape (1024, 3), the SciPy approach takes about 0.03 seconds, but the cuGraph approach takes about 1.3 seconds on my RTX 4090. I'm wondering if there is anything I am doing wrong to account for this?
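For reference, this is roughly how I'm timing the two approaches (a minimal sketch with random point clouds; my real benchmark data differs, and I warm up the GPU path first so one-time CUDA initialization isn't counted):

import time

import numpy as np

pts1 = np.random.rand(1024, 3).astype(np.float32)
pts2 = np.random.rand(1024, 3).astype(np.float32)

# CPU / SciPy path
start = time.perf_counter()
hungarian_matching_pcds(pts1, pts2)
print(f"SciPy: {time.perf_counter() - start:.3f} s")

# cuGraph path (warm-up call first so context creation isn't timed)
get_hungarian_matches_cugraph(pts1, pts2)
start = time.perf_counter()
get_hungarian_matches_cugraph(pts1, pts2)
print(f"cuGraph: {time.perf_counter() - start:.3f} s")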

ChuckHastings commented 1 month ago

It's not clear to me that you are doing anything wrong, per se. I can suggest some improvements for you and observe some limitations of our implementation.

A suggestion for you: there is a dense_hungarian method in our cugraph API that might be a bit more efficient. It takes a flattened 2-D cost matrix as input, along with the number of rows and columns. Ultimately our implementation is going to take the sparse matrix you constructed in the cugraph.Graph object and convert it back into a dense matrix... so by taking your 2-D matrix and converting it into a cugraph.Graph you're paying an unnecessary cost to convert between data structures. The API you are calling is designed for the case where you already have a cugraph.Graph object from some other algorithm. Using dense_hungarian might save you a little bit of time and memory.
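Something along these lines (a rough sketch only, not tested against your data; the exact type returned for the assignment may vary a bit by cugraph version):

import cupy as cp
import cudf
import cugraph
from pylibraft.distance import pairwise_distance

def get_hungarian_matches_dense(pcd1, pcd2):
    num_points = pcd1.shape[0]
    pcd1_cp, pcd2_cp = cp.asarray(pcd1), cp.asarray(pcd2)
    distances_cp = cp.asarray(pairwise_distance(pcd1_cp, pcd2_cp))
    # Pass the dense cost matrix directly (flattened, row-major) along with
    # its dimensions, skipping the cugraph.Graph construction entirely.
    costs = cudf.Series(cp.ravel(distances_cp))
    cost, assignment = cugraph.dense_hungarian(costs, num_points, num_points)
    # assignment[i] should be the column matched to row i
    return cp.asnumpy(assignment.values)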

I'm not sure of the exact representation of your data. The Date/Nagi implementation of the Hungarian method that we have incorporated into cugraph is more efficient on integer weights. If you either have integer weights, or can accept the loss of precision from converting to integer weights (multiply by some value and truncate/round to an integer), you might get better efficiency. There are multiple algorithms for implementing the Hungarian method; I believe the SciPy version is based on Jonker-Volgenant, which has different strengths.
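For example, something like this to scale the float distances into integer weights before solving (the scale factor here is arbitrary and controls how much precision you keep):

import cupy as cp

def to_integer_weights(distances_cp: cp.ndarray, scale: float = 10_000.0) -> cp.ndarray:
    # Multiply by a fixed scale factor and round, trading a controlled amount
    # of precision for the integer weights that the Date/Nagi implementation
    # handles more efficiently.
    return cp.rint(distances_cp * scale).astype(cp.int32)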

Regarding limitations in the cugraph implementation: we haven't done any performance tuning of our Python wrapper for this algorithm. In a simple test trying to replicate your issue, I observed some inefficiencies in our Python layer (mostly extra data copying) that add overhead to the call. Our C++ implementation supports computing multiple assignment problems at once (pass in multiple dense matrices and the algorithm will compute multiple Hungarian assignments concurrently), which could potentially benefit you: in a 1K by 1K matrix, several of the Hungarian steps leave the GPU starved for parallel work. However, we don't expose this through our Python API.

Unfortunately, we don't currently have plans to improve the Hungarian algorithm. After implementing it, we observed that cugraph is really focused on sparse graph algorithms, and the Hungarian algorithm is really a dense graph algorithm. We haven't deprecated the functionality at this point... but we have discussed doing so in the past.

JamesMcCullochDickens commented 1 month ago

@ChuckHastings

Thanks so much for the detailed response. I will try batching the graphs, the integer weights, and the other things you mentioned, and share the code here. Other than that, my hacky solution so far is to order the point clouds by z order and then do "windowed" Hungarian matchings on the windows with SciPy, which is fast enough for my use case albeit not optimal, but that's another story. The issue can be closed.
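For anyone curious, the windowed approach looks roughly like this (a sketch assuming both clouds have the same number of points; points are only matched within their window, so the result is not globally optimal):

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def windowed_hungarian_matches(points1: np.ndarray, points2: np.ndarray,
                               window_size: int = 256):
    # Sort both clouds by their z coordinate so nearby points tend to fall
    # into the same window.
    order1 = np.argsort(points1[:, 2])
    order2 = np.argsort(points2[:, 2])
    matches = []
    for start in range(0, points1.shape[0], window_size):
        idx1 = order1[start:start + window_size]
        idx2 = order2[start:start + window_size]
        cost = cdist(points1[idx1], points2[idx2])
        rows, cols = linear_sum_assignment(cost)
        # Map window-local indices back to original point indices.
        matches.extend(zip(idx1[rows], idx2[cols]))
    return matches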