rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] cuml.dask's kneighbors returns erroneous graph #5170

Open vdet opened 1 year ago

vdet commented 1 year ago

Describe the bug First of all, I am no expert. Please forgive me if the problem is an error of mine and not a bug.

I am trying to create a k-nearest-neighbors graph for 7,000,000+ points sitting in a 768-dimensional space, with k=10.

Since a single GPU does not have enough memory for the task, I resorted to the Dask implementation of cuML's nearest neighbors.
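For scale (my back-of-the-envelope arithmetic, not from the issue): the raw float32 matrix alone takes about 20 GiB, before any index structures:

n_samples, n_features = 7_000_000, 768
raw_bytes = n_samples * n_features * 4   ## float32 = 4 bytes per value
print(raw_bytes / 2**30)                 ## ~20.0 GiB for the data matrix alone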

cuml.dask's kneighbors returns an erroneous index containing values much greater than the number of nodes in the graph.
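Concretely, "erroneous" means indices outside the valid range [0, n_samples). A minimal bounds check (the helper name is mine, for illustration only):

import numpy as np

def check_neighbor_indices(indices, n_samples):
    ## Every neighbor index must refer to a row of the fitted data
    bad = (indices < 0) | (indices >= n_samples)
    if bad.any():
        raise ValueError(f"{int(bad.sum())} indices fall outside [0, {n_samples})")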

Steps/Code to reproduce bug

import math
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs

from distributed import Client

import cupy as cp
import dask.array as da
from cuml.dask.neighbors import NearestNeighbors as cuDistNearestNeighbors

from dask_cuda import LocalCUDACluster

## This function constructs the k-nearest-neighbors graph
def KNNGraph(df, k, client, metric='cosine', batch_size=524288, n_workers=8):
    chunk_size = math.ceil(df.shape[0]/n_workers)

    ## Create a Dask array from the dataframe
    da_df = da.from_array(df.to_numpy(dtype=np.float32), chunks=chunk_size)

    ## Use k+1 because self loops are removed later
    model = cuDistNearestNeighbors(client=client, n_neighbors=k+1, metric=metric, batch_size=batch_size)
    model.fit(da_df)

    ## For some reason I had to batchify calls to kneighbors due to an issue in cuml
    ## (https://github.com/rapidsai/cuml/issues/5056#issuecomment-1339626036).
    ## The 3 lines of code below should work in principle but didn't:
    ## distances, indices = model.kneighbors(da_df)
    ## indices = cp.asnumpy(indices)
    ## distances = cp.asnumpy(distances)
    distances, indices = model.kneighbors(da_df[0:batch_size,:])
    indices = cp.asnumpy(indices)
    distances = cp.asnumpy(distances)
    print(f"Max index; {indices.max()}") ##should be < df.shape[0] 
    for b in range(1, math.ceil(da_df.shape[0]/batch_size)) :
        dist, ind = model.kneighbors(da_df[b*batch_size:min(da_df.shape[0], (b+1)*batch_size), :])
        indices = np.concatenate([indices, cp.asnumpy(ind)], axis=0)
        print(f"Max index: {indices.max()}")
        distances = np.concatenate([distances, cp.asnumpy(dist)], axis=0)

    ## Drop the first column (self-neighbors) from both outputs
    return distances[:, 1:], indices[:, 1:]

## Generate data
df = pd.DataFrame(make_blobs(n_samples=7000000, centers=1, n_features=768, random_state=0)[0])

## Set up client
cluster = LocalCUDACluster(
    protocol="ucx",
    rmm_pool_size="1GB"
)
client = Client(cluster)

The last command produces the following system messages:

2023-01-26 17:32:09,234 - distributed.comm.ucx - WARNING - A CUDA context for device 0 (b'GPU-30a44e33-50f4-707f-dccc-e98b08e2cdf9') already exists on process ID 2654889. This is often the result of a CUDA-enabled library calling a CUDA runtime function before Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen at import time or in the global scope of a program.
2023-01-26 17:32:10,627 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-cg7q0d0v', purging
2023-01-26 17:32:10,628 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-12owxm0r', purging
2023-01-26 17:32:10,628 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-z9rtbi3p', purging
2023-01-26 17:32:10,628 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-01xxa6pf', purging
2023-01-26 17:32:10,628 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-11hizl15', purging
2023-01-26 17:32:10,629 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-27r_3i27', purging
2023-01-26 17:32:10,629 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-pri9gdfp', purging
2023-01-26 17:32:10,629 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-ldsnbyiy', purging
2023-01-26 17:32:10,630 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-26 17:32:10,630 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
(... the two preloading messages above repeat once per worker, 8 workers in total ...)

Finally, compute the KNN graph:

distances, indices = KNNGraph(df, 10, client, metric='cosine', batch_size=1000000)
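(Aside: downstream, the (distances, indices) pair is meant to be assembled into a sparse adjacency matrix. A sketch with SciPy, not part of the repro; note that csr_matrix validates column indices, so corrupt values like the ones reported below would raise a ValueError here:)

import numpy as np
from scipy.sparse import csr_matrix

def to_adjacency(distances, indices, n_samples):
    ## One row per point, one entry per neighbor
    k = indices.shape[1]
    rows = np.repeat(np.arange(n_samples), k)
    return csr_matrix((distances.ravel(), (rows, indices.ravel())),
                      shape=(n_samples, n_samples))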

The max indices printed for each batch are:

Max index: 6124999
Max index: 6999990
Max index: 4712256241410432023
Max index: 4713406367098851021
Max index: 4713406367098851021
Max index: 4713406367098851021
Max index: 4713406367098851021

Expected behavior The max index can be at most the largest row index, which is:

df.index.max()
6999999

In the run above, the max indices of the first two batches are < 7000000, but those of the last five are not. The indices of the graph computed by kneighbors are therefore wrong. In other runs, all batches had a max index > 7000000.
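As a quick sanity check on the magnitude of the corrupt values (my arithmetic): they are hundreds of billions of times the number of points, yet still representable as int64, i.e. far outside any valid range without wrapping around the dtype:

import numpy as np

bad = 4712256241410432023
print(bad <= np.iinfo(np.int64).max)  ## True: still a representable int64
print(bad / 7_000_000)                ## ~6.7e11 times the number of points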

Environment details (please complete the following information):

vdet commented 1 year ago

The issue persists after upgrading to the latest (Jan 31st, 2023) development version of RAPIDS.