Describe the bug
First of all, I am no expert. Please forgive me if the problem is an error of mine and not a bug.
I am trying to create a K nearest neighbors graph for 7,000,000+ points sitting in a 768 dimension space, with k=10.
Since a single GPU does not have enough RAM for the task, I reverted to the dask implementation of cuml nearest neighbors.
cuml.dask's kneighbors returns erroneous indices containing values much greater than the number of nodes in the graph.
Steps/Code to reproduce bug
import math
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from distributed import Client
import cupy as cp
import dask.array as da
from cuml.dask.neighbors import NearestNeighbors as cuDistNearestNeighbors
from dask_cuda import LocalCUDACluster
## This function constructs the KNN graph
def KNNGraph(df, k, client, metric='cosine', batch_size=524288, n_workers=8):
    chunk_size = math.ceil(df.shape[0] / n_workers)
    ## Create a dask array from the dataframe
    da_df = da.from_array(df.to_numpy(dtype=np.float32), chunks=chunk_size)
    ## Use k+1 because self loops are removed later
    model = cuDistNearestNeighbors(client=client, n_neighbors=k+1, metric=metric, batch_size=batch_size)
    model.fit(da_df)
    ## For some reason I had to batch the calls to kneighbors due to an issue in cuml
    ## (https://github.com/rapidsai/cuml/issues/5056#issuecomment-1339626036).
    ## The 3 lines of code below should work in principle but didn't:
    ## distances, indices = model.kneighbors(da_df)
    ## indices = cp.asnumpy(indices)
    ## distances = cp.asnumpy(distances)
    distances, indices = model.kneighbors(da_df[0:batch_size, :])
    indices = cp.asnumpy(indices)
    distances = cp.asnumpy(distances)
    print(f"Max index; {indices.max()}")  ## should be < df.shape[0]
    for b in range(1, math.ceil(da_df.shape[0] / batch_size)):
        dist, ind = model.kneighbors(da_df[b*batch_size:min(da_df.shape[0], (b+1)*batch_size), :])
        indices = np.concatenate([indices, cp.asnumpy(ind)], axis=0)
        print(f"Max index: {indices.max()}")
        distances = np.concatenate([distances, cp.asnumpy(dist)], axis=0)
    ## Drop the self-loop column before returning
    return distances[:, 1:], indices[:, 1:]
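As a side note, a per-batch check like the sketch below (the helper is hypothetical; it was not part of the run whose output is shown further down) would make the failure explicit inside the loop above:
## Hypothetical helper, not used in the run below: every neighbor index
## returned by kneighbors should refer to an existing row, i.e. lie in [0, n_rows).
def check_indices(ind, n_rows):
    assert ind.min() >= 0 and ind.max() < n_rows, \
        f"invalid neighbor index: min={ind.min()}, max={ind.max()}, n_rows={n_rows}"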
## Generate data
df = pd.DataFrame(make_blobs(n_samples=7000000, centers=1, n_features=768, random_state=0)[0])
## Set up client
cluster = LocalCUDACluster(
    protocol="ucx",
    rmm_pool_size="1GB"
)
client = Client(cluster)
The last command produces the following system messages:
2023-01-26 17:32:09,234 - distributed.comm.ucx - WARNING - A CUDA context for device 0 (b'GPU-30a44e33-50f4-707f-dccc-e98b08e2cdf9') already exists on process ID 2654889. This is often the result of a CUDA-enabled library calling a CUDA runtime function before Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen at import time or in the global scope of a program.
2023-01-26 17:32:10,627 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-cg7q0d0v', purging
2023-01-26 17:32:10,628 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-12owxm0r', purging
2023-01-26 17:32:10,628 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-z9rtbi3p', purging
2023-01-26 17:32:10,628 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-01xxa6pf', purging
2023-01-26 17:32:10,628 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-11hizl15', purging
2023-01-26 17:32:10,629 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-27r_3i27', purging
2023-01-26 17:32:10,629 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-pri9gdfp', purging
2023-01-26 17:32:10,629 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-ldsnbyiy', purging
2023-01-26 17:32:10,630 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-26 17:32:10,630 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-01-26 17:32:10,650 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-26 17:32:10,650 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-01-26 17:32:10,711 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-26 17:32:10,711 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-01-26 17:32:10,745 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-26 17:32:10,745 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-01-26 17:32:10,775 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-26 17:32:10,775 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-01-26 17:32:10,809 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-26 17:32:10,809 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-01-26 17:32:10,835 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-26 17:32:10,835 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-01-26 17:32:10,893 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-01-26 17:32:10,893 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
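For completeness, the number of workers visible to the client can be checked as follows (a sanity check added for illustration, not part of the original run):
## Sanity check: all 8 GPUs should have registered as dask workers
print(len(client.scheduler_info()["workers"]))  ## expected: 8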
Finally, compute the KNN graph:
distances, indices = KNNGraph(df, 10, client, metric='cosine', batch_size=1000000)
The max indices for each batch are
Max index; 6124999
Max index: 6999990
Max index: 4712256241410432023
Max index: 4713406367098851021
Max index: 4713406367098851021
Max index: 4713406367098851021
Max index: 4713406367098851021
Expected behavior
The max index returned cannot possibly exceed the largest row index, which is:
df.index.max()
6999999
In the run above, the max indices of the first two batches are < 7,000,000, but those of the last five batches are not. Thus the indices computed by kneighbors are wrong. In other runs, all batches had a max index > 7,000,000.
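In other words, a bounds check such as the following should pass on the full result but does not (a sketch for illustration, not part of the run above):
## Every returned neighbor index must point at one of the df.shape[0] rows
assert indices.min() >= 0 and indices.max() < df.shape[0]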
Environment details (please complete the following information):