rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

cuml DBSCAN running slow with numba device array [QST] #4276

Open Xuechun-Hu opened 3 years ago

Xuechun-Hu commented 3 years ago

What is your question? I'm trying to use cuML from RAPIDS to accelerate DBSCAN clustering of roughly 15.6 million float64 data points.

```python
pp = nb.cuda.to_device(ps)  # ps is a (15636915, 2) cupy array

with cuml.using_output_type('input'):
    db_gpu = cumlDBSCAN(eps=0.8, min_samples=100, verbose=5).fit_predict(ps, out_dtype='int64')
```

But it is running much slower than scikit-learn on the CPU, and it is not taking up the full memory of my GPU.

Log info:

```
[D] [20:08:15.269066] cuml/common/logger.cpp:3088 Non contiguous array or view detected, a contiguous copy of the data will be done.
[D] [20:08:15.274689] ../src/dbscan/dbscan.cuh:133 #0 owns 15636915 rows
[D] [20:08:15.275074] ../src/dbscan/dbscan.cuh:150 Dataset memory: 250 MB
[D] [20:08:15.275146] ../src/dbscan/dbscan.cuh:152 Estimated available memory: 40583 / 51041 MB
[D] [20:08:15.275214] ../src/dbscan/dbscan.cuh:161 Running batched training (batch size: 287, estimated: 40530.888272 MB)
[D] [20:08:15.275703] ../src/dbscan/dbscan.cuh:182 Workspace size: 4753.627392 MB
```

nvidia-smi:

```
0   N/A  N/A   39054   C   ...s/rapids-21.10/bin/python   6535MiB
6950MiB / 48677MiB memory
```

cjnolet commented 3 years ago

@Xuechun-Hu,

DBSCAN's batching approach is currently not ideal: it requires the distances between the points in each batch and all the other points in the dataset to be resident in GPU memory at the same time. This causes the batch size to shrink in proportion to the total number of points, so, as you are seeing here, the batch size is only 287, and with 15 million total points that results in over 52k batches. You should be able to improve performance by increasing your batch size, for example by using float32 inputs instead of float64 (a minimal sketch follows below).
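As a rough sanity check on the numbers in the log: if the batched distance matrix is kept at input precision, a 287-row batch against ~15.6 million points is on the order of 287 × 15,636,915 × 8 bytes ≈ 36 GB of float64 distances, which is in the same ballpark as the ~40 GB estimate logged above. The sketch below is a hypothetical illustration of the float32 cast, using a random stand-in for the real point cloud and the eps/min_samples values from the question:

```python
import cupy as cp
import cuml
from cuml.cluster import DBSCAN as cumlDBSCAN

# Hypothetical stand-in for the real (15636915, 2) float64 point cloud.
ps = cp.random.random((15_636_915, 2))

# Make a contiguous float32 copy of the inputs up front: smaller elements let more
# rows fit into each distance batch (the log above also shows cuML making a
# contiguous copy itself when it is handed a non-contiguous view).
ps32 = cp.ascontiguousarray(ps, dtype=cp.float32)

with cuml.using_output_type('input'):
    labels = cumlDBSCAN(eps=0.8, min_samples=100, verbose=5).fit_predict(ps32, out_dtype='int64')
```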

We're working on ways to fix this, but for now the algorithm unfortunately has some scaling limitations. Another option might be to try the new HDBSCAN implementation, which is available in cuML 21.10.

Xuechun-Hu commented 3 years ago

Thank you very much, that explains a lot! Appreciate it. Changing the input data format only increased the batch size by 1, from 287 to 288. I also ran into a problem when using HDBSCAN:

| ID | GPU | MEM |
|----|-----|-----|
| 0  | 14% | 4%  |

```
terminate called after throwing an instance of 'rmm::bad_alloc'
  what():  std::bad_alloc: CUDA error at: /home/huang/Charles2021/python/envs/rapids-21.10/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
```

The GPU has 48 GB of memory and the dataset is only about 250 MB. I tried various input types, including CuPy and NumPy arrays.
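One possible, untested workaround for the rmm::bad_alloc here is to let RMM oversubscribe into host memory via managed (unified) memory. This is only a hedged sketch: rmm.reinitialize and managed memory are standard RMM features, but whether HDBSCAN then completes on this dataset is not verified in this thread, and the random data is a hypothetical stand-in for the real point cloud.

```python
import cupy as cp
import rmm
from cuml.cluster import HDBSCAN

# Re-initialize RMM with managed (unified) memory so cuML's device allocations,
# which go through RMM (as the rmm::bad_alloc traceback shows), can spill to
# host memory instead of failing outright.
rmm.reinitialize(managed_memory=True)

# Hypothetical stand-in for the real (15636915, 2) point cloud, in float32.
pts = cp.random.random((15_636_915, 2)).astype(cp.float32)

labels = HDBSCAN(min_cluster_size=5000, min_samples=100, verbose=True).fit_predict(pts)
```

Managed memory that spills to the host is much slower than on-device memory, so this mainly helps confirm whether the workload really needs more than 48 GB rather than making it fast.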

cjnolet commented 3 years ago

@Xuechun-Hu, it would be helpful for us if you could supply a script and dataset that reproduce the behavior you are seeing.

Xuechun-Hu commented 3 years ago

@cjnolet Thanks. Here is my code:

```python
import pandas as pd
import cudf
import cupy as cp
import cuml
from cuml.cluster import DBSCAN as cumlDBSCAN
from cuml.cluster import HDBSCAN
import os
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_circles
import numpy as np
import numba as nb
import open3d as o3d

pcd = o3d.io.read_point_cloud('data/pointcloud/block_0.ply')
pc = np.asarray(pcd.points)[:, :2]
ps = cp.asarray(pcd.points, dtype='float32')[:, :2]
pp = nb.cuda.to_device(ps)

with cuml.using_output_type('input'):
    db_gpu = HDBSCAN(min_cluster_size=5000, min_samples=100, verbose=True, p=1).fit(pp)
```

I'm trying to cluster a (15636915, 2) dataset, which looks like this:

```
[[-230.19247    43.109245]
 [-225.90079    41.327675]
 [-222.1525     33.68174 ]
 ...
 [-224.55055    22.347908]
 [-224.50694    22.773636]
 [-223.6284     24.611273]]
```

Xuechun-Hu commented 3 years ago

Update: setting min_samples = 50, HDBSCAN runs but takes up nearly 40 GB of memory.

Xuechun-Hu commented 3 years ago

The data file can be found at https://drive.google.com/drive/folders/1FXQ9zT3uqEblHsw5xaC-w_KyUaczEAI4?usp=sharing

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.