superliuxz / DBSCAN

Density-Based Spatial Clustering of Applications with Noise
GNU General Public License v2.0
9 stars 2 forks source link

DBSCAN

My multi-threaded, GPU-enabled implementation of Density-Based Spatial Clustering of Applications with Noise.

The GPU implementation has achieved >20x speedup against de facto SKLearn; multi-threaded CPU implementation is 40% faster than SKLearn with one thread, and scalable with multiple threads.

image info

image info

image info

Requirements

C++17; GCC 7.5; CUDA 10.2; Thrust; pthread.

How to build

How to run

Existing test data under test_inputs/.

CPU algorithm

GPU algorithm

Test data

The query parameters are selected to emulate 10% Noises.

test_input_20k.txt is generated using --cluster-std=2.2 --n-samples=20000, with 4 clusters. Should use --eps=0.15 --min-pts=180 to query.

test_input_50k.txt is generated using --cluster-std=2.2 --n-samples=50000, with 20 clusters. Should use --eps=0.09 --min-pts=110 to query.

test_input_100k.txt is generated using --cluster-std=2.2 --n-samples=100000, with 20 clusters. Should use --eps=0.07 --min-pts=100 to query.

test_input_200k.txt is generated using --cluster-std=2.2 --n-samples=200000, with 20 clusters. Should use --eps=0.05 --min-pts=100 to query.

test_input_300k.txt is generated using --cluster-std=2.2 --n-samples=300000, with 20 clusters. Should use --eps=0.044 --min-pts=100 to query.

test_input_400k.txt is generated using --cluster-std=2.2 --n-samples=400000, with 20 clusters. Should use --eps=0.04 --min-pts=105 to query.

test_input_500k.txt is generated using --cluster-std=2.2 --n-samples=500000, with 20 clusters. Should use --eps=0.035 --min-pts=115 to query.

test_input_600k.txt is generated using --cluster-std=2.2 --n-samples=600000, with 20 clusters. Should use --eps=0.035 --min-pts=120 to query.

test_input_700k.txt is generated using --cluster-std=2.2 --n-samples=600000, with 20 clusters. Should use --eps=0.033 --min-pts=122 to query.

test_input_800k.txt is generated using --cluster-std=2.2 --n-samples=800000, with 20 clusters. Should use --eps=0.03 --min-pts=120 to query.