rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] cuml.cluster.HDBSCAN.fit_predict (GPU accelerated) is slower than hdbscan.HDBSCAN.fit_predict (CPU only)! #6117

Closed: sava-1729 closed this issue 1 month ago

sava-1729 commented 1 month ago

Describe the bug
The hdbscan library's fit_predict is faster than cuML's GPU-accelerated HDBSCAN.fit_predict. How do I get GPU acceleration?

Steps/Code to reproduce bug
Try running this code (requirements: cupy, numba, hdbscan):

from time import perf_counter_ns

import cupy as cp
import numba as nb
import numpy as np
from cuml.cluster import HDBSCAN as HDBSCAN_GPU
from hdbscan import HDBSCAN as HDBSCAN_CPU

class Test:
    def __init__(self) -> None:
        # CPU (hdbscan) and GPU (cuML) models under comparison
        self.model = HDBSCAN_CPU(min_samples=10, min_cluster_size=10)
        self.model_cuml = HDBSCAN_GPU(min_samples=20, min_cluster_size=10)
        # Accumulated wall-clock times (ms) and iteration count
        self.total = 0
        self.total_cuml = 0
        self.counter = 0

    def test(self, num_points=4000, use_cupy=True, use_xy_only=False):
        # Random 3D point cloud in [0, 100)
        arr = np.random.random((num_points, 3)) * 100
        if use_xy_only:
            arr = arr[:, :2]
        if use_cupy:
            # Move the data to the GPU; both models receive the same
            # (device) array when use_cupy=True
            arr = cp.asarray(arr)
            arr = nb.cuda.to_device(arr)
        # Time the CPU model
        t0 = perf_counter_ns()
        y_hat = self.model.fit_predict(arr)
        elapsed = (perf_counter_ns() - t0) // 1_000_000  # ns -> ms
        self.total += elapsed
        print("------------------------------ CPU %d ms -----------------------------" % elapsed, flush=True)
        # Time the GPU model on the same data
        t0 = perf_counter_ns()
        y_hat = self.model_cuml.fit_predict(arr)
        elapsed = (perf_counter_ns() - t0) // 1_000_000  # ns -> ms
        self.total_cuml += elapsed
        print("------------------------------ GPU %d ms -----------------------------" % elapsed, flush=True)
        self.counter += 1

# Run the benchmark 100 times and report the average per-call time.
tester = Test()

for i in range(100):
    tester.test()

print("Average time %f ms over %d iterations on CPU." % (tester.total / tester.counter, tester.counter), flush=True)
print("Average time %f ms over %d iterations on GPU." % (tester.total_cuml / tester.counter, tester.counter), flush=True)

With the default point cloud size (4,000 points), I get the following output:

Average time 83.730000 ms over 100 iterations on CPU.
Average time 127.030000 ms over 100 iterations on GPU.
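
Note that the GPU average likely includes some one-time setup cost: the first call to cuML's fit_predict has to create the CUDA context and load kernels. Below is an illustrative sketch of a warm-up pass, assuming the Test class above, that keeps that first call out of the timed loop (hypothetical, not a configuration I measured):

# Hypothetical warm-up: run one untimed fit on each model so that
# one-time initialization does not count toward the averages.
warmup = np.random.random((1000, 3)) * 100
tester = Test()
tester.model.fit_predict(warmup)                   # CPU warm-up
tester.model_cuml.fit_predict(cp.asarray(warmup))  # GPU warm-up (triggers CUDA init)

for i in range(100):
    tester.test()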

Expected behavior
I would expect cuML's GPU-accelerated clustering to be much faster than the CPU-based one.

Environment details (please complete the following information):

divyegala commented 1 month ago

@sava-1729 the dataset sample size is too small and the timings too short (a few hundred milliseconds) to see any significant speedup. cuML HDBSCAN, and cuML algorithms in general, start showing speedups as the dataset grows to realistically sized workloads. You can try 40,000, 400,000, or 4,000,000 samples and let us know if you still do not see any speedups. For now, I will close the issue.
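
A minimal sketch of that suggestion, reusing the Test class from the report above and only increasing num_points (sizes taken from the comment; exact timings will depend on the GPU and are not claimed here):

# Hypothetical scaling run: the same benchmark at larger dataset sizes.
# 4,000,000 samples can be added, but the CPU run will take much longer.
for num_points in (40_000, 400_000):
    tester = Test()
    for i in range(10):  # fewer repetitions, since each fit takes longer
        tester.test(num_points=num_points)
    print("n=%d: CPU avg %.1f ms, GPU avg %.1f ms" % (
        num_points,
        tester.total / tester.counter,
        tester.total_cuml / tester.counter), flush=True)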