Open haoyu-zc opened 1 week ago
Recently, I've been debugging CCC-GPU using the results generated by CCC. For now, I came to the conclusion that the minor discrepancy I found is due to an edge case in function get_parts
, or more specifically, in function run_quantile_clustering
:
@njit(cache=True, nogil=True)
def run_quantile_clustering(data: NDArray, k: int) -> NDArray[np.int16]:
"""
Performs a simple quantile clustering on one dimensional data (1d). Quantile
clustering is defined as the procedure that forms clusters in 1d data by
separating objects using quantiles (for instance, if the median is used, two
clusters are generated with objects separated by the median). In the case
data contains all the same values (zero variance), this implementation can
return less clusters than specified with k.
Args:
data: a 1d numpy array with numerical values.
k: the number of clusters to split the data into.
Returns:
A 1d array with the data partition.
"""
data_sorted = np.argsort(data, kind="quicksort")
data_rank = rank(data, data_sorted)
data_perc = data_rank / len(data)
percentiles = [0.0] + get_perc_from_k(k) + [1.0]
cut_points = np.searchsorted(data_perc[data_sorted], percentiles, side="right")
current_cluster = 0
part = np.zeros(data.shape, dtype=np.int16) - 1
for i in range(len(cut_points) - 1):
lim1 = cut_points[i]
lim2 = cut_points[i + 1]
part[data_sorted[lim1:lim2]] = current_cluster
current_cluster += 1
return part
The assignment part[data_sorted[lim1:lim2]] = current_cluster
will update the partition for a datapoint that falls right onto the partition border twice, which we can tell from the test report above:
Data point 0.29614019752214493 falls on the right border of partition 1: (0.1245614294340111, 0.29614019752214493], and CCC treats it belongs to partition 2.
I suggest we make the partitions one-side-closed, therefore update the assignment mentioned above or simply use library calls which works well:
# @njit(cache=True, nogil=True)
def run_quantile_clustering(data: NDArray, k: int) -> NDArray[np.int16]:
"""
Performs a simple quantile clustering on one dimensional data (1d). Quantile
clustering is defined as the procedure that forms clusters in 1d data by
separating objects using quantiles (for instance, if the median is used, two
clusters are generated with objects separated by the median). In the case
data contains all the same values (zero variance), this implementation can
return less clusters than specified with k.
Args:
data: a 1d numpy array with numerical values.
k: the number of clusters to split the data into.
Returns:
A 1d array with the data partition.
"""
data_sorted = np.argsort(data, kind="quicksort")
data_rank = rank(data, data_sorted)
data_perc = data_rank / len(data)
percentiles = get_perc_from_k(k)
# numpy functions for data binning
bins = np.quantile(data, percentiles)
part = np.digitize(data, bins)
return part
============================= test session starts ============================== collecting ... collected 1 item
test_get_parts_debug.py::test_get_parts[cluster_settings0-100]
============================== 1 failed in 0.50s =============================== FAILED [100%]GPU percentiles: [0.16666666666666666 0.33333333333333337 0.5 0.6666666666666666 0.8333333333333334 ]
Testing with feature_size=100, clusters=[6] GPU output shape: (1, 1, 100) CPU output shape: (1, 100)
Partition 0: GPU unique values (partitions): [0 1 2 3 4 5] CPU unique values (partitions): [0 1 2 3 4 5]
Differences found in partition 0: Number of differing elements: 1 First 10 differing indices: [78] GPU values at these indices: [1] CPU values at these indices: [2] Object values at these indices: [0.29614019752214493]
Verifying partition for feature[78] = 0.29614019752214493 CPU percentages: [0.16666666666666666, 0.3333333333333333, 0.5, 0.6666666666666666, 0.8333333333333333] CPU quantities: [0.1245614294340111 0.29614019752214493 0.46748098725200393 0.6176354970758771 0.7861271071845599 ]
All partition ranges: Partition 0 range: (-inf, 0.1245614294340111] Partition 1 range: (0.1245614294340111, 0.29614019752214493] Partition 2 range: (0.29614019752214493, 0.46748098725200393] Partition 3 range: (0.46748098725200393, 0.6176354970758771] Partition 4 range: (0.6176354970758771, 0.7861271071845599] Partition 5 range: (0.7861271071845599, inf) Data point 0.29614019752214493 should fall in partition 1 Partition computed by CCC_CPU: 2
tests/gpu/test_get_parts_debug.py:61 (test_get_parts[cluster_settings0-100]) 1 != 2
Expected :2 Actual :1