haoyu-zc opened this issue 2 months ago
I'll first discuss the second issue, found in function get_parts, or more specifically, in function run_quantile_clustering:
@njit(cache=True, nogil=True)
def run_quantile_clustering(data: NDArray, k: int) -> NDArray[np.int16]:
    """
    Performs a simple quantile clustering on one dimensional data (1d). Quantile
    clustering is defined as the procedure that forms clusters in 1d data by
    separating objects using quantiles (for instance, if the median is used, two
    clusters are generated with objects separated by the median). In the case
    data contains all the same values (zero variance), this implementation can
    return less clusters than specified with k.

    Args:
        data: a 1d numpy array with numerical values.
        k: the number of clusters to split the data into.

    Returns:
        A 1d array with the data partition.
    """
    data_sorted = np.argsort(data, kind="quicksort")
    data_rank = rank(data, data_sorted)
    data_perc = data_rank / len(data)

    percentiles = [0.0] + get_perc_from_k(k) + [1.0]

    cut_points = np.searchsorted(data_perc[data_sorted], percentiles, side="right")

    current_cluster = 0
    part = np.zeros(data.shape, dtype=np.int16) - 1

    for i in range(len(cut_points) - 1):
        lim1 = cut_points[i]
        lim2 = cut_points[i + 1]
        part[data_sorted[lim1:lim2]] = current_cluster
        current_cluster += 1

    return part
The assignment part[data_sorted[lim1:lim2]] = current_cluster will update the partition assignment of a data point that falls exactly on a partition border twice, which we can tell from the test report above: the data point 0.29614019752214493 falls on the right border of partition 1, (0.1245614294340111, 0.29614019752214493], yet CCC treats it as belonging to partition 2. I think one data point should belong to exactly one partition.
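To make this border behavior concrete, here is a minimal standalone sketch of the same cut-point logic. It is not the actual CCC code: ordinal ranks (a double argsort) stand in for ccc's rank() helper, and [i/k for i in 1..k-1] stands in for get_perc_from_k(k); the latter matches the CPU percentages printed in the test output, but both substitutions are assumptions of this sketch.

import numpy as np

def toy_quantile_clustering(data, k):
    # simplified stand-in for run_quantile_clustering (tie-free data assumed)
    data_sorted = np.argsort(data, kind="quicksort")
    data_rank = np.empty(len(data))
    data_rank[data_sorted] = np.arange(1, len(data) + 1)  # ordinal ranks 1..n
    data_perc = data_rank / len(data)
    percentiles = [0.0] + [i / k for i in range(1, k)] + [1.0]
    cut_points = np.searchsorted(data_perc[data_sorted], percentiles, side="right")
    part = np.full(data.shape, -1, dtype=np.int16)
    for i in range(len(cut_points) - 1):
        part[data_sorted[cut_points[i]:cut_points[i + 1]]] = i
    return part

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(np.quantile(data, 0.5))            # 3.0 -> the k=2 partition boundary
print(toy_quantile_clustering(data, 2))  # [0 0 1 1 1]: the boundary value 3.0
                                         # ends up in cluster 1 rather than in
                                         # the right-closed range (-inf, 3.0]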
I suggest we make the partitions half-open and update the assignment mentioned above accordingly, or simply use library calls, which are well tested and potentially faster than our own implementation:
# @njit(cache=True, nogil=True)
def run_quantile_clustering(data: NDArray, k: int) -> NDArray[np.int16]:
    """
    Performs a simple quantile clustering on one dimensional data (1d). Quantile
    clustering is defined as the procedure that forms clusters in 1d data by
    separating objects using quantiles (for instance, if the median is used, two
    clusters are generated with objects separated by the median). In the case
    data contains all the same values (zero variance), this implementation can
    return less clusters than specified with k.

    Args:
        data: a 1d numpy array with numerical values.
        k: the number of clusters to split the data into.

    Returns:
        A 1d array with the data partition.
    """
    data_sorted = np.argsort(data, kind="quicksort")
    data_rank = rank(data, data_sorted)
    data_perc = data_rank / len(data)

    percentiles = get_perc_from_k(k)

    # numpy functions for data binning
    bins = np.quantile(data, percentiles)
    part = np.digitize(data, bins)

    return part
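One convention detail worth spelling out for this suggestion (a small standalone check, not part of the patch itself): np.digitize with its default right=False produces half-open [a, b) bins, so a value exactly equal to a quantile boundary is placed in the higher bin, while right=True produces right-closed (a, b] bins instead.

import numpy as np

bins = np.array([3.0])                   # a single boundary, e.g. the median
x = np.array([2.0, 3.0, 4.0])
print(np.digitize(x, bins))              # [0 1 1]: half-open [a, b), edge value goes up
print(np.digitize(x, bins, right=True))  # [0 0 1]: right-closed (a, b], edge value stays down

Whichever convention we pick, it should match the GPU implementation and be documented.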
In summary, this edge case won't affect the coefficient results significantly because it rarely happens. Still, it would be nice to fix it so that the binning logic is consistent with mainstream libraries.
For the first issue, I believe it's caused by an additional edge case leading to the unexpected result. Specifically, the cut_points generated by searchsorted incorrectly include extra data points, as shown in the debugging screenshot.
This issue can be resolved by using the updated run_quantile_clustering function, and it shouldn't significantly affect the computed coefficient. However, from a correctness standpoint, I believe the original implementation is flawed and should be fixed.
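As a toy illustration of this kind of boundary disagreement (reusing toy_quantile_clustering from the sketch earlier in this thread, with the same assumptions; it is not a reproduction of the exact case in the screenshot):

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
bins = np.quantile(data, [1 / 3, 2 / 3])   # ~[2.33, 3.67] for k = 3
print(np.digitize(data, bins))             # [0 0 1 2 2]
print(toy_quantile_clustering(data, 3))    # [0 1 1 2 2]: the value 2.0 lies below
                                           # the first quantile boundary (~2.33),
                                           # yet the rank-based cut points place it
                                           # in cluster 1, similar to the data[77]
                                           # case in the test output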
Great catch, @haoyu-zc! I agree we should fix this in the original repo (https://github.com/greenelab/ccc). I think it would be good to have a set of updated test cases for the quantile clustering and related functions to ensure we are testing normal and edge cases like these. We can discuss more at our meeting today if you want.
This is the output I collected from one of my tests, which compares the get_parts results of ccc and ccc-gpu. We can see that the results differ at indices 77 and 78.
For data[77] == 0.1201965612131689, it clearly should fall into partition 0, (-inf, 0.1245614294340111]; however, ccc gives us partition 1. This is the first issue I found.
For data[78] == 0.29614019752214493, we can see it falls right onto the right boundary of partition 1, (0.1245614294340111, 0.29614019752214493], and ccc treats it as belonging to partition 2. This is not necessarily a bug, but it should be handled or documented explicitly.
See the comments below for more analysis.
============================== 1 failed in 0.52s ===============================
FAILED [100%]
GPU percentiles: [0.16666666666666666 0.33333333333333337 0.5 0.6666666666666666 0.8333333333333334]
GPU quantiles: [0.1245614294340111 0.29614019752214504 0.46748098725200393 0.6176354970758771 0.78612710718456]
Testing with feature_size=100, clusters=[6]
GPU output shape: (1, 1, 100)
CPU output shape: (1, 100)
Partition 0:
GPU unique values (partitions): [0 1 2 3 4 5]
CPU unique values (partitions): [0 1 2 3 4 5]
Differences found in partition 0:
Number of differing elements: 2
First 10 differing indices: [77 78]
GPU values at these indices: [0 1]
CPU values at these indices: [1 2]
Object values at these indices: [0.1201965612131689 0.29614019752214493]
Verifying partition for feature[77] = 0.1201965612131689
CPU percentages: [0.16666666666666666, 0.3333333333333333, 0.5, 0.6666666666666666, 0.8333333333333333]
CPU quantities: [0.1245614294340111 0.29614019752214493 0.46748098725200393 0.6176354970758771 0.7861271071845599]
All partition ranges:
Partition 0 range: (-inf, 0.1245614294340111]
Partition 1 range: (0.1245614294340111, 0.29614019752214493]
Partition 2 range: (0.29614019752214493, 0.46748098725200393]
Partition 3 range: (0.46748098725200393, 0.6176354970758771]
Partition 4 range: (0.6176354970758771, 0.7861271071845599]
Partition 5 range: (0.7861271071845599, inf)
Data point 0.1201965612131689 should fall in partition 0
Partition computed by CCC_CPU: 1