haoyu-zc / ccc-gpu


Edge case found in function get_parts #1

Open haoyu-zc opened 1 week ago

haoyu-zc commented 1 week ago

============================= test session starts ==============================
collecting ... collected 1 item

test_get_parts_debug.py::test_get_parts[cluster_settings0-100]

============================== 1 failed in 0.50s ===============================
FAILED [100%]

GPU percentiles: [0.16666666666666666 0.33333333333333337 0.5 0.6666666666666666 0.8333333333333334]

Testing with feature_size=100, clusters=[6]
GPU output shape: (1, 1, 100)
CPU output shape: (1, 100)

Partition 0:
GPU unique values (partitions): [0 1 2 3 4 5]
CPU unique values (partitions): [0 1 2 3 4 5]

Differences found in partition 0:
Number of differing elements: 1
First 10 differing indices: [78]
GPU values at these indices: [1]
CPU values at these indices: [2]
Object values at these indices: [0.29614019752214493]

Verifying partition for feature[78] = 0.29614019752214493
CPU percentages: [0.16666666666666666, 0.3333333333333333, 0.5, 0.6666666666666666, 0.8333333333333333]
CPU quantities: [0.1245614294340111 0.29614019752214493 0.46748098725200393 0.6176354970758771 0.7861271071845599]

All partition ranges:
Partition 0 range: (-inf, 0.1245614294340111]
Partition 1 range: (0.1245614294340111, 0.29614019752214493]
Partition 2 range: (0.29614019752214493, 0.46748098725200393]
Partition 3 range: (0.46748098725200393, 0.6176354970758771]
Partition 4 range: (0.6176354970758771, 0.7861271071845599]
Partition 5 range: (0.7861271071845599, inf)
Data point 0.29614019752214493 should fall in partition 1
Partition computed by CCC_CPU: 2

tests/gpu/test_get_parts_debug.py:61 (test_get_parts[cluster_settings0-100])
1 != 2
Expected :2
Actual   :1
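The two answers correspond to the two possible closure conventions for the quantile ranges. A minimal sketch (my own, using the cut points copied from the report above) reproduces both values with np.digitize:

```python
import numpy as np

# Quantile cut points copied from the CPU report above.
bins = np.array([0.1245614294340111, 0.29614019752214493,
                 0.46748098725200393, 0.6176354970758771,
                 0.7861271071845599])
x = 0.29614019752214493  # the disputed data point, exactly on a cut

# Right-closed ranges (a, b], as listed in the report:
np.digitize(x, bins, right=True)   # -> 1 (the GPU answer)
# Left-closed ranges [a, b), the np.digitize default:
np.digitize(x, bins, right=False)  # -> 2 (the CPU answer)
```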

haoyu-zc commented 1 week ago

Recently, I've been debugging CCC-GPU against the results generated by CCC. So far, I've concluded that the minor discrepancy I found is due to an edge case in the function get_parts, or more specifically, in run_quantile_clustering:

@njit(cache=True, nogil=True)
def run_quantile_clustering(data: NDArray, k: int) -> NDArray[np.int16]:
    """
    Performs a simple quantile clustering on one dimensional data (1d). Quantile
    clustering is defined as the procedure that forms clusters in 1d data by
    separating objects using quantiles (for instance, if the median is used, two
    clusters are generated with objects separated by the median). In the case
    data contains all the same values (zero variance), this implementation can
    return less clusters than specified with k.

    Args:
        data: a 1d numpy array with numerical values.
        k: the number of clusters to split the data into.

    Returns:
        A 1d array with the data partition.
    """
    data_sorted = np.argsort(data, kind="quicksort")
    data_rank = rank(data, data_sorted)
    data_perc = data_rank / len(data)

    percentiles = [0.0] + get_perc_from_k(k) + [1.0]

    cut_points = np.searchsorted(data_perc[data_sorted], percentiles, side="right")

    current_cluster = 0
    part = np.zeros(data.shape, dtype=np.int16) - 1

    for i in range(len(cut_points) - 1):
        lim1 = cut_points[i]
        lim2 = cut_points[i + 1]

        part[data_sorted[lim1:lim2]] = current_cluster
        current_cluster += 1

    return part

The assignment part[data_sorted[lim1:lim2]] = current_cluster updates the partition twice for a data point that falls exactly on a partition border, which we can tell from the test report above:

The data point 0.29614019752214493 falls on the right border of partition 1, (0.1245614294340111, 0.29614019752214493], yet CCC treats it as belonging to partition 2.
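Which cluster a border point lands in is also sensitive to how a tie between data_perc and a cut percentile is resolved by np.searchsorted. A toy sketch (my own numbers, not the CCC data) shows the tie-breaking side flipping the cut index, and hence the cluster of the tied point:

```python
import numpy as np

# Percentiles of 6 ranked points; one cut coincides bit-exactly
# with data_perc[1].
data_perc = np.array([1, 2, 3, 4, 5, 6]) / 6
cut = 2 / 6

# side="right" puts the tied point below the cut (lower slice of the
# loop above); side="left" would put it above (upper slice).
np.searchsorted(data_perc, cut, side="right")  # -> 2
np.searchsorted(data_perc, cut, side="left")   # -> 1
```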

I suggest we make the partitions closed on one side only, i.e., update the assignment mentioned above, or simply use library calls that already handle this well:

# @njit(cache=True, nogil=True)
def run_quantile_clustering(data: NDArray, k: int) -> NDArray[np.int16]:
    """
    Performs a simple quantile clustering on one dimensional data (1d). Quantile
    clustering is defined as the procedure that forms clusters in 1d data by
    separating objects using quantiles (for instance, if the median is used, two
    clusters are generated with objects separated by the median). In the case
    data contains all the same values (zero variance), this implementation can
    return less clusters than specified with k.

    Args:
        data: a 1d numpy array with numerical values.
        k: the number of clusters to split the data into.

    Returns:
        A 1d array with the data partition.
    """
    data_sorted = np.argsort(data, kind="quicksort")
    data_rank = rank(data, data_sorted)
    data_perc = data_rank / len(data)

    percentiles = get_perc_from_k(k)

    # NumPy binning: np.digitize assigns a value equal to a bin edge to the
    # higher bin, so each border point is placed exactly once.
    bins = np.quantile(data, percentiles)
    part = np.digitize(data, bins).astype(np.int16)
    return part
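As a quick sanity check of the proposed binning (my own toy data, not the CCC test input), a value that coincides exactly with a quantile cut is assigned once, deterministically, to the higher bin:

```python
import numpy as np

# Toy data: the value 0.3 lands exactly on the 1/3-quantile cut.
data = np.array([0.1, 0.3, 0.3, 0.5, 0.7, 0.9])
bins = np.quantile(data, [1 / 3, 2 / 3])  # two cuts -> three clusters
part = np.digitize(data, bins)
# Each point gets exactly one cluster; the border value 0.3 goes to
# the higher bin under np.digitize's left-closed convention.
# part -> [0 1 1 1 2 2]
```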