opendp / opendp

The core library of differential privacy algorithms powering the OpenDP Project.
https://opendp.org
MIT License

Add Sample and Aggregate Framework #1481

Open chikeabuah opened 3 weeks ago

chikeabuah commented 3 weeks ago

https://programming-dp.com/ch7.html#sample-and-aggregate

Shoeboxam commented 3 weeks ago

One potential design for this.

Recalling the (L0, L1, L∞) triple used to characterize distances between neighboring partitions:

import numpy as np

# d_Sym: the symmetric distance between multisets, assumed defined elsewhere
def d_Partition(x, x_p, d=d_Sym):
    """L0, L1 and L∞ norms of the distances between neighboring partitions."""
    dists = np.abs(np.array([d(x_i, y_i) for x_i, y_i in zip(x, x_p)]))
    #      |d(x, x')|_0     , |d(x, x')|_1, |d(x, x')|_∞
    return (dists != 0).sum(), dists.sum(), dists.max()
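
To see the triple in action, here is a small self-contained sketch. The `Counter`-based `d_sym` below is my own stand-in for a multiset symmetric distance; the names are illustrative, not OpenDP API:

```python
import numpy as np
from collections import Counter

def d_sym(x, y):
    # symmetric distance between multisets: number of records you must
    # add or remove to turn x into y
    cx, cy = Counter(x), Counter(y)
    return sum((cx - cy).values()) + sum((cy - cx).values())

def d_partition(x, x_p, d=d_sym):
    dists = np.abs(np.array([d(x_i, y_i) for x_i, y_i in zip(x, x_p)]))
    return int((dists != 0).sum()), int(dists.sum()), int(dists.max())

# x_p moves the record 1 from partition 0 into partition 1
x   = [(1, 2), (3, 4), (5,)]
x_p = [(2,), (3, 4, 1), (5,)]
print(d_partition(x, x_p))  # (2, 2, 1): two partitions changed, by one record each
```

Moving a single record across partitions perturbs two partitions by one record each, which is exactly what the (L0, L1, L∞) triple reports.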

Sample-and-aggregate works with any transformation that splits the data and measures distances with this triple. One such transformation takes a random partitioning:


def make_partition_randomly(input_domain, num_partitions):
    dp.assert_features("contrib")
    assert can_be_partitioned(input_domain) # like vectors, arrays

    def f_partition_randomly(data):
        data = np.array(data, copy=True)
        # randomly partition data into `num_partitions` data sets
        np.random.shuffle(data)
        return np.array_split(data, num_partitions)

    return dp.t.make_user_transformation(
        input_domain=input_domain,
        input_metric=dp.symmetric_distance(),
        output_domain=dp.vector_domain(input_domain, size=num_partitions),
        output_metric=partition_distance(dp.symmetric_distance()),
        function=f_partition_randomly,
        # TODO: what kind of domain descriptors can you use to improve this?
        # TODO: how might you modify the function to improve this?
        stability_map=lambda b_in: (b_in, b_in, b_in))
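
As a sanity check on the function part alone (a standalone sketch, outside the OpenDP API, so it runs without the `honest-but-curious` feature):

```python
import numpy as np

def partition_randomly(data, num_partitions):
    # standalone version of f_partition_randomly above
    data = np.array(data, copy=True)
    np.random.shuffle(data)
    return np.array_split(data, num_partitions)

parts = partition_randomly(list(range(10)), num_partitions=3)
print([len(p) for p in parts])        # [4, 3, 3]: partition sizes differ by at most one
print(sorted(np.concatenate(parts)))  # every record lands in exactly one partition
```

`np.array_split` tolerates sizes that do not divide evenly, which is why the vector output domain above can fix `size=num_partitions` without constraining the input length.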

Alternatively, you might group by a key, or partition randomly so that each user contributes to at most one partition.
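
One hypothetical way to realize the "at most one partition per user" variant is to bucket records by a user key. All names here are illustrative sketches, not OpenDP API:

```python
def partition_by_user(records, num_partitions, user_id=lambda r: r[0]):
    # all of a user's records land in one bucket, so a user who changes
    # their data perturbs at most one partition (L0 contribution of 1 per user)
    parts = [[] for _ in range(num_partitions)]
    for r in records:
        parts[hash(user_id(r)) % num_partitions].append(r)
    return parts

records = [(0, "a"), (0, "b"), (1, "c"), (2, "d"), (2, "e")]
parts = partition_by_user(records, num_partitions=2)
# no user appears in more than one partition
users_per_part = [set(u for u, _ in p) for p in parts]
assert all(len(a & b) == 0 for a in users_per_part for b in users_per_part if a is not b)
```

Note that bucketing by key makes partition sizes data-dependent, so unlike the random split the resulting domain cannot promise fixed partition sizes.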

Regardless of how you split or sample the data, the sample-and-aggregate transformation need only make use of the L0 field of the distance triple:

def make_sample_and_aggregate(input_domain, output_domain, num_partitions, black_box):
    dp.assert_features("contrib", "honest-but-curious")

    # SAMPLE: randomly partition data into `num_partitions` data sets
    trans_subsample = make_partition_randomly(input_domain, num_partitions)

    # AGGREGATE: apply `black_box` function to each partition
    trans_aggregate = dp.t.make_user_transformation(
        input_domain=dp.vector_domain(input_domain, size=num_partitions),
        output_domain=dp.vector_domain(output_domain, size=num_partitions),
        input_metric=partition_distance(dp.symmetric_distance()),
        output_metric=dp.hamming_distance(),
        function=lambda parts: [black_box(part) for part in parts],
        stability_map=lambda b_in: b_in[0]) # the L0 distance bounds how many outputs can change

    return trans_subsample >> trans_aggregate
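
Putting the pieces together outside the OpenDP API: a plain-Python sketch of the full pipeline, including a clamp-and-Laplace release step that the proposal above leaves to downstream measurements. The bounds, function names, and parameters are my own illustrative assumptions:

```python
import numpy as np

def sample_and_aggregate(data, num_partitions, black_box,
                         lower, upper, epsilon, rng=None):
    rng = rng or np.random.default_rng(0)
    # SAMPLE: randomly partition the data
    data = np.array(data, copy=True)
    rng.shuffle(data)
    parts = np.array_split(data, num_partitions)
    # AGGREGATE: run the black box on each partition, clamped to known bounds
    answers = np.clip([black_box(p) for p in parts], lower, upper)
    # RELEASE: one changed record perturbs one partition, hence one clamped
    # answer, so the mean of answers has sensitivity (upper - lower) / num_partitions
    sensitivity = (upper - lower) / num_partitions
    return answers.mean() + rng.laplace(scale=sensitivity / epsilon)

data = np.random.default_rng(1).normal(loc=10, size=1000)
release = sample_and_aggregate(data, num_partitions=50, black_box=np.mean,
                               lower=0, upper=20, epsilon=1.0)
```

The key property is that `black_box` can be arbitrary (even non-private) because its influence on the release is bounded through the partition structure, matching the Hamming-distance argument in the aggregate step above.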