chikeabuah opened 3 weeks ago
One potential design for this.

Recall the (l0, l1, l∞) triple used to characterize distances between neighboring partitionings:
```python
import numpy as np

def d_Partition(x, x_p, d=d_Sym):
    """L0, L1 and L∞ norms of the distances between neighboring partitions"""
    dists = abs(np.array([d(x_i, y_i) for x_i, y_i in zip(x, x_p)]))
    # |d(x, x')|_0 , |d(x, x')|_1 , |d(x, x')|_∞
    return (dists != 0).sum(), dists.sum(), dists.max()
```
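As a quick sanity check (illustrative only, not from the original post), here is the triple computed for two neighboring partitionings that differ by one record in one partition, with a plain multiset symmetric distance standing in for `d_Sym`:

```python
import numpy as np
from collections import Counter

def d_sym(a, b):
    # symmetric distance between two multisets: the number of records
    # you must add or remove to turn one into the other
    ca, cb = Counter(a), Counter(b)
    return sum((ca - cb).values()) + sum((cb - ca).values())

x   = [[1, 2, 3], [4, 5], [6]]
x_p = [[1, 2, 3], [4, 5, 9], [6]]  # one record added to the second partition

dists = np.array([d_sym(a, b) for a, b in zip(x, x_p)])
l0, l1, linf = (dists != 0).sum(), dists.sum(), dists.max()
print(l0, l1, linf)  # -> 1 1 1
```

Adding one record to one partition perturbs exactly one entry of `dists` by one, so all three coordinates of the triple are 1.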
Sample-and-aggregate works with any transformation that splits the data and measures distances with this triple. One such transformation takes a random partitioning:
```python
import numpy as np
import opendp.prelude as dp

def make_partition_randomly(input_domain, num_partitions):
    dp.assert_features("contrib")
    assert can_be_partitioned(input_domain)  # like vectors, arrays

    def f_partition_randomly(data):
        data = np.array(data, copy=True)
        # randomly partition data into `num_partitions` data sets
        np.random.shuffle(data)
        return np.array_split(data, num_partitions)

    return dp.t.make_user_transformation(
        input_domain=input_domain,
        input_metric=dp.symmetric_distance(),
        output_domain=dp.vector_domain(input_domain, size=num_partitions),
        output_metric=partition_distance(dp.symmetric_distance()),
        function=f_partition_randomly,
        # TODO: what kind of domain descriptors can you use to improve this?
        # TODO: how might you modify the function to improve this?
        stability_map=lambda b_in: (b_in, b_in, b_in))
```
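For intuition, here is what the function inside that transformation produces, stripped of the OpenDP wrapper (illustrative only): `num_partitions` roughly equal chunks of the shuffled data.

```python
import numpy as np

# shuffle then split, as in f_partition_randomly above
rng = np.random.default_rng(0)
data = np.arange(10)
rng.shuffle(data)
parts = np.array_split(data, 3)

# np.array_split tolerates uneven splits: 10 records over 3 partitions
print([len(p) for p in parts])  # -> [4, 3, 3]
```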
Alternatively, you might group by a key, or partition randomly such that each user contributes to at most one partition.

Regardless of how you split/sample the data, the sample-and-aggregate transformation only needs the l0 field of the distance triple:
```python
def make_sample_and_aggregate(input_domain, output_domain, num_partitions, black_box):
    dp.assert_features("contrib", "honest-but-curious")

    # SAMPLE: randomly partition data into `num_partitions` data sets
    trans_subsample = make_partition_randomly(input_domain, num_partitions)

    # AGGREGATE: apply `black_box` function to each partition
    trans_aggregate = dp.t.make_user_transformation(
        input_domain=dp.vector_domain(input_domain, size=num_partitions),
        output_domain=dp.vector_domain(output_domain, size=num_partitions),
        input_metric=partition_distance(dp.symmetric_distance()),
        output_metric=dp.hamming_distance(),
        function=lambda parts: [black_box(part) for part in parts],
        stability_map=lambda b_in: b_in[0])  # 1-stable

    return trans_subsample >> trans_aggregate
```
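To see why the aggregation step is 1-stable in the l0 coordinate, consider this toy check (illustrative, outside the OpenDP wrappers): if neighboring partitionings differ in at most l0 partitions, then applying any per-partition function changes at most l0 of the outputs, which is exactly a Hamming-distance bound.

```python
# a mean as a stand-in for the black box
black_box = lambda part: sum(part) / len(part)

parts   = [[1, 2], [3, 4], [5, 6]]
parts_p = [[1, 2], [3, 4, 10], [5, 6]]  # differs in exactly one partition, so l0 = 1

out   = [black_box(p) for p in parts]
out_p = [black_box(p) for p in parts_p]

# number of output positions that changed: never more than l0
hamming = sum(a != b for a, b in zip(out, out_p))
print(hamming)  # -> 1
```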
https://programming-dp.com/ch7.html#sample-and-aggregate