Open joshua-oss opened 3 years ago
Two more enhancements should be made in this PR:
Partially addressed in #486
FWIW I had this problem internally, and handled it with a wrapper around privacy that did as follows:
import snsql
from snsql import Privacy


class BudgetedReader:  # hypothetical name; the original snippet omitted the class line
    def __init__(self, data_conn, metadata, max_epsilon, scaling_const=1e3):
        self.scaling_const = scaling_const
        # Start at max_epsilon / scaling_const per column.
        init_privacy = Privacy(epsilon=max_epsilon / self.scaling_const, delta=1e-5)
        self.reader = snsql.from_connection(data_conn, metadata=metadata, privacy=init_privacy)
        self.initial_budget = max_epsilon
        self.curr_epsilon_scale = max_epsilon / scaling_const
        self.spent = (0, 0)

    def _reset_privacy(self, privacy_obj):
        # Keep the reader and its odometer on the same privacy parameters.
        self.reader.privacy = privacy_obj
        self.reader.odometer.privacy = privacy_obj

    def execute(self, query):
        remaining = self.initial_budget - self.spent[0]
        if remaining <= 0:
            # Guard: with nothing left, the loop below would never terminate.
            raise RuntimeError("privacy budget exhausted")
        privacy_cost = self.reader.get_privacy_cost(query)
        # Shrink epsilon until the estimated cost fits in the remaining budget.
        while privacy_cost[0] >= remaining:
            self.curr_epsilon_scale /= self.scaling_const
            self._reset_privacy(Privacy(epsilon=self.curr_epsilon_scale, delta=1e-5))
            privacy_cost = self.reader.get_privacy_cost(query)
        result = self.reader.execute(query)
        self.spent = self.reader.odometer.spent
        return result
Also, RE serialization/rehydration, are a getter and setter for steps not enough?
Yes, I think that would work best for the heterogeneous odometer; for the homogeneous odometer, we just need the 'k' and the privacy parameters. At some point, we would like to allow even tighter composition using the PRV accountant [1], in which case the "steps" would be a list that includes the mechanism and the noise scale. The same applies if we add an accountant that supports the zCDP that was recently added to OpenDP. In any case, though, you are correct that we could just serialize the list of steps from the specific accountant being used, and load it back in on deserialization.
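A minimal sketch of the getter/setter idea, assuming each step is just an (epsilon, delta) pair. `ToyOdometer`, `dump_steps`, and `load_steps` are hypothetical names for illustration and are not part of the snsql API:

```python
import json
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyOdometer:
    """Hypothetical stand-in for a heterogeneous odometer: one (epsilon, delta) per step."""
    steps: List[Tuple[float, float]] = field(default_factory=list)

    def spend(self, epsilon: float, delta: float) -> None:
        # Record one step; nothing is composed at this point.
        self.steps.append((epsilon, delta))

    def dump_steps(self) -> str:
        # Getter side: serialize the list of steps as JSON.
        return json.dumps(self.steps)

    @classmethod
    def load_steps(cls, payload: str) -> "ToyOdometer":
        # Setter side: JSON turns tuples into lists, so convert them back.
        return cls(steps=[(float(e), float(d)) for e, d in json.loads(payload)])


odometer = ToyOdometer()
odometer.spend(0.5, 1e-5)
odometer.spend(0.1, 1e-5)
restored = ToyOdometer.load_steps(odometer.dump_steps())
```

An accountant that needs the mechanism and noise scale per step would presumably widen the tuple, but the round trip would look the same.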
If we are just pickling to a file, that's pretty easy. But I imagine services would want to read and update the serialized spend from a transactional store, to prevent double spend or free spend. We could provide a code sample of doing that for homogeneous just by wrapping a transaction around the update of 'k', and for heterogeneous by appending a 'step' row. It would be nice if there were a way to serialize and deserialize within a transaction without having to know the internal schema of the specific odometer, though. Maybe pickle to a blob?
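As a rough illustration of the "append a 'step' row" variant, here is a sketch using the standard-library `sqlite3` module. The `steps` table, `open_store`, and `try_spend` are assumptions made up for this example, and the budget check uses plain summation of epsilons (basic sequential composition), which is looser than what a real odometer might compute:

```python
import sqlite3


def open_store(path=":memory:"):
    # isolation_level=None leaves transaction control to the explicit BEGIN/COMMIT below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        "id INTEGER PRIMARY KEY, epsilon REAL NOT NULL, delta REAL NOT NULL)"
    )
    return conn


def try_spend(conn, epsilon, delta, max_epsilon):
    """Append a step only if the summed epsilon stays within max_epsilon."""
    # BEGIN IMMEDIATE takes the write lock before the read, so two writers
    # cannot both pass the check and then both insert (double spend).
    conn.execute("BEGIN IMMEDIATE")
    try:
        (spent,) = conn.execute(
            "SELECT COALESCE(SUM(epsilon), 0.0) FROM steps"
        ).fetchone()
        if spent + epsilon > max_epsilon:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "INSERT INTO steps (epsilon, delta) VALUES (?, ?)", (epsilon, delta)
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise


conn = open_store()
first = try_spend(conn, 0.5, 1e-5, max_epsilon=1.0)
second = try_spend(conn, 0.25, 1e-5, max_epsilon=1.0)
third = try_spend(conn, 0.5, 1e-5, max_epsilon=1.0)  # would exceed the budget
```

The query should only be executed after `try_spend` returns `True`; spending first and running second errs on the side of wasted budget rather than free spend.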
We have a code sample at [2] which shows how to share an odometer between multiple readers to make a blended workload with different epsilons. It relies on the fact that the object is shared, rather than serialization.
We've refrained from adding a helper that tells if there is enough budget left, since we didn't want to give the impression that it's safe without the service-specific code to ensure transactional update. So we leave that to people to wrap themselves.
[1] https://github.com/microsoft/prv_accountant
[2] https://docs.smartnoise.org/sql/advanced.html#managing-privacy-budget
Per: https://github.com/opendp/opendp/discussions/379
The get_query_cost method should allow an arbitrary list of queries, and return the total spend for all queries in the batch.
If the caller wants to keep budget below a certain threshold, this method could be called repeatedly to find the per-column privacy parameters that achieve that threshold. This is cumbersome, since the privacy parameters are passed in at reader instantiation time, requiring the search to create a new reader for each set of privacy parameters at each step of the search. We could make a helper that does the search for the caller, while shortcutting the need to instantiate a new reader at every step. This entire process can probably be outsourced to opendp at some point in the future.
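A sketch of what such a search helper might look like, written against a generic cost callable rather than a real reader. `fit_epsilon` and `toy_cost` are hypothetical, the `PUMS` table name is only illustrative, and the search assumes the batch cost never decreases as epsilon grows:

```python
from typing import Callable, Sequence


def fit_epsilon(
    cost_fn: Callable[[float, Sequence[str]], float],
    queries: Sequence[str],
    budget: float,
    lo: float = 1e-6,
    hi: float = 10.0,
    iterations: int = 50,
) -> float:
    """Largest epsilon in [lo, hi] whose batch cost stays within budget.

    Assumes cost_fn is non-decreasing in epsilon.
    """
    if cost_fn(lo, queries) > budget:
        raise ValueError("budget is too small even for the lower bound")
    if cost_fn(hi, queries) <= budget:
        return hi
    # Invariant: cost(lo) <= budget < cost(hi).
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if cost_fn(mid, queries) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def toy_cost(epsilon: float, queries: Sequence[str]) -> float:
    # Hypothetical cost model: every query spends epsilon on two columns,
    # and costs are composed by plain summation.
    return 2 * epsilon * len(queries)


queries = [
    "SELECT educ, COUNT(*) FROM PUMS GROUP BY educ",
    "SELECT race, COUNT(*) FROM PUMS GROUP BY race",
]
epsilon = fit_epsilon(toy_cost, queries, budget=1.0)
```

With a real reader, `cost_fn` would have to update the privacy parameters and re-estimate the cost at each step, which is exactly the part that currently requires building a new reader.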
A more complicated case is when the workload includes queries with different desired noise scales. For example, a "GROUP BY educ, race" that uses epsilon 0.5, and then two additional queries, one for each one-way, each using 0.1. In this case, computing total spend for the batch is still not difficult, and only requires the 'steps' from the separate odometers to be merged in an odometer that can compose them. The calling API requires some thought, though. More difficult is if the caller wants to do a binary search to adjust these heterogeneous accuracies to fit within a certain budget. We can provide code samples, but it's not clear how this could be cleanly exposed via an automatic API.
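To make the merging step concrete, here is a small sketch that treats each step as a single (epsilon, delta) pair and composes by plain summation (basic sequential composition). That is a valid but loose bound, and it ignores per-column accounting; `merge_steps` and `compose_basic` are hypothetical helpers, not snsql functions:

```python
import math
from typing import Iterable, List, Tuple

Step = Tuple[float, float]  # (epsilon, delta)


def merge_steps(*odometer_steps: Iterable[Step]) -> List[Step]:
    # Concatenate the step lists recorded by the separate odometers.
    merged: List[Step] = []
    for steps in odometer_steps:
        merged.extend(steps)
    return merged


def compose_basic(steps: Iterable[Step]) -> Step:
    # Basic sequential composition: epsilons and deltas simply add up.
    steps = list(steps)
    return (math.fsum(e for e, _ in steps), math.fsum(d for _, d in steps))


two_way = [(0.5, 1e-5)]               # GROUP BY educ, race
one_ways = [(0.1, 1e-5), (0.1, 1e-5)]  # one query per one-way
total_epsilon, total_delta = compose_basic(merge_steps(two_way, one_ways))
```

A tighter accountant would replace `compose_basic` while the merge stays the same, which is why exposing the steps seems like the right seam.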