opendp / smartnoise-sdk

Tools and service for differentially private processing of tabular and relational data
MIT License

Allow batches for get_privacy_cost #422

Open joshua-oss opened 3 years ago

joshua-oss commented 3 years ago

Per: https://github.com/opendp/opendp/discussions/379

The get_privacy_cost method should allow an arbitrary list of queries and return the total spend for all queries in the batch.

If the caller wants to keep the budget below a certain threshold, this method could be called repeatedly to find the per-column privacy parameters that achieve that threshold. This is cumbersome, since the privacy parameters are fixed at reader instantiation time, so each step of the search requires creating a new reader with a new set of parameters. We could provide a helper that performs the search for the caller, shortcutting the need to instantiate a new reader at every step. This entire process can probably be outsourced to OpenDP at some point in the future.
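
For illustration, here is a rough sketch of that search against the current API, assuming a conn and metadata are already in hand. Summing per-query costs is plain sequential composition, so it may over-estimate what a tighter accountant would report:

    import snsql
    from snsql import Privacy

    def batch_cost(epsilon, queries, conn, metadata, delta=1e-5):
        # A fresh reader per candidate epsilon, since the privacy
        # parameters are fixed at instantiation time.
        reader = snsql.from_connection(conn, metadata=metadata,
                                       privacy=Privacy(epsilon=epsilon, delta=delta))
        # Naive sequential composition over the batch.
        return sum(reader.get_privacy_cost(q)[0] for q in queries)

    def find_epsilon(queries, conn, metadata, budget, lo=1e-4, hi=10.0, iters=30):
        # Bisect on the per-column epsilon until the whole batch fits the budget.
        for _ in range(iters):
            mid = (lo + hi) / 2
            if batch_cost(mid, queries, conn, metadata) <= budget:
                lo = mid   # fits; try a larger per-column epsilon
            else:
                hi = mid   # too expensive; back off
        return lo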

A more complicated case arises when the workload includes queries with different desired noise scales: for example, a "GROUP BY educ, race" that uses epsilon 0.5, plus two additional queries, one for each one-way marginal, each using epsilon 0.1. Computing the total spend for such a batch is still not difficult; it only requires the 'steps' from the separate odometers to be merged into a single odometer that can compose them, though the calling API needs some thought. What is harder is the case where the caller wants to binary-search over these heterogeneous accuracies to fit within a fixed budget. We can provide code samples, but it is not clear how this could be cleanly exposed via an automatic API.
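
As a rough sketch of the total-spend computation for that example, assuming a PUMS-style conn and metadata, basic sequential composition just adds the per-query costs; a merged odometer could compose them more tightly:

    import snsql
    from snsql import Privacy

    # One reader per desired noise scale; conn, metadata, and the
    # PUMS.PUMS table name are placeholders.
    coarse = snsql.from_connection(conn, metadata=metadata,
                                   privacy=Privacy(epsilon=0.5, delta=1e-6))
    fine = snsql.from_connection(conn, metadata=metadata,
                                 privacy=Privacy(epsilon=0.1, delta=1e-6))

    two_way = "SELECT educ, race, COUNT(*) AS n FROM PUMS.PUMS GROUP BY educ, race"
    one_ways = ["SELECT educ, COUNT(*) AS n FROM PUMS.PUMS GROUP BY educ",
                "SELECT race, COUNT(*) AS n FROM PUMS.PUMS GROUP BY race"]

    # Basic sequential composition: add the per-query epsilons and deltas.
    costs = [coarse.get_privacy_cost(two_way)] + [fine.get_privacy_cost(q) for q in one_ways]
    total_epsilon = sum(c[0] for c in costs)
    total_delta = sum(c[1] for c in costs)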

joshua-oss commented 2 years ago

Two more enhancements should be made in this PR:

  1. The odometer should allow serialization and rehydration. This would let stateless services keep track of cumulative budget spend.
  2. We should allow computing the total cumulative spend for a given query when added to everything that has already been queried. That is, given an odometer that logs all mechanism invocations in a session, what would the total spend be if we ran this additional query? This would let services check that a fixed budget would not be exceeded before running a query. (A rough sketch of both ideas follows this list.)
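
A rough sketch of both ideas, using placeholder reader, query, and max_epsilon objects. It assumes pickling the odometer is acceptable and that the odometer attribute on a reader can be reassigned, which may not hold; the pre-check uses basic composition, so it is only an estimate of what the odometer itself would report:

    import pickle

    # 1. Rehydration: persist the odometer state between requests so a
    #    stateless service can keep a running total. Pickling the whole
    #    object is a blunt but schema-agnostic option.
    blob = pickle.dumps(reader.odometer)
    # ... store blob, then later, possibly in another process ...
    reader.odometer = pickle.loads(blob)  # assumes the attribute can be reassigned

    # 2. Pre-check: estimate the total spend if one more query were run.
    #    get_privacy_cost does not spend budget, so it is safe to call first.
    spent_epsilon, spent_delta = reader.odometer.spent
    query_epsilon, query_delta = reader.get_privacy_cost(query)
    if spent_epsilon + query_epsilon > max_epsilon:  # max_epsilon is the fixed budget
        raise RuntimeError("query would exceed the remaining budget")
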
joshua-oss commented 2 years ago

Partially addressed in #486

FishmanL commented 2 years ago

FWIW, I had this problem internally and handled it with a wrapper around Privacy that works as follows:

    import snsql
    from snsql import Privacy

    # Wrapper class (illustrative name) around a snsql reader: shrinks the
    # per-query epsilon until each query fits the remaining overall budget.
    class BudgetCappedReader:
        def __init__(self, data_conn, metadata, max_epsilon, scaling_const=1e3):
            self.scaling_const = scaling_const
            init_privacy = Privacy(epsilon=max_epsilon / self.scaling_const, delta=1e-5)
            self.reader = snsql.from_connection(data_conn, metadata=metadata, privacy=init_privacy)
            self.initial_budget = max_epsilon
            self.curr_epsilon_scale = max_epsilon / scaling_const
            self.spent = (0, 0)

        def _reset_privacy(self, privacy_obj):
            # Swap the privacy parameters without re-instantiating the reader
            self.reader.privacy = privacy_obj
            self.reader.odometer.privacy = privacy_obj

        def execute(self, query):
            remaining = self.initial_budget - self.spent[0]
            privacy_cost = self.reader.get_privacy_cost(query)
            while privacy_cost[0] >= remaining:
                # Too expensive: shrink the per-query epsilon and re-check
                self.curr_epsilon_scale /= self.scaling_const
                self._reset_privacy(Privacy(epsilon=self.curr_epsilon_scale, delta=1e-5))
                privacy_cost = self.reader.get_privacy_cost(query)
            result = self.reader.execute(query)
            self.spent = self.reader.odometer.spent
            return result

FishmanL commented 2 years ago

Also, RE serialization/rehydration, are a getter and setter for steps not enough?

joshua-oss commented 2 years ago

Yes, I think that would work best for the heterogeneous odometer; for the homogeneous odometer, we just need the 'k' and the privacy parameters. At some point, we would like to allow even tighter composition using the PRV accountant [1], in which case the "steps" would be a list that includes the mechanism and the noise scale. The same would apply if we add an accountant that supports the zCDP recently added to OpenDP. In any case, you are correct that we could just serialize the list of steps from the specific accountant being used and load it back in on deserialization.
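
For the homogeneous case, a getter/setter pair might look roughly like the following; the attribute names used here (k, privacy.epsilon, privacy.delta) are assumptions about the odometer's internals rather than a documented API:

    import json

    def dump_odometer(odometer):
        # Homogeneous case: the state is just k repetitions of the same
        # (epsilon, delta). A heterogeneous accountant would instead dump
        # one entry per step (mechanism and noise scale).
        return json.dumps({"epsilon": odometer.privacy.epsilon,
                           "delta": odometer.privacy.delta,
                           "k": odometer.k})

    def load_odometer(odometer, serialized):
        # The attribute name 'k' is assumed here.
        odometer.k = json.loads(serialized)["k"]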

If we are just pickling to a file, that's pretty easy. But I imagine services would want to read and update the serialized spend from a transactional store, to prevent double spend or free spend. We could provide a code sample for the homogeneous case by wrapping a transaction around the update of 'k', and for the heterogeneous case by appending a 'step' row. It would be nice if there were a way to serialize and deserialize within a transaction without having to know the internal schema of the specific odometer, though. Maybe pickle to a blob?
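
A minimal sketch of that for the homogeneous case, using SQLite as a stand-in transactional store; the spend table, its schema, and the session key are hypothetical:

    import sqlite3

    def increment_spend(db_path, session_id):
        # BEGIN IMMEDIATE takes the write lock up front, so two requests
        # cannot both read k and then both write k + 1 (a free spend).
        con = sqlite3.connect(db_path, isolation_level=None)
        try:
            con.execute("CREATE TABLE IF NOT EXISTS spend "
                        "(session_id TEXT PRIMARY KEY, k INTEGER)")
            con.execute("BEGIN IMMEDIATE")
            row = con.execute("SELECT k FROM spend WHERE session_id = ?",
                              (session_id,)).fetchone()
            k = row[0] if row else 0
            # Heterogeneous case would append a 'step' row here instead.
            con.execute("INSERT OR REPLACE INTO spend (session_id, k) VALUES (?, ?)",
                        (session_id, k + 1))
            con.commit()
        except Exception:
            con.rollback()
            raise
        finally:
            con.close()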

We have a code sample at [2] which shows how to share an odometer between multiple readers to build a blended workload with different epsilons. It relies on the object being shared in memory, rather than on serialization.

We've refrained from adding a helper that tells whether there is enough budget left, since we didn't want to give the impression that it's safe without service-specific code to ensure a transactional update. So we leave that to callers to wrap themselves.

[1] https://github.com/microsoft/prv_accountant
[2] https://docs.smartnoise.org/sql/advanced.html#managing-privacy-budget