opendp / smartnoise-sdk

Tools and service for differentially private processing of tabular and relational data
MIT License

Implement MIN and MAX #31

joshua-oss opened this issue 4 years ago (status: Open)

joshua-oss commented 4 years ago

May be able to do PERCENTILE_DISC

FishmanL commented 1 year ago

ah yep just saw this

FishmanL commented 1 year ago

Is there some reason why passing min/max to the same function as sum/count wouldn't work? Naively it seems like it should

joshua-oss commented 1 year ago

There is some more discussion in #367, where MIN, MAX, and MEDIAN can be considered to be special cases of quantiles.

For SUM, we require the data owner to supply metadata which specifies lower and upper bounds, which is how we know the sensitivity needed to determine how much noise to use. (For COUNT, sensitivity is just 1). The sensitivity of Min and Max is trickier, because we usually don't have bounds if we are looking for MIN and MAX, and in fact we might be trying to estimate the bounds so that we can safely compute the sensitivity for a SUM when no metadata is available. Since we don't know the min and max, we don't know ahead of time how much any individual can impact the output.
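
For concreteness, here's a minimal sketch of what that metadata looks like and how it gets used. The file names, columns, and bounds are made up, and the YAML nesting mirrors what the tt_to_yaml() sample later in this thread writes out:

import pandas as pd
from snsql import Privacy, from_connection

# Hypothetical metadata: the data owner declares lower/upper bounds per column,
# and those bounds are what fix the sensitivity used for SUM and AVG.
meta = """
PUMS:
  PUMS:
    PUMS:
      row_privacy: True
      age:
        type: int
        lower: 0
        upper: 100
      income:
        type: float
        lower: 0.0
        upper: 450000.0
"""
with open("PUMS_meta.yaml", "w") as f:
    f.write(meta)

pums = pd.read_csv("PUMS.csv", index_col=None)
reader = from_connection(pums, metadata="PUMS_meta.yaml",
                         privacy=Privacy(epsilon=1.0, delta=1e-5))
print(reader.execute("SELECT COUNT(*), SUM(income) FROM PUMS.PUMS"))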

We do have a mechanism to calculate the approximate bounds, snsql.sql._mechanisms.approx_bounds, but it needs a vector of values. In other words, unlike the other statistics we calculate, we can't rely on the database engine to compute a non-noisy aggregate that we then postprocess by adding noise. We need a sample of individual rows. (The same is true for all quantiles.) So, the way we would implement this is to pull back a random sample of 1,000 values packed into a cell for each partition, then unpack the values and pass them through to the approximate bounds calculation. As described in #367, the SQL for this is different for every engine since it's not SQL-92. It's not super difficult to implement, but it would need a lot of unit tests and CI to make sure the syntax is correct for all engines.
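
For illustration, calling that mechanism on a vector of values looks roughly like this. The import path is the one mentioned above, but it's private API, and the exact function name and signature here are from memory, so treat them as assumptions:

import pandas as pd
# assumed import path/signature for the private approx_bounds mechanism
from snsql.sql._mechanisms.approx_bounds import approx_bounds

pums = pd.read_csv("PUMS.csv", index_col=None)

# the vector of raw values -- this is exactly the piece a plain SQL-92
# aggregate can't give us, and why the per-partition sampling rewrite is needed
incomes = pums['income'].dropna().tolist()

# differentially private estimate of (lower, upper)
lower, upper = approx_bounds(incomes, epsilon=0.5)
print(lower, upper)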

Also note that MIN and MAX don't always have great accuracy even with a high epsilon spend. Quantiles like the median or the 25th percentile work much better, because we can use the exponential mechanism (but still with a random vector of sampled values).
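
As a sketch of the idea, here's a generic textbook exponential-mechanism quantile, not the implementation snsql would ship, assuming a clamped random sample of values and known outer bounds:

import numpy as np

def exp_mech_quantile(values, q, epsilon, lower, upper, rng=None):
    # Generic exponential-mechanism quantile (illustrative only, not snsql code).
    # Score each gap between adjacent sorted values by how close its rank is to
    # q*n, weight by the gap width, then sample a gap and draw uniformly inside it.
    rng = np.random.default_rng() if rng is None else rng
    vals = np.clip(np.sort(np.asarray(values, dtype=float)), lower, upper)
    vals = np.concatenate(([lower], vals, [upper]))
    n = len(values)
    gaps = np.arange(len(vals) - 1)          # gap i lies between vals[i] and vals[i+1]
    scores = -np.abs(gaps - q * n)           # utility: distance of the gap's rank from q*n
    widths = np.diff(vals)
    # exponential mechanism with utility sensitivity 1, weighted by gap width
    log_w = (epsilon / 2.0) * scores + np.log(np.maximum(widths, 1e-12))
    log_w -= log_w.max()                     # stabilize before exponentiating
    p = np.exp(log_w)
    p /= p.sum()
    i = rng.choice(len(p), p=p)
    return rng.uniform(vals[i], vals[i + 1])

# e.g. a DP median over a random sample of incomes:
# sample = pums['income'].dropna().sample(1000).tolist()
# print(exp_mech_quantile(sample, 0.5, epsilon=1.0, lower=0, upper=450_000))

The gap-width weighting is what lets the mechanism return a value anywhere in the range rather than only at sampled points, and the epsilon/2 factor is the standard exponential-mechanism calibration for a utility with sensitivity 1.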

Regarding min and max specifically, there are two other things worth mentioning. First, when fitting synthetic data, we assume access to all rows (rather than querying an engine for aggregates), so we can use the approximate bounds mechanism. After fitting a synthesizer, you can read the inferred min and max from the automatically generated data transformer object. I can add a code sample showing how to do this. It's possible in the current version of the library.
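
As a quick preview (the fuller example is later in this thread), the fitted per-column transformers expose the inferred bounds directly; this uses the same TableTransformer.create / fit calls and fit_lower / fit_upper attributes as the longer sample below:

import pandas as pd
from snsynth.transform import TableTransformer
from snsynth.transform.bin import BinTransformer

pums = pd.read_csv("PUMS_null.csv", index_col=None)

# fitting the TableTransformer directly; a synthesizer fit with an
# auto-generated transformer ends up holding the same kind of object
tt = TableTransformer.create(style='cube', data=pums)
tt.fit(pums, epsilon=0.5)

for col, xf in zip(pums.columns, tt.transformers):
    if isinstance(xf, BinTransformer):
        print(col, xf.fit_lower, xf.fit_upper)   # DP estimates of min and max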

Second, we could add a function that infers the SQL metadata in a differentially private manner. This would be as easy as fitting an auto-generated TableTransformer against SELECT * and then writing the discovered mins and maxes to a YAML file in the metadata format used by the SQL library. This would all work with the current library as well, and I can add a code sample, but it might be worth turning the code sample into a built-in function.

Finally, there could be some ambiguity between the bounds specified in the metadata and the actual min and max. We always clamp to the bounds, so the actual values won't fall outside the bounds, but it's conceivable that someone could set the bounds far too wide. We do this with the age column in the PUMS yaml, setting the lower bound to 0 even though the data only has 18+ individuals, so a MIN on age would return something closer to 18. In the case where the bounds are inferred, though, they are themselves differentially private estimates of min and max, so it wouldn't typically make sense to waste epsilon computing them again when the analyst requests a min or max.

FishmanL commented 1 year ago

First, I'd love a metadata discoverer and autogenerator for pandas (I was planning to write one of my own in the next week or two anyway, and it would doubtless be worse/less generic).

Second, I'm looking for min/max as an aggregate in the case where the bounds are fixed, so the sensitivity is fixed (to use e.g. in combination with GROUP BY) -- in that case, could we not just pass it to get_sum_or_count?
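
Concretely, I mean something like the following sketch (illustrative pandas/numpy, not anything in the library; column names and bounds are made up):

import numpy as np
import pandas as pd

def naive_dp_max(df, group_col, value_col, lower, upper, epsilon, rng=None):
    # Take the true per-group MAX of values clamped to the metadata bounds,
    # then add Laplace noise.  With fixed bounds, one individual can move the
    # max by up to (upper - lower), so that's the sensitivity.
    rng = np.random.default_rng() if rng is None else rng
    clamped = df[value_col].clip(lower, upper)
    true_max = clamped.groupby(df[group_col]).max()
    scale = (upper - lower) / epsilon
    noisy = true_max + rng.laplace(0.0, scale, size=len(true_max))
    return noisy.clip(lower, upper)

# e.g. noisy max income per education level, with metadata bounds [0, 450000]:
# print(naive_dp_max(pums, 'educ', 'income', 0, 450_000, epsilon=1.0))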

joshua-oss commented 1 year ago

Here's some sample code using the inference from the TableTransformer:

import pandas as pd
from snsql import Privacy, from_connection

from snsynth.transform.bin import BinTransformer
from snsynth.transform.label import LabelTransformer
import yaml

def tt_to_yaml(tt, table, column_names, *ignore, schema="", db=""):
    # serializes a fitted TableTransformer to a YAML string suitable for use with snsql
    if not tt.fit_complete:
        raise ValueError("TableTransformer must be fit before converting to YAML")
    assert len(column_names) == len(tt.transformers)
    columns = {}
    for i, col in enumerate(column_names):
        lower = None
        upper = None
        nullable = False
        if isinstance(tt.transformers[i], LabelTransformer):
            ctype = "string"
            nullable = tt.transformers[i].nullable
        elif isinstance(tt.transformers[i], BinTransformer):
            lower = tt.transformers[i].fit_lower
            upper = tt.transformers[i].fit_upper
            nullable = tt.transformers[i].nullable
            ctype = "integer" if isinstance(lower, int) else "float"
        else:
            raise ValueError(f"Unknown transformer type: {tt.transformers[i]}")
        columns[col] = {
            "type": ctype,
            "nullable": nullable,
            "name": col
        }
        if lower is not None:
            columns[col]["lower"] = lower
        if upper is not None:
            columns[col]["upper"] = upper
    yaml_dict = {
        db: {
            schema: {
                table: columns
            }
        }
    }
    yaml_dict[db][schema][table]['row_privacy'] = True
    return yaml.dump(yaml_dict)

pums = pd.read_csv("PUMS_null.csv", index_col=None)

# infer metadata
from snsynth.transform import TableTransformer
tt = TableTransformer.create(style='cube', data=pums)
tt.fit(pums, epsilon=0.5)

colnames = list(pums.columns)
y = tt_to_yaml(tt, table="PUMS", schema="PUMS", column_names=colnames)
with open("PUMS_gen.yaml", "w") as f:
    f.write(y)

reader = from_connection(pums, metadata="PUMS_gen.yaml", privacy=Privacy(epsilon=2.0, delta=10E-5))
res = reader.execute("SELECT married, AVG(income) FROM PUMS.PUMS GROUP BY married")
print(res)

This only works with source data that has one row per individual and is neatly categorical or numeric, since we don't try to detect unbounded columns, and we don't try to detect whether something looks like an identifier. We also treat booleans as strings in this implementation. But this could be a good starting point. I would also like to add a method that goes the opposite direction, creating a TableTransformer from a yaml file.

Also note that in this example it will detect age as ordinal, since we have a hard-coded heuristic that treats integer columns with small cardinality and sequential values as ordinal. That's good for the marginal synthesizers, but with SQL we want these to be treated as numeric. The TableTransformer allows the caller to specify that a certain column is numeric when inferring the bounds, but I'm not sure what the best behavior here is.
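
For example, something like this would force numeric bounds inference for age (the continuous_columns keyword here is a sketch; the exact argument name may differ):

import pandas as pd
from snsynth.transform import TableTransformer

pums = pd.read_csv("PUMS_null.csv", index_col=None)

# assumes create() accepts a continuous_columns argument to override
# the ordinal heuristic for integer columns like age
tt = TableTransformer.create(style='cube', data=pums,
                             continuous_columns=['age', 'income'])
tt.fit(pums, epsilon=0.5)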

joshua-oss commented 1 year ago

Regarding your use case for MIN and MAX, do you mean something like:

SELECT educ, MAX(income) FROM PUMS.PUMS GROUP BY educ ORDER BY educ

Where we might have an upper bound set on income, but we'd also expect the max income in each group to be different? Then, are you proposing we take the actual MAX (non-noisy) for each bin, and add Laplace (e.g.) noise with sensitivity as determined from the upper and lower bounds for income?

FishmanL commented 1 year ago

Yes, exactly this.

FishmanL commented 1 year ago

(is this likely to hit master before the larger scale changes to quantiles?)

joshua-oss commented 1 year ago

Do you mean the change to infer SQL schema? I wanted to hold off on shipping this until the UN PET hackathon finishes next week, because I would hate to accidentally ship a bug that breaks the contestants. But I would like to add the schema inference ASAP after next week. The plan is to:

  1. Add something like the code snippet above to map TT to and from SQL Metadata
  2. Add ability to infer identifiers in TT, and create a ColumnTransformer that handles identifiers. Essentially this would be something that recognizes common formats like sequential integer, email address, GUID, etc. and drops that column on inbound, and generates fake IDs on outbound. This is useful for synthesizers and would allow us to map SQL metadata that has private_id columns
  3. Update snsql Metadata class to allow inference from data, using TT inference. This creates a two-way dependency between snsql and snsynth, so it might make sense to just break transforms off into a separate package that both can reference. We luckily don't have too many dependency issues at the moment, but it's been problematic in the past to install both packages in the same environment on certain OS setups.

Implementing support for MIN and MAX would definitely take longer, and would be done at the same time as quantiles. I think the approach of adding noise to the non-noisy MIN/MAX would result in too much noise to be useful. For example, in the case of income, supposing we set an upper bound of 450,000, the noise scale for Laplace would be 45,000 even with an epsilon of 10.0, which is already a quite high epsilon. In this scenario, a larger sample size doesn't help at all, whereas other mechanisms let you eventually get privacy "for free", because increasing the sample size shrinks the error caused by privacy until it's comparable to sampling error. The MIN/MAX algorithm we would use with the approach outlined for quantiles should do quite a bit better in terms of accuracy at small epsilon, and it benefits as sample size goes up.
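
As a rough back-of-the-envelope illustration of why sample size doesn't help with that approach (plain Python; the AVG scaling is only approximate, since in practice we divide a noisy sum by a noisy count):

lower, upper, epsilon = 0, 450_000, 10.0

for n in (1_000, 100_000, 10_000_000):
    # Laplace noise on a clamped MAX: sensitivity is (upper - lower),
    # so the noise scale stays at 45,000 no matter how many rows there are
    max_scale = (upper - lower) / epsilon
    # For comparison, the noise contribution to an AVG over n rows shrinks
    # roughly like (upper - lower) / (epsilon * n)
    avg_scale = (upper - lower) / (epsilon * n)
    print(n, round(max_scale), round(avg_scale, 2))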

FishmanL commented 1 year ago

Ahhh, fair.

What's the ETA on quantiles?

joshua-oss commented 1 year ago

We are hurtling into the holidays, but "this calendar year" is doable. Much of the foundation work is done, but the remaining part is a bit thorny. We essentially need all of the DB-specific readers to produce syntax that samples 'k' values (e.g. 1,000) uniformly at random and embeds them in a return column of the aggregate row. This requires going beyond SQL-92, and results in SQL that is different for every database. I've verified that they all have syntax that can work efficiently, but it's a matter of getting the AST and grammar and rewriter updated to support all of them, and ensuring the CI tests exercise them in all databases. So it'll consume some time.
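
On Postgres, for example, the per-partition sampling could look roughly like the query string below (illustrative only; other engines need different syntax, which is exactly the per-engine problem):

# One way the sampling rewrite could look on Postgres: shuffle each group with
# ORDER BY RANDOM() inside the aggregate, pack the values into an array cell,
# and keep only the first 1,000.  This is Postgres-specific, not SQL-92.
postgres_sample_sql = """
SELECT educ, (ARRAY_AGG(income ORDER BY RANDOM()))[1:1000] AS income_sample
FROM PUMS.PUMS
GROUP BY educ
"""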

Then there are fit-and-finish questions, such as: if someone requests 3 quantiles, do we want to use the same sampled values for all 3? Probably. Do we want quantiles over different columns to use different random samples, or do we want to sample in tuples? For example, with MIN(age), MAX(income) GROUP BY married, is it better that the age/income pairs consist only of pairs that occurred in the data, or should they be independent samples? We will probably do independent samples.

angelognazzo commented 1 year ago

Hey, is there any news on the implementation of quantiles, max/min?