opendp / smartnoise-sdk

Tools and service for differentially private processing of tabular and relational data

Behavior of `censor_dims=true` that I don't understand #568

Open mhauru opened 1 year ago

mhauru commented 1 year ago

I have a case in which I don't understand the behavior of `censor_dims`. Namely, I run essentially the same query in two different ways but get different results. The data is a table with two columns, `category` and `id`. The two ways are

  1. By doing a `COUNT(*) ... GROUP BY category` query with `max_ids=100`.
  2. By doing a non-DP preprocessing step `COUNT(*) AS counts ... GROUP BY category, id`, and then running a DP `SUM(counts) ... GROUP BY category` query on the result with `max_ids=2` (because `category` takes two different values) but `upper=100` for `counts`.

I would expect the results to be the same, and they roughly are, except for `censor_dims`, which consistently censors the results in the first case but not in the second.

I'm not sure if this is an issue with smartnoise-sql or with my understanding of DP/the `censor_dims` mechanism, but I'm hoping to get some clarity on that. Here's a code snippet that illustrates the issue:

import numpy as np
import pandas as pd
import random
import snsql
import yaml

def run_dp_query(df, query, epsilon, delta, metadata_yaml):
    metadata = yaml.load(metadata_yaml, Loader=yaml.SafeLoader)
    snsql_metadata = {"": {"": {"test_table": metadata}}}
    privacy = snsql.Privacy(epsilon=epsilon, delta=delta)
    reader = snsql.from_df(df, privacy=privacy, metadata=snsql_metadata)
    private_result = reader.execute(query)
    return private_result

def main():
    # Create some data
    random.seed(0)
    ids = list(range(100))
    df1 = pd.DataFrame(
        [
            {
                "id": random.choice(ids),
                "category": random.choice(["a", "b"]),
            }
            for i in range(10000)
        ]
    )

    # Set DP parameters
    # NOTE the extremely large value of epsilon.
    # The surprising behavior is there with all smaller values too.
    epsilon = 100.0
    delta = 0.0001

    # Run a DP query that counts values by category.
    query1 = """SELECT
          COUNT(*) AS counts, category
          FROM test_table
          GROUP BY category"""
    metadata_yaml1 = """
          max_ids: 100
          id:
            type: int
            private_id: true
          category:
            type: string"""

    private_result1 = run_dp_query(df1, query1, epsilon, delta, metadata_yaml1)
    print("DP counts the first way:")
    print(private_result1)

    # Run the same query in a different way, with some preprocessing.
    # Namely, first count the values in each category for each id without DP, then sum
    # over different ids with DP.
    df2 = df1.groupby(["id", "category"]).size().reset_index(name="counts")
    query2 = """SELECT
          SUM(counts) AS counts, category
          FROM test_table
          GROUP BY category"""
    metadata_yaml2 = """
          max_ids: 2
          id:
            type: int
            private_id: true
          category:
            type: string
          counts:
            type: int
            lower: 0
            upper: 100"""
    private_result2 = run_dp_query(df2, query2, epsilon, delta, metadata_yaml2)
    print("DP counts the second way:")
    print(private_result2)

main()

Typical output of running the above is

DP counts the first way:
[['counts', 'category']]
DP counts the second way:
[['counts', 'category'], [5071, 'a'], [4930, 'b']]

Tagging @fhoussiau because we had a chat about this earlier today.

joshua-oss commented 1 year ago

Thanks for such a concise repro code example.

One way to see what is happening is to add a print(reader.tau) right after the reader.execute. In the first example, the threshold is set to a number slightly greater than 100, because we have to assume that each user can appear in up to 100 bins, and the threshold would be roughly 100 as delta approaches 1. You can think of delta as "the chance that epsilon will fail to hold", and if each user can potentially be in 100 bins, that user's membership could be interrogated 100 times. Thus we need a threshold a bit above the number of bins per user, to provide protection. The challenge is that the total population size is only 100, so it's impossible to get any bin with a population size that can guarantee privacy over 100 releases.
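For concreteness, here's the run_dp_query helper from your snippet with just that print added (nothing else changes; this assumes reader.tau is populated after execute, as described above):

def run_dp_query(df, query, epsilon, delta, metadata_yaml):
    metadata = yaml.load(metadata_yaml, Loader=yaml.SafeLoader)
    snsql_metadata = {"": {"": {"test_table": metadata}}}
    privacy = snsql.Privacy(epsilon=epsilon, delta=delta)
    reader = snsql.from_df(df, privacy=privacy, metadata=snsql_metadata)
    private_result = reader.execute(query)
    # Inspect the tau threshold used for dimension censoring
    print("tau:", reader.tau)
    return private_result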

If you use `ids = list(range(1000))`, both examples work, whereas if you use `ids = list(range(2))`, both examples fail.

One issue here is that max_ids only limits the total number of rows from each user that can be sampled across the whole database. If that value is 100, we could end up with a user that has 1 row each in 100 bins, and another that has 100 rows in a single bin. We of course know that we can't have a user who has 100 rows each in 100 bins, but we have to account for the worst case. If we instead allowed curators to set both a max_bins and max_ids_per_bin, it would be possible to tune the behavior. In your example, you would clearly set max_bins == 2 if we made such a setting available, and then the first example would have tau == 2.1 and everything would work as you expect. Of course, if you ever had a situation where max_bins was really supposed to be 100, and the population size was 100, it would still yield a threshold bigger than the possible population size.

This approach would have the drawback that the curator would need a bit more domain knowledge about the population, and this dual reservoir sampling could introduce additional sorts of bias beyond those introduced by population-wide reservoir sampling. However, this could be manageable with good documentation, and wouldn't be hard to implement.

Both are variants of "count Laplace" or "count Gaussian" described in section 1.1 of [1]. If you check Table 2 (on page 19) of that paper, you can see that count Laplace and count Gaussian work well when max_bins is small. Our approach does well when max_ids=1, but deteriorates rapidly versus alternatives as max_ids (or, by extension, a hypothetical max_bins) gets large. The experiments in the paper were based on Zipf-distributed data rather than relatively balanced data as in your code example, but the results probably hold in general: anything with max_bins > 100 would benefit from using "policy Laplace" or "policy Gaussian". Unfortunately, the "policy" methods are computationally expensive and cannot be done inside the database engine. We have code samples for doing this in-memory with Python and via Spark. Presumably people wanting very large max_bins will be operating on very large datasets, in which case performance becomes a problem. In any case, a straightforward way to apply this would be to select the bins first, using the Spark code sample, then use an inner join on the selected bins to filter the source data down, and feed the filtered data to SmartNoise with censor_dims=false.
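As a rough sketch of that workflow, reusing the names from your snippet (the selected_bins DataFrame here is just a hypothetical stand-in for the output of a policy Laplace/Gaussian bin-selection step):

# Hypothetical: `selected_bins` stands in for the bins that survived a policy
# Laplace/Gaussian selection step (e.g. from the Spark code sample).
selected_bins = pd.DataFrame({"category": ["a", "b"]})

# Filter the source data down to the selected bins with an inner join.
df_filtered = df1.merge(selected_bins, on="category", how="inner")

# Same metadata as the first example, but with dimension censoring turned off,
# since the bins were already selected in a differentially private way.
metadata_yaml3 = """
      max_ids: 100
      censor_dims: false
      id:
        type: int
        private_id: true
      category:
        type: string"""

private_result3 = run_dp_query(df_filtered, query1, epsilon, delta, metadata_yaml3)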

Having said that, I'm not sure if your use case is really one where you want a large max_bins, or is instead one where you're just getting a large max_bins because SmartNoise is incapable of distinguishing between the max rows per user and max bins. I assume it's the latter?

We could categorize the common scenarios into 4 groups, depending on whether max_bins and max_ids_per_bin are bounded or unbounded:

A. `SELECT educ, AVG(income) FROM PUMS GROUP BY educ`. In this example, max_bins=1 and max_ids_per_bin=1, because we assume each user can only have one education level, and we have row_privacy. This is the same as max_ids=1 in our current implementation.

B. `SELECT Region, SUM(TotalPrice) FROM CustomerOrder GROUP BY Region`. We assume each customer is assigned to a single region, but each customer may have an unbounded number of orders. So we would have max_bins=1 and max_ids_per_bin=k, where k is some constant to bound the contribution. This is problematic in the current implementation, because we censor dimensions too aggressively based on our need to assume that max_bins == max_ids_per_bin.

C. `SELECT ProductCategory, COUNT(DISTINCT CustomerID) FROM Order GROUP BY ProductCategory`. In this case, each user can appear in many bins, but can appear only once in each bin. It's max_bins=k and max_ids_per_bin=1. Here our dimension censoring will be more acceptable, but we will waste a lot of accuracy by noising the counts at a sensitivity of k. (And as we saw in [1], censoring performance degrades as k gets large.)

D. `SELECT url, SUM(DwellTime) GROUP BY url`. In this case, each user may visit many web sites, and each user may visit each web site many times. In practice, the number of visits per URL and the number of URLs per user can easily reach into the thousands. So these are practically unbounded, and the curator needs to set max_bins=k1 and max_ids_per_bin=k2.

[1] https://arxiv.org/pdf/2002.09745.pdf

joshua-oss commented 1 year ago

We likely wouldn't be able to address this in metadata at the table level, but would instead need to address it at the column level. Of course, the simplest pattern would be to assume (and enforce through sampling) that max_bins_per_user is 1 for any categorical column, and then use max_ids to scale the sensitivity of all measure columns. However, this would break your code example, and in practice people probably have a mix of categorical columns which are always the same for a given user, and categorical columns which are meant to further subdivide the measures for a given user.

We could mark the columns where max_bins_per_user is clearly known ahead of time, like:

   max_ids: 20
   region:
      type: string
      max_bins_per_user: 1
   sales_month:
      type: string
      max_bins_per_user: 12

And then set max_bins for tau thresholding using the product of the values (e.g. 1 * 12 for region * sales_month). Something like product_name would be more challenging, since you might want to use a larger max_bins for product_name if grouping by region than if grouping by sales_month. For example, grouping by region, you might want to select up to 120 products, whereas grouping by sales_month, you might only want to take 10 products per sales_month. This heuristic would depend heavily on your population size (as we saw above), and on the data distribution. There is also the fact that fields like region and sales_month could be considered public dimensions, and could be handled by an outer join and zero counts rather than dimension censoring.
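As a rough sketch of the outer-join-with-zero-counts idea (made-up region names and DP totals, just to show the shape of the post-processing):

import pandas as pd

# Hypothetical DP output produced with censor_dims=false; "East" happens to be
# missing because no surviving rows fell in that bin, not because it was censored.
dp_result = pd.DataFrame({"region": ["North", "South", "West"],
                          "total_price": [1021.3, 987.6, 1130.2]})

# The full set of regions is public knowledge, so it can be used as-is.
public_regions = pd.DataFrame({"region": ["North", "South", "East", "West"]})

# Outer join against the public dimension list and report missing bins as zero.
complete = public_regions.merge(dp_result, on="region", how="left").fillna(0)
print(complete)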

mhauru commented 1 year ago

> Having said that, I'm not sure if your use case is really one where you want a large max_bins, or is instead one where you're just getting a large max_bins because SmartNoise is incapable of distinguishing between the max rows per user and max bins. I assume it's the latter?

Yes, it is the latter. I can work around this with some preprocessing like in my example, but I was confused as to why the preprocessing affected censor_dims.

Thanks for your very comprehensive answer. I think I get the gist of it; the four groups of cases are very illustrative, but I'll have to go chew on this a bit. Feel free to close this since I think you've answered my question, or leave it open if you think you might want to implement some of the max_bins_per_id stuff.

joshua-oss commented 1 year ago

Thanks; I'll keep this open as a work item to do at least two things:

In the docs we should also provide some advice about selecting max_ids, how to address this by pivoting, handling public dimensions, etc. Please add more comments if you think of other heuristics we could use to provide better defaults.