Open mhauru opened 1 year ago
Thanks for such a concise repro code example.

One way to see what is happening is to add a `print(reader.tau)` right after the `reader.execute`. In the first example, the threshold is set to a number slightly greater than 100, because we have to assume that each user can appear in up to 100 bins, and the threshold would be roughly 100 as delta approaches 1. You can think of delta as "the chance that epsilon will fail to hold", and if each user can potentially be in 100 bins, that user's membership could be interrogated 100 times. Thus we need a threshold a bit above the number of bins per user to provide protection. The challenge is that the total population size is only 100, so it's impossible to get any bin with a population size that can guarantee privacy over 100 releases.

If you use `ids = list(range(1000))`, both examples work, whereas if you use `ids = list(range(2))`, both examples fail.
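To make the arithmetic concrete, here is a small plain-Python sketch of that reasoning. The threshold value below is just a stand-in for "slightly greater than 100", not the library's actual tau formula, and the even split across two categories mirrors the repro.

```python
# Stand-in illustration of the tau-thresholding arithmetic above (not the
# library's actual formula). With max_ids = 100 we must assume a single user
# could touch 100 distinct bins, so the threshold must sit a bit above 100.
max_ids = 100
threshold = max_ids + 0.5  # "slightly greater than 100"

for population in (100, 1000):
    # the repro splits the ids roughly evenly across two categories
    bin_sizes = {"A": population // 2, "B": population - population // 2}
    released = [cat for cat, size in bin_sizes.items() if size > threshold]
    print(f"population={population}: bin sizes {bin_sizes}, released bins {released}")

# population=100  -> each bin holds ~50 users, below the threshold: everything is censored
# population=1000 -> each bin holds ~500 users, well above the threshold: bins survive
```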
One issue here is that `max_ids` only limits the total number of rows from each user that can be sampled across the whole database. If that value is 100, we could end up with a user that has 1 row each in 100 bins, and another that has 100 rows in a single bin. We of course know that we can't have a user who has 100 rows each in 100 bins, but we have to account for the worst case. If we instead allowed curators to set both a `max_bins` and a `max_ids_per_bin`, it would be possible to tune the behavior. In your example, you would clearly set `max_bins == 2` if we made such a setting available, and then the first example would have `tau == 2.1` and everything would work as you expect. Of course, if you ever had a situation where `max_bins` was really supposed to be 100, and the population size was 100, it would still yield a threshold bigger than the possible population size.

This approach would have the drawback that the curator would need a bit more domain knowledge about the population, and this dual reservoir sampling could introduce additional sorts of bias beyond those introduced by population-wide reservoir sampling. However, this could be manageable with good documentation, and wouldn't be hard to implement.
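For concreteness, here is a rough pandas sketch of what such dual reservoir sampling might look like. The parameter names `max_bins` and `max_ids_per_bin` come from the proposal above; the sampling logic is only an illustration, not anything that exists in smartnoise-sql today.

```python
import pandas as pd

def bound_contributions(df, user_col, bin_col, max_bins, max_ids_per_bin, seed=0):
    """Illustrative dual sampling: keep at most max_bins distinct bins per user,
    then at most max_ids_per_bin rows per (user, bin) pair."""
    # 1. For each user, sample at most max_bins of the bins they appear in.
    kept_bins = (
        df[[user_col, bin_col]]
        .drop_duplicates()
        .groupby(user_col, group_keys=False)
        .apply(lambda g: g.sample(n=min(len(g), max_bins), random_state=seed))
    )
    limited = df.merge(kept_bins, on=[user_col, bin_col])
    # 2. Within each surviving (user, bin) pair, sample at most max_ids_per_bin rows.
    return (
        limited.groupby([user_col, bin_col], group_keys=False)
        .apply(lambda g: g.sample(n=min(len(g), max_ids_per_bin), random_state=seed))
    )
```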
Both are variants of "count Laplace" or "count Gaussian" described in section 1.1 of [1]. If you check Table 2 (on page 19) of that paper, you can see that count Laplace and count Gaussian work well when `max_bins` is small. Our approach does well when `max_ids=1`, but deteriorates rapidly versus alternatives as `max_ids` (or, by extension, a hypothetical `max_bins`) gets large. The experiments in the paper were based on Zipf-distributed data rather than relatively balanced data as in your code example, but the results probably hold in general: `max_bins` > 100 would benefit from using "policy Laplace" or "policy Gaussian". Unfortunately, the "policy" methods are computationally expensive and cannot be done inside the database engine. We have code samples for doing this in-memory with Python and via Spark. Presumably people wanting a very large `max_bins` will be operating on very large datasets, in which case performance becomes a problem. In any case, a straightforward way to apply this would be to select the bins first, using the Spark code sample, then use an inner join on the selected bins to filter the source data down, to feed to SmartNoise using `censor_dims=false`.
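A rough sketch of that two-step workflow is below, assuming the bin selection has already been produced by the separate policy-style code sample. The `snsql.from_df` / `Privacy` / `execute` calls follow the library's usual pattern, but the file names, metadata path, table name, and column names here are placeholders, and the metadata for the table would have `censor_dims` turned off.

```python
import pandas as pd
import snsql
from snsql import Privacy

# Bins selected privately outside the engine (e.g. by the policy Laplace/Gaussian
# sample code); shown here as a placeholder list of released categories.
selected_bins = pd.DataFrame({"category": ["A", "B", "C"]})

source = pd.read_csv("orders.csv")                     # hypothetical source data
filtered = source.merge(selected_bins, on="category")  # inner join keeps only released bins

# The metadata file (hypothetical path) describes `filtered` and sets censor_dims
# to false, since dimension selection has already been done privately.
privacy = Privacy(epsilon=1.0, delta=1e-6)
reader = snsql.from_df(filtered, privacy=privacy, metadata="orders_no_censor.yaml")
result = reader.execute("SELECT category, COUNT(*) FROM orders GROUP BY category")
```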
Having said that, I'm not sure if your use case is really one where you want a large `max_bins`, or is instead one where you're just getting a large `max_bins` because SmartNoise is incapable of distinguishing between the max rows per user and max bins. I assume it's the latter?
We could categorize the common scenarios into 4 groups, depending on whether `max_bins` and `max_ids_per_bin` are bounded or unbounded:
A. `SELECT educ, AVG(income) FROM PUMS GROUP BY educ`. In this example, `max_bins=1` and `max_ids_per_bin=1`, because we assume each user can only have one education level, and we have `row_privacy`. This is the same as `max_ids=1` in our current implementation.
B. `SELECT Region, SUM(TotalPrice) FROM CustomerOrder GROUP BY Region`. We assume each customer is assigned to a single region, but each customer may have an unbounded number of orders. So we would have `max_bins=1` and `max_ids_per_bin=k`, where `k` is some constant to bound the contribution. This is problematic in the current implementation, because we censor dimensions too aggressively based on our need to assume that `max_bins == max_ids_per_bin`.
C. `SELECT ProductCategory, COUNT(DISTINCT CustomerID) FROM Order GROUP BY ProductCategory`. In this case, each user can appear in many bins, but can appear only once in each bin. It's `max_bins=k` and `max_ids_per_bin=1`. Here our dimension censoring will be more acceptable, but we will waste a lot of accuracy by noising the counts at a sensitivity of `k`. (And as we saw in [1], censoring performance degrades as `k` gets large.)
D. `SELECT url, SUM(DwellTime) GROUP BY url`. In this case, each user may visit many web sites, and each user may visit each web site many times. In practice, the number of visits per URL and the number of URLs per user can easily reach into the thousands. So these are practically unbounded, and the curator needs to set `max_bins=k1` and `max_ids_per_bin=k2`.
We wouldn't likely be able to address this in metadata at the table level, but would instead need to address it at the column level. Of course, the simplest pattern would be to assume (and enforce through sampling) that `max_bins_per_user` is 1 for any categorical column, and then use `max_ids` to scale the sensitivity of all measure columns. However, this would break your code example, and in practice people probably have a mix of categorical columns which are always the same for a given user, and categorical columns which are meant to further subdivide the measures for a given user.
We could mark the columns where `max_bins_per_user` is clearly known ahead of time, like:

```yaml
max_ids: 20
region:
  type: string
  max_bins_per_user: 1
sales_month:
  type: string
  max_bins_per_user: 12
```
And then set `max_bins` for tau thresholding using the product of the values (e.g. 1 * 12 for `region` * `sales_month`). Something like `product_name` would be more challenging, since you might want to use a larger `max_bins` for `product_name` if grouping by `region` than if grouping by `sales_month`. For example, grouping by region, you might want to select up to 120 products, whereas grouping by sales_month, you might only want to take 10 products per sales_month. This heuristic would depend heavily on your population size (as we saw above), and on the data distribution. There is also the fact that fields like region and sales_month could be considered public dimensions, and could be handled by an outer join and zero counts rather than dimension censoring.
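A minimal sketch of how that product rule might be computed from per-column metadata is below. The metadata shape and field names follow the hypothetical example above, not the library's current metadata format.

```python
# Hypothetical per-column metadata, matching the example above (not the
# current smartnoise-sql metadata format).
table_meta = {
    "max_ids": 20,
    "region": {"type": "string", "max_bins_per_user": 1},
    "sales_month": {"type": "string", "max_bins_per_user": 12},
}

def max_bins_for_query(table_meta, group_by_columns):
    """Bound on the number of bins one user can occupy for this GROUP BY:
    the product of max_bins_per_user over the grouping columns, falling back
    to max_ids (the current worst-case assumption) when a column is unmarked."""
    bound = 1
    for col in group_by_columns:
        bound *= table_meta[col].get("max_bins_per_user", table_meta["max_ids"])
    return bound

print(max_bins_for_query(table_meta, ["region"]))                 # 1
print(max_bins_for_query(table_meta, ["region", "sales_month"]))  # 1 * 12 = 12
```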
> Having said that, I'm not sure if your use case is really one where you want a large `max_bins`, or is instead one where you're just getting a large `max_bins` because SmartNoise is incapable of distinguishing between the max rows per user and max bins. I assume it's the latter?
Yes, it is the latter. I can work around this with some preprocessing like in my example, but I was confused as to why the preprocessing affected `censor_dims`.

Thanks for your very comprehensive answer. I think I get the gist of it; the 4 groups of cases are very illustrative, but I'll have to go chew on this a bit. Feel free to close this since I think you've answered my question, or leave open if you think you might want to implement some of the `max_bins_per_id` stuff.
Thanks; I'll keep this open as a work item to do at least two things:
In the docs we should also provide some advice about selecting `max_ids`, how to address by pivoting, handling public dimensions, etc. Please add more comments if you think of other heuristics we could use to provide better defaults.
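For the docs, one way to illustrate the "address by pivoting" advice is the preprocessing described in the original report: collapse the data to at most one row per (id, category) with a clipped count, so that `max_ids` only needs to cover the number of categories. A hedged pandas sketch, with column names and the clip value taken from the repro description (this is not library code):

```python
import pandas as pd

# raw data: one row per event, with "id" and "category" columns as in the repro
raw = pd.DataFrame({"id": [0, 0, 0, 1, 1], "category": ["A", "A", "B", "A", "B"]})

# Pivot step: at most one row per (id, category), with the per-user count clipped.
pivoted = (
    raw.groupby(["id", "category"], as_index=False)
       .size()
       .rename(columns={"size": "counts"})
)
pivoted["counts"] = pivoted["counts"].clip(upper=100)  # matches upper=100 in the repro

# Each user now contributes at most one row per category, so max_ids can be set to
# the number of categories (2 here) and the DP query becomes
#   SELECT category, SUM(counts) FROM pivoted GROUP BY category
print(pivoted)
```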
I have a case in which I don't understand the behavior of `censor_dims`. Namely, I run essentially the same query in two different ways but get different results. The data is a table with two columns, `category` and `id`. The two ways are

1. a `COUNT(*) ... GROUP BY category` query with `max_ids=100`
2. a `COUNT(*) AS counts ... GROUP BY category, id` preprocessing step, and then on the result of that a `SUM(counts) ... GROUP BY category` query with `max_ids=2` (because `category` takes two different values) but `upper=100` for `counts`.

I would expect the results to be the same, and they roughly are, except for `censor_dims`, which consistently censors the results in the first case, but not in the second.

I'm not sure if this is an issue with smartnoise-sql or with my understanding of DP/the censor dims mechanism, but I'm hoping to get some clarity on that. Here's a code snippet that illustrates:
Typical output of running the above is
Tagging @fhoussiau because we had a chat about this earlier today.