google / differential-privacy

Google's differential privacy libraries.
Apache License 2.0

Anon functions return 0 where the normal functions return a value of much larger magnitude (nowhere close to 0) #17

Closed AbhishekNalamothu closed 3 years ago

AbhishekNalamothu commented 4 years ago

The anon function returns 0 where the normal function returns a value of much larger magnitude (for example: 20, 30, -25, -40).

Example of the difference between the anon function and normal function results:
anon function: anon_F(D) -> 0
normal function: F(D) -> 20

Giving an end user differentially private aggregates that differ from the true values as much as in the example above might mislead them during their analysis.

Is there a way to handle this?

celiayz commented 4 years ago

Hi Abhishek,

Differential privacy hides the contribution of any single user. If the original function and the anon function give results so different that the analysis is misleading, then the original data set did not contain enough contributing users to make anonymous analysis useful.
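To make the point about contributing users concrete, here is a minimal Python sketch of a plain Laplace mechanism applied to a bounded sum. It is not the library's implementation: the function names, the bounds (0 to 10), epsilon = 1.0, and the assumption that every user contributes exactly one row are all hypothetical choices for illustration.

```python
import random
import statistics


def laplace_noise(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_sum(values, lower, upper, epsilon, rng):
    # Clamp each contribution to [lower, upper], then add Laplace noise
    # calibrated to the largest change one user can cause
    # (assumption: one row per user).
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))
    return sum(clamped) + laplace_noise(rng, sensitivity / epsilon)


def median_relative_error(n_users, per_user=1, lower=0, upper=10,
                          epsilon=1.0, trials=1000, seed=0):
    # Median of |noisy - true| / true over repeated noisy sums.
    rng = random.Random(seed)
    values = [per_user] * n_users
    true_sum = n_users * per_user
    errors = [
        abs(noisy_sum(values, lower, upper, epsilon, rng) - true_sum) / true_sum
        for _ in range(trials)
    ]
    return statistics.median(errors)


if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, "users -> median relative error",
              round(median_relative_error(n), 4))
```

The noise scale depends only on the bounds and epsilon, not on the number of users, so the same noise that is negligible for a group of 2000 users can be a large fraction of the true sum for a group of 20.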

AbhishekNalamothu commented 4 years ago

Thanks @celiayz. How can I avoid this zero problem? If you have any suggestions, could you please share them?

The modified number_of_carrots_eaten data and the following query on it reproduce the zero problem.

select d.animal_group,
       count(1),
       sum(case when count_carrots_eaten = 0 then 1 else 0 end) as zero_counts,
       ((1 - avg(d.count_carrots_eaten)) * 100) as zero_percent,
       avg(d.count_carrots_eaten),
       sum(d.count_carrots_eaten) as carrots_eaten,
       anon_sum(d.count_carrots_eaten, 5) as anon_carrots_eaten
from animals_and_carrots_bin_new d
group by d.animal_group
order by carrots_eaten;

image

In the animal data set, the groups in which 70, 80, or 90 percent of the values are zero show the 0 problem.

From other experiments, I found that the 0 problem depends on several factors: the number of contributing users in a group, and how many of those users have zero values.

How can I avoid this problem?

Once again thank you so much.

celiayz commented 4 years ago

The best way to avoid the problem is to add more data. You can also try increasing the value of epsilon and using manually specified bounds (e.g., ANON_SUM(column, lower, upper, epsilon)).
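As a rough illustration of why epsilon and the bounds matter, the Python sketch below computes the Laplace noise scale for a bounded sum under a textbook Laplace mechanism. It assumes one contribution per user and a sensitivity of max(|lower|, |upper|). The function name and the example parameter values are hypothetical, and the library's actual noise calibration may differ.

```python
def laplace_scale(lower, upper, epsilon):
    """Noise scale for a bounded sum under a textbook Laplace mechanism.

    Assumption: each user contributes one value clamped to [lower, upper],
    so one user can change the sum by at most max(|lower|, |upper|).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return max(abs(lower), abs(upper)) / epsilon


if __name__ == "__main__":
    # Hypothetical settings: wide bounds, tighter bounds, then larger epsilon.
    for lower, upper, epsilon in [(0, 100, 1.0), (0, 10, 1.0), (0, 10, 2.0)]:
        print(f"bounds=[{lower}, {upper}] epsilon={epsilon} "
              f"-> scale={laplace_scale(lower, upper, epsilon)}")
```

Tighter bounds and a larger epsilon both shrink the noise, but a larger epsilon also weakens the privacy guarantee, so it is a trade-off rather than a free fix.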

AbhishekNalamothu commented 4 years ago

Thanks @celiayz for your prompt response.

Suppose we have a larger data set, aggregated into 'n' groups, where 'm' of the groups have far fewer data points than the remaining (n-m) groups. Should we expect those 'm' groups to return 0 on aggregation?

What if I do not have more data to add? My use case requires not providing bounds: I want to use the approx_bounds provided by Google to detect the bounds automatically. Also, I am afraid that increasing epsilon may cause a security problem.

celiayz commented 4 years ago

Yes, for the 'm' groups that have fewer contributing users, we expect that the query could return null or 0 for that group.

If there is no more data to add, then unfortunately the data set is too small to hide the contribution of a single user statistically. In that case, differentially private analysis is probably not appropriate for the data set.

AbhishekNalamothu commented 4 years ago

@celiayz, returning null would be fine for analysis, but returning 0 misleads the analysts. Is there a way to make it return null instead of 0? Also, if there is an error, or not enough data to process, wouldn't it be better to return an "error" rather than 0, since 0 is not an error value?

Thank you @celiayz

celiayz commented 4 years ago

I see. The fact that it returns 0 is likely an implementation detail of the noising + snapping mechanisms: when the value is close enough to 0, the answer gets snapped down to 0. Since the aggregation functions do not know that there isn't enough data, and no error has occurred, they return 0 instead of null. Therefore, I don't see any meaningful way to get the library to return null instead of 0.
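A minimal Python sketch of what snapping to a grid does to values near zero. This is only an assumption about the general idea (rounding a noisy value to the nearest multiple of some granularity). The function name and the granularity of 1.0 are hypothetical, chosen large for illustration, and not taken from the library.

```python
import math


def snap_to_grid(value, granularity):
    """Round `value` to the nearest multiple of `granularity` (ties round up)."""
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    return math.floor(value / granularity + 0.5) * granularity


if __name__ == "__main__":
    GRANULARITY = 1.0  # hypothetical value, for illustration only
    for noisy in (0.3, -0.4, 0.6, 20.4):
        print(noisy, "->", snap_to_grid(noisy, GRANULARITY))
```

Any noisy value within half a grid step of zero comes out as exactly 0, which is indistinguishable from a genuine zero result.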

dibakch commented 3 years ago

Closing this for now. Feel free to re-open.