google / differential-privacy

Google's differential privacy libraries.
Apache License 2.0

Do the ZetaSQL examples support JOIN queries? #233

Open qascade opened 1 year ago

qascade commented 1 year ago

I want to write a ZetaSQL query that joins two tables on a single private column and runs ANON_COUNT() on the result. For example, given two tables, table1 and table2, that share an email column:

SELECT WITH ANONYMIZATION OPTIONS(epsilon={{epsilon}}, delta={{delta}}, kappa={{kappa}})
ANON_COUNT(email CLAMPED BETWEEN 0 AND 300) AS common_emails FROM table1 JOIN table2 USING (email)

Is this possible? It would also be great if it could be done with the Go library.

dibakch commented 1 year ago

In general, ZetaSQL lets you join tables and apply DP on top of the result. However, our sample binary, execute_query, accepts only a single table, defined in a CSV file. You can modify the source code in examples/zetasql/execute_query.cc to define a second table in C++ using zetasql::MakeTableFromCsvFile, and then declare email as the user ID by calling the SetAnonymizationInfo method on each table.

ZetaSQL is written in C++ and uses the C++ DP library.

dibakch commented 1 year ago

Let's use this issue to gauge interest in this feature. Joins in DP queries could be interesting to try out because they are not straightforward: the join has to propagate the column that identifies a user so that the DP aggregation can use it.
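To make the propagation point concrete, here is a minimal plain-Python sketch. It does not use ZetaSQL or the DP library, and the tables, row values, and function names are all hypothetical. It shows an inner join that keeps the email column on every output row, so that a later aggregation can still group rows by user and clamp each user's contribution. Noise addition is deliberately left out; only the bounding step is shown.

```python
from collections import defaultdict

# Hypothetical toy tables: each row is (email, payload).
# The email column plays the role of the user ID (the privacy unit).
table1 = [
    ("a@x.org", "t1-r1"),
    ("a@x.org", "t1-r2"),
    ("b@x.org", "t1-r3"),
    ("c@x.org", "t1-r4"),
]
table2 = [
    ("a@x.org", "t2-r1"),
    ("a@x.org", "t2-r2"),
    ("a@x.org", "t2-r3"),
    ("b@x.org", "t2-r4"),
    ("d@x.org", "t2-r5"),
]


def inner_join_on_email(left, right):
    """Inner join on email that keeps the email on every output row."""
    right_by_email = defaultdict(list)
    for email, payload in right:
        right_by_email[email].append(payload)
    return [
        (email, left_payload, right_payload)
        for email, left_payload in left
        for right_payload in right_by_email.get(email, [])
    ]


def per_user_counts(joined):
    """Number of joined rows owned by each user."""
    counts = defaultdict(int)
    for email, _, _ in joined:
        counts[email] += 1
    return dict(counts)


def bounded_count(joined, lower, upper):
    """Sum of per-user row counts, each clamped to [lower, upper]."""
    counts = per_user_counts(joined).values()
    return sum(min(max(c, lower), upper) for c in counts)


joined = inner_join_on_email(table1, table2)
print(len(joined))                    # 7 rows: 2*3 for a@x.org, 1*1 for b@x.org
print(bounded_count(joined, 0, 3))    # 4: a@x.org is clamped from 6 to 3
print(bounded_count(joined, 0, 300))  # 7: no user exceeds the bound
```

If the join dropped the email column, `per_user_counts` would have nothing to group by, and the per-user clamp that bounds the sensitivity could not be applied.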

qascade commented 1 year ago

I am trying to use the DP library to run SQL queries that inherently support DP. Section 4 of the DP SQL paper discusses aggregation with joins and compares it with previously built DP SQL engines. Joins, especially inner joins, are among the most sought-after queries. I think we should have an example of a join and of how it affects the accuracy of the results.
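As a rough illustration of the accuracy question, the sketch below assumes a Laplace mechanism on a count whose per-user contribution is clamped to at most `upper` rows, so the L1 sensitivity is `upper` and the noise scale is `upper / epsilon`. The fan-out numbers are invented, and the actual library may choose a different mechanism or split the budget differently, so treat this as a back-of-the-envelope model rather than a description of what ZetaSQL does.

```python
import math


def laplace_scale(upper, epsilon):
    """Laplace scale for a count with per-user contributions clamped to [0, upper]."""
    return upper / epsilon


def expected_error(fanouts, upper, epsilon):
    """Return (clamping_bias, noise_std) for a clamped Laplace count.

    fanouts: number of joined rows each user owns (hypothetical data).
    clamping_bias: rows lost because users were clamped to `upper`.
    noise_std: standard deviation of Laplace noise, sqrt(2) * scale.
    """
    true_total = sum(fanouts)
    clamped_total = sum(min(c, upper) for c in fanouts)
    bias = true_total - clamped_total
    noise_std = math.sqrt(2) * laplace_scale(upper, epsilon)
    return bias, noise_std


# Hypothetical join fan-out: 99 users own one joined row, one user owns 300.
fanouts = [1] * 99 + [300]

for upper in (1, 10, 300):
    bias, noise_std = expected_error(fanouts, upper, epsilon=1.0)
    print(f"upper={upper:>3}  clamping bias={bias:>3}  noise std={noise_std:.2f}")
```

Under these assumptions a small bound keeps the noise low but discards most of the heavy user's rows, while a bound large enough to cover the join fan-out removes the bias at the cost of proportionally larger noise.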

qascade commented 1 year ago

Section 2 of the Flex paper analyzes in detail the kinds of queries that practical differential privacy for SQL needs to support, which also backs up my claim above.