patcg-individual-drafts / ipa

Interoperable Private Attribution (IPA) - A Private Measurement Proposal

add section on Event Labeling Queries #70

Open bmcase opened 1 year ago

bmcase commented 1 year ago

There has been considerable interest in supporting event level outputs in IPA (IPA issue 60, PATCG issue 41). This was discussed at the May 2023 PATCG meeting, where the consensus was that we could support this so long as we can enforce a per-user bound on the information released.

Here we outline a first draft of how an Event Labeling Query that labels events with noisy labels can be done in a way that lets us maintain a per-matchkey DP bound. We also consider how these new queries can be compatible with an IPA system that flexibly supports either aggregation queries or event labeling queries.
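For concreteness, a noisy-label mechanism of the kind discussed here could be binary randomized response. The following is a minimal sketch, not code from the draft; the function names and parameters are illustrative.

```python
import math
import random

def flip_probability(epsilon: float) -> float:
    # Binary randomized response: keep the true label with
    # probability e^eps / (e^eps + 1), flip it otherwise.
    return 1.0 / (math.exp(epsilon) + 1.0)

def label_events(true_labels, epsilon: float, rng=random.random):
    # Return noisy binary labels for one matchkey's events by
    # independently flipping each label with probability p_flip.
    p_flip = flip_probability(epsilon)
    return [lbl ^ (rng() < p_flip) for lbl in true_labels]
```

At `epsilon = 0` the flip probability is 1/2 and the output is independent of the input; as `epsilon` grows, labels are returned faithfully with high probability.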

w/ @benjaminsavage

csharrison commented 1 year ago

A few quick notes:

  1. For dealing with users with multiple contributions I think it will probably be better to cap the sensitivity than to scale the noise dynamically.
  2. Supporting binary labels is good but ideally we would support enum labels / value bucket labels (https://github.com/patcg-individual-drafts/ipa/issues/55, https://github.com/patcg-individual-drafts/ipa/issues/60)

cc @badih
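On point 2, one natural generalization of binary label flipping to enum labels is k-ary randomized response. This is a sketch of that standard mechanism, not anything specified in the linked issues; names and parameters are illustrative.

```python
import math
import random

def kary_rr(value: int, k: int, epsilon: float, rng=random.random):
    # k-ary randomized response over labels {0, ..., k-1}:
    # keep the true label with probability e^eps / (e^eps + k - 1),
    # otherwise output one of the other k - 1 labels uniformly.
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng() < p_keep:
        return value
    other = int(rng() * (k - 1))  # uniform index over the other labels
    return other if other < value else other + 1
```

For `k = 2` this reduces to the binary flip mechanism with flip probability `1 / (e^eps + 1)`.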

bmcase commented 1 year ago

@csharrison

For dealing with users with multiple contributions I think it will probably be better to cap the sensitivity than to scale the noise dynamically.

Yeah, that would be a nice option. I actually started out trying to do it that way, but ran into some issues. Once you pass the cap, you still have to return something for that source row. For rows that exceed the cap, you can't return a "not labeled" indicator, since that can leak information. I think you could try to return just a random label, maybe by setting it to 0 and then flipping with the same flip probability as other rows, but it wasn't immediately clear to me that this wouldn't increase the DP bound on how much information is revealed about that user. It also wasn't clear to me whether random labels on some events would be worse for utility than scaling all the noise.

I think we can probably make it work, but like I said this was just a quick first draft to get started.
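The "set to 0 past the cap, then flip like every other row" idea described above could be sketched as follows. This is a hypothetical helper, not draft text; in particular, which events count toward the cap (here, the first `cap` in list order) is a policy choice the draft would need to fix.

```python
import math
import random

def cap_and_label(true_labels, cap: int, epsilon: float, rng=random.random):
    # Label all rows for one matchkey, but only the first `cap` rows
    # contribute their true label; rows past the cap are treated as
    # label 0 and pass through the same flip, so the output alone
    # does not reveal which rows were capped.
    p_flip = 1.0 / (math.exp(epsilon) + 1.0)
    out = []
    for i, lbl in enumerate(true_labels):
        effective = lbl if i < cap else 0  # suppress contributions past the cap
        out.append(effective ^ (rng() < p_flip))
    return out
```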

csharrison commented 1 year ago

I think you could try to return just a random label, maybe by setting to 0 and then flipping with the same flip probability as other rows, but it wasn't immediately clear to me that this also wouldn't increase the DP bound on how much information was revealed about that user.

I think this should be fine for a DP bound, and the argument will look very similar to sensitivity capping + Laplace with per-event queries. Consider two adjacent databases D and D'. After rate-limiting to sensitivity $s$, a user's contribution will look like $(x_1, x_2, \ldots, x_n)$ and $(x'_1, x'_2, \ldots, x'_n)$ for n events, where the value of each $x_i$ is the label. After capping, these vectors will differ in at most $s$ columns[^1]. If you consider an output of $(y_1, \ldots, y_n)$, then we have

$$\frac{\Pr[y_1,\ldots,y_n \mid x'_1,\ldots,x'_n]}{\Pr[y_1,\ldots,y_n \mid x_1,\ldots,x_n]} = \prod_{i=1}^{n}\frac{\Pr[y_i \mid x'_i]}{\Pr[y_i \mid x_i]} = \prod_{i \in \text{differing rows}}\frac{\Pr[y_i \mid x'_i]}{\Pr[y_i \mid x_i]}$$

That is, we can analyze each of the $s$ differing columns independently, and if you can show that the mechanism on a single column is bounded by $e^\epsilon$, it implies that the whole mechanism is $s\epsilon$-differentially private.

[^1]: This is with an add-remove notion of DP that will just strip conversions for a converting user or add fake conversions for a non-converting user. With a "replacement" notion of DP that merely modifies the conversion patterns for a user the rows will differ on $2s$ values and everything goes through with a factor of 2 in the bound.
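The single-column bound in the argument above can be checked numerically for binary randomized response with flip probability $1/(e^\epsilon + 1)$. This sketch is not part of the thread; the function names are illustrative.

```python
import math

def rr_prob(y: int, x: int, epsilon: float) -> float:
    # Pr[output y | true label x] for binary randomized response
    # with flip probability 1 / (e^eps + 1).
    p_flip = 1.0 / (math.exp(epsilon) + 1.0)
    return p_flip if y != x else 1.0 - p_flip

def worst_case_ratio(epsilon: float) -> float:
    # max over outputs y and inputs x, x' of Pr[y | x'] / Pr[y | x];
    # epsilon-DP on a single column requires this to be at most e^eps.
    return max(
        rr_prob(y, x2, epsilon) / rr_prob(y, x1, epsilon)
        for y in (0, 1) for x1 in (0, 1) for x2 in (0, 1)
    )
```

Since each of the $s$ differing columns contributes a ratio of at most $e^\epsilon$ and the per-column ratios multiply, the product is at most $e^{s\epsilon}$, matching the $s\epsilon$ bound above.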

martinthomson commented 1 year ago

GitHub fail: [screenshot of the unrendered math]

Shame that I'm not smart enough to parse that out myself.

bmcase commented 1 year ago

Updated the doc to include the version that caps the number of events labeled per matchkey.