patcg / meetings

Meeting materials for the Private Advertising Technology Community Group

Agenda Request - Should PATCG be opinionated on which technologies are used to enable privacy? #39

Closed eriktaubeneck closed 2 years ago

eriktaubeneck commented 2 years ago

Agenda+: Should PATCG be opinionated on which technologies are used to enable privacy?

In the private measurement use case, we’ve seen a number of different approaches to establishing privacy across the various proposals and solutions in the market.

In many cases, a given use case can be supported with multiple technologies. For example, the Attribution Reporting API in the WICG proposes two solutions to enable Aggregate Attribution Measurement: one supported by Multi Party Computation (MPC) and one supported by Trusted Execution Environments (TEEs).

One interesting component here is that from the client point-of-view, the implementation is similar; the difference is primarily on the server(s) involved in the aggregation of multi-client events. In fact, the base aggregation proposal even proposes (in the future) adding the ability for the user of the API to choose among different aggregation services.
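
As a concrete (and heavily simplified) illustration of the MPC-style variant, here is a sketch of additive secret sharing, the primitive such aggregation services are typically built on. This is not the Attribution Reporting API's actual protocol; the modulus, function names, and values below are all made up for illustration:

```python
import secrets

MODULUS = 2**32  # illustrative modulus, not taken from any spec

def split_into_shares(value: int) -> tuple[int, int]:
    """Additively secret-share one client's value between two helper servers."""
    share_a = secrets.randbelow(MODULUS)
    share_b = (value - share_a) % MODULUS
    return share_a, share_b  # each share alone is uniformly random

def aggregate(shares):
    """A helper sums the shares it holds without ever seeing a raw value."""
    return sum(shares) % MODULUS

# Three clients report conversion values 1, 0, and 2.
reports = [split_into_shares(v) for v in (1, 0, 2)]
total = (aggregate(a for a, _ in reports) + aggregate(b for _, b in reports)) % MODULUS
assert total == 3  # only the combined aggregate is revealed
```

Roughly speaking, the TEE variant reaches the same end state by a different route: the client encrypts its report toward an attested server-side environment rather than splitting it into shares, which is why the client-side implementation can look so similar.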

From what I understand, these two technologies (MPC and TEE) are fairly different constructions, and may be seen in different lights by different implementers. As such, it seems like it would be a worthwhile agenda item for an upcoming meeting to discuss these technologies, and their viability for use within the proposals coming from this community group.

Specifically, I am proposing discussing and finding consensus on:

  1. Aligning on our high-level privacy and security goals. As a starting point (very much up for debate), I’d suggest that proposals should provide:
    1. Client data secrecy: Any parties involved (ad tech providers, helper servers, cloud service providers) should not be able to observe individual-level client data beyond what first parties can directly observe (e.g., cross-site data should be protected).
    2. Purpose limitation: Any parties involved (ad tech providers, helper servers, cloud service providers) should not be able to utilize the proposed API for purposes beyond the specified limited use case.
    3. Correctness: Any parties involved (ad tech providers, helper servers, cloud service providers, clients) should not be able to disrupt the correctness of the output.
  2. Presentation from experts in MPC and TEEs about how those technologies can enable the above privacy and security goals.
  3. A recommendation from the group as to what technologies sufficiently enable the above privacy and security goals, to inform what should / should not be used in proposals within this CG.
    1. This is likely not achievable in the next working session; it is more likely a work item to take up afterward. A good starting point would be to create a shared understanding of the bounds of the technologies that might be used.
marianapr commented 2 years ago

I agree this topic does merit discussion in the next PATCG meeting

alextcone commented 2 years ago

> I agree this topic does merit discussion in the next PATCG meeting

+1

rmirisola commented 2 years ago

+1

AramZS commented 2 years ago

This looks like a good topic for discussion to me as well.

eriktaubeneck commented 2 years ago

Great, looking forward to discussing it. I’m happy to take point on organizing this agenda item.

For 1/, this is closely related to the work that @darobin is doing in #18 on Privacy Principles. I imagine the properties I describe above could be part of that document. @darobin, I’m not sure where you’re at with that document, but do you have any desire to lead this part of the conversation?

For 2/, my hope is that some of the experts in the areas of MPC and TEE would be willing to present how/why those technologies enable the properties described in 1/ (as well as the limitations and known issues). Please consider this a request for speakers, and please volunteer in this comment thread.

For 3/, I don’t expect to reach consensus in this meeting, but (with some luck) the conversation for 1/ and 2/ may create enough context so that we can begin to advance this discussion in issue(s). I would suggest these issues be discussed on the docs-and-reports repo, ideally with suggested additions/changes to the (not yet posted) Privacy Principles doc.

darobin commented 2 years ago

I intend to put together a relatively empty shell of a doc so that people can start hacking on it, but I don't anticipate having a huge amount of text for the group to discuss.

If it's helpful, I would be happy to introduce the TAG's approach to privacy principles, on which we are basing our (advertising-specific) principles. The TAG's doc isn't fully baked, but I think it's in good enough shape to usefully inform our work. This can also help decide what needs to go into the PATCG doc versus what can be delegated to more general principles. Would that make sense?

I strongly agree about getting people to talk about MPC, etc. and setting ourselves up for successful async work afterwards.

eriktaubeneck commented 2 years ago

@darobin that sounds great, thank you!

marianapr commented 2 years ago

For the discussion on this topic, it may be useful to reach out to some experts who are outside this group and invite them to give a presentation and/or participate in the discussion.

eriktaubeneck commented 2 years ago

> For the discussion on this topic, it may be useful to reach out to some experts who are outside this group and invite them to give a presentation and/or participate in the discussion.

Yes, agreed! @marianapr, would you be willing to reach out to anyone on either of these topics? I'm planning to do the same on my side as well.

ajknox commented 2 years ago

I'd like to suggest Andras Slemmer @exFalso as an expert for TEE. He has hands-on experience shipping products with several technologies, including Intel SGX, AMD SEV, and the TEE-adjacent AWS Nitro.

exFalso commented 2 years ago

Hi there, happy to talk about modern TEE techs!

Rough outline:

I'm not an expert on MPCs so I can't do a fair comparison, but I can talk about how TEEs can be used to implement the goals laid out in point 1 in the original post.

marianapr commented 2 years ago

I think on the MPC side, some good people to reach out to would be Dan Boneh and Nigel Smart, both of whom have great experience in this area, including from a practical perspective. On the TEE side of things, Daniel Genkin comes to mind; he has done a lot of hardware security and side-channel work.

betuldurak commented 2 years ago

Thanks @marianapr for such a good idea, and @ajknox for suggesting a hands-on engineer from decentriq, which is a good call. I believe we may use two sessions: one on the capabilities of TEEs (to make us familiar with them) and the other on security vulnerabilities/guarantees (to address some of the concerns raised about these specific technologies). It would be great if we could bring in an academic who has worked on side-channel analysis and attacks and who is also engaged with the BlackHat and DefCon community (where the real stuff happens). I don't work in that domain, but it was pointed out to me that Yuval Yarom or Daniel Gruss would be valuable guests to talk about their attacks on TEEs and give us perspective.

For MPC, I support Nigel or Boneh's invitation.

Given that it would be difficult to fetch academics on short notice, we can probably discuss these invitations in the upcoming meeting.

eriktaubeneck commented 2 years ago

Thanks folks! @AramZS, do we have a rough idea of which day and how much time we'd like to spend on this agenda item? I'm happy to organize the different sub-components.

@exFalso, it would be great to have you give this talk. Once we have the day and timing, I'll post here to confirm with you. Probably plan for about 10 mins.

@marianapr, would you be able to reach out to Dan Boneh or Nigel Smart to see if they'd be able to join this conversation?

@betuldurak, thanks for the suggestions. Understood that it may be difficult given the short timing, but would you be able to reach out to either of your suggested speakers to join as well?

I don't expect this conversation to end at this meeting; I expect it to continue async. However, given that meetings are quarterly, it would be great to have as much context going into this meeting as possible.

marianapr commented 2 years ago

Both Dan and Nigel are interested in joining a discussion in principle, but it will really depend on the exact scheduling: they are in Europe and CA. It may be best if the chairs coordinate directly with them about what times might work; I will be happy to make an introduction.

AramZS commented 2 years ago

I think we're locking into 9-12 ET (translate to local timezones) to help with connecting our European participants. I think our target is to take on this topic on April 5th (day 1), and seeing that there is a lot of potential discussion here and numerous participants, I think we should take 1h 15m to cover this topic on day one. @eriktaubeneck, do you think that's reasonable? We've got some topics that we should get to at the top, but anticipate starting at the 30m point on day 1.

@eriktaubeneck does that sound good to you? I can then leave it to you to organize this chunk of the session. I think we also would want to hear from @darobin during this time to give us insight into the TAG privacy approach.

betuldurak commented 2 years ago

Yuval Yarom @javali7 will join us to give an overview on TEEs, followed by discussion and Q&A. @AramZS, could you please coordinate with him about the day/time?

eriktaubeneck commented 2 years ago

@AramZS that works for me, thanks!

@betuldurak, I can follow up with @javali7 on timing. Thanks so much for reaching out.

eriktaubeneck commented 2 years ago

Nigel Smart is unable to make it, but sent this summary (with permission to share it here with the group):

  1. TEEs
    1. Fast to process data
    2. You need to trust the person holding the TEE to not run a side-channel attack. These have not yet been run on non-crypto code [to my knowledge] on a TEE, but that is going to be easier than breaking crypto code on a TEE IMHO. Side-channels are inherently going to be a problem with TEEs as the basic computer architecture is inherently full of deep pipelines, caches and prediction mechanisms. This is not going to go away soon; we have had 30 years of computer architecture work going in the other direction.
  2. MPC
    1. Can process some things very very fast.
    2. Requires multiple parties
    3. Here the real business benefit is not really processing private information, but allowing different parties to come together to unlock data which they could not [or did not want to] do before. More partnership enhancing technology rather than privacy enhancing technology.

This last point is often missed in industry when I talk to them. Private computing is in some sense a poor analogy. There is no point computing something privately if the thing you compute breaks privacy [50% of all proposed applications I have seen do exactly this]. To apply MPC or TEEs one may need to re-engineer the end application, which may be impossible [a lot of data science workflows have this problem]. This wipes out another 25% of the proposals that come across my desk for applications. That leaves 25% of existing applications potentially suitable. Rule out another 90% due to the tech not being fast enough and you get the small percentage for which TEE/MPC/FHE can be currently applied.

However, the important point above is about existing applications. The real benefit is doing stuff, and opening new opportunities, which were not available before. Thus the term "private compute" makes people think of taking some existing computation and making it private. This leads to straitjacketed thinking, so IMHO it's best to avoid such terms and concentrate on "partnership enabling technologies".

  • Even TEEs require two parties: the one who has the TEE and the one who sends the data to it.
  • FHE has two parties: the one who computes, and the one who decrypts.
  • MPC obviously has more than one party ;-)

eriktaubeneck commented 2 years ago

I've written up a draft timeline for this agenda item. A few things that I want to highlight:

  1. I'm still looking for someone to present on MPC. I'm continuing to reach out (and if anyone is willing to volunteer, please do so here!) Worst case, I can do my best to give a high level overview to at least kick off the conversation. I expect this will be an ongoing conversation in an issue, and my goal isn't to answer all questions in this meeting.
  2. I reached out to @darobin about his overview of TAG privacy principles, but given the amount of content already in this section, he suggested we may be better suited to move that overview to day 2 along with other PAT principles. @AramZS, any thoughts here?

AramZS commented 2 years ago

1: I think that's fine.

2: @eriktaubeneck I think that's fine, it does look like you are filling the space nicely without @darobin's help.

bmayd commented 2 years ago

I think the original question posed in this issue (should PATCG be opinionated on technologies?) is really important for us to address, as is the first identified discussion item: aligning on our high-level privacy and security goals.

In my opinion, discussion of foundational topics like these should happen independent of, and ideally before, discussion of specific implementations. I'm not suggesting that we come to final agreement, but that we have general consensus regarding them, so that as we explore potential implementations we have a pretty good idea of what they need to support. I find that reasoning about what we are trying to accomplish becomes much more complex in the context of discussion about how it is done, and talking about the how is much harder if we haven't settled on the what.

So, I'd like to suggest we take up separately the questions:

AramZS commented 2 years ago

@bmayd Do you think the privacy principles document is the right place to have that discussion (on the second point)?

bmayd commented 2 years ago

Perhaps; I assumed the privacy principles doc was more specifically focused on a working definition of privacy and the conditions under which it is preserved or compromised. I thought of it as a precursor that would guide and inform discussions of utility, reliability, trustworthiness, etc., and that we might consider each of those aspects in light of the privacy principles, but in their own right.

AramZS commented 2 years ago

@bmayd Ok, I'm wondering about sequence here: does it make sense to have the privacy principles discussion this time, and then start defining how that might support those discussions in our sync session next week? Should we add a note to discuss this in that slot, along with what @darobin is going to go through? Or do we slot it for our next meeting at the top of the list?

eriktaubeneck commented 2 years ago

I can provide more details shortly, but I disagree with the need for strict sequencing here. I also don't expect this to be the end of this conversation, and I think it would be ideal to start it this meeting.

eriktaubeneck commented 2 years ago

The way that I've been thinking about this is that Private Computation (to Nigel's point above, probably worth renaming) is an abstract tool which we can use, and which has the three properties defined in my initial comment (client data secrecy, purpose limitation, correctness). We can visualize this as:

```mermaid
flowchart LR
  A[1. Client Data] --> |Encrypted| B[2. Private Computation]
  B --> |Revealed| C[3. Output]
```

Now (again, to Nigel's point above), simply reconstructing any computation into this form doesn't actually make it private. As an oversimplified strawman, if your computation were `select * from all_client_data`, this clearly wouldn't be private. This presents two main questions:

  1. Is there consensus in this group on outputs that are private? (Dependent on consensus on a working definition of privacy.)
  2. Given an output that is deemed to be private, is there consensus in this group as to the technologies which could achieve the private computation properties described above?

I don't believe there is a hard dependence between these two, unless the answer to 1/ is that there is no consensus on any output (which, given other conversations, seems unlikely).
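
To make the strawman above concrete, here is a toy contrast (a sketch only, not from any proposal; the salary figures and parameters are invented, and the noise calibration assumes each person contributes a single value):

```python
import random

salaries = [52_000, 61_000, 48_500, 75_250]  # hypothetical client data

def leak_everything(data):
    """Not private: echoes individual rows, like `select * from all_client_data`."""
    return list(data)

def noisy_sum(data, epsilon=1.0, bound=100_000):
    """Arguably private: a single aggregate with Laplace(bound/epsilon) noise,
    assuming each person contributes one value clipped to [0, bound]."""
    clipped = [min(max(x, 0), bound) for x in data]
    # Laplace sample as the difference of two exponentials with rate epsilon/bound.
    noise = random.expovariate(epsilon / bound) - random.expovariate(epsilon / bound)
    return sum(clipped) + noise

print(noisy_sum(salaries))  # close to 236,750, but never exact
```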

There is a third question, which is dependant on 2:

  3. Given a set of technologies that this group agrees satisfies the properties of a private computation, which parties are semi-trusted to participate in such a protocol? (Dependent on the particular threats that participation may create, which are different for different technologies.)

For now, I propose we move @darobin's content to the slot on the second day, as we discussed above. I see that squarely addressing 1/, and propose we keep the agenda for this issue (addressing 2/) as is (though I will adjust slightly to remove @darobin's slot).

bmayd commented 2 years ago

As an example of what concerns me: the diagram above assumes client data, sensitive enough to require encryption, is going to be both needed and made available to private compute environments, and asserts that the questions to be addressed are around the outputs of those environments. I'm somewhat uncomfortable with that initial assumption, and I think we would do well to be explicit about it and determine whether there is general agreement with it before we devote significant resources to discussing solutions.

If folks have already determined that sensitive data is required and that it will be available, I would find it very helpful to preface discussion of the technologies with review of those things.

eriktaubeneck commented 2 years ago

@bmayd I'm not sure I agree with your framing. Specifically:

> If folks have already determined that sensitive data is required and that it will be available

I think (and am assuming) that we have broad consensus that individual cross site behavior data is considered sensitive data, and we have a number of measurement proposals (Attribution Reporting API, IPA, PCM) which construct some way of making that available.

To be clear, the diagram above is not meant to be the framework for everything this group does, but it does seem to be a helpful abstraction for a common pattern that emerges in some of the existing proposals within this space.

> I would find it very helpful to preface discussion of the technologies with review of those things.

This review is better suited for other topics on the agenda, specifically the update on the Privacy Principles and the consensus on the charter. Unless the consensus is that we will do nothing in the form of a private computation, I don't see any harm in making progress on a discussion of the underlying technologies this group is already proposing to leverage.

exFalso commented 2 years ago

> Is there consensus in this group on outputs that are private? (Dependent on consensus on a working definition of privacy.)

My understanding is that there is a qualitative difference between privacy and confidentiality.

Privacy concerns information loss (or, equivalently, privacy loss), so it's really a property of an algorithm. For example, a function that takes a list of salaries and outputs the average can be argued to be somewhat private, because it reveals some but not all information about the input. A function that takes the input and returns noise is completely private (and useless). And a select function is not private at all. The widely accepted formalism for reasoning about privacy is differential privacy, which has various (actually, too many) definitions, but basically tries to bound how much information about the input one can reconstruct from the algorithm's (or "mechanism's") outputs.

Confidentiality is a property of the data: a piece of data is confidential if only authorized parties have access to it. So e.g. my browser traffic to https://github.com is confidential, because only I and the github server have access to it. Whereas if I browsed a plain http:// website, the traffic wouldn't be confidential, because intermediate network nodes can access the data.

Generally speaking, confidentiality is well understood and fairly "easy"; privacy, on the other hand, is very difficult. This is because privacy is stateful: if you reveal the average salary and, independently, the median salary, then the privacy loss compounds, and it's very difficult to track this properly. One of the nice properties of differential privacy is that the measure of privacy loss is additive, but it is still very hard to actually implement a privacy-tracking system (you need to introduce some notion of a "privacy quota"/"privacy budget").

With the computations I have worked with, we basically sidestepped the issue of measuring privacy loss by saying that all parties need to agree on the computation that is about to take place on their joined data. The downside of this is that there is no flexibility around the algorithm after the provisioning of the data. Whereas if we had a privacy-tracking system, then one could provision their data with the knowledge that no matter what computation is run against it, the privacy loss is limited. But again, such a system is very difficult to implement in practice.
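
A minimal sketch of the "privacy quota"/"privacy budget" idea mentioned above, relying only on basic sequential composition (real accounting systems are considerably more sophisticated, and the epsilon values here are arbitrary):

```python
class PrivacyBudget:
    """Toy tracker exploiting differential privacy's additive composition:
    each query's epsilon is charged against a fixed quota, after which
    further queries are refused."""

    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def spend(self, epsilon: float) -> bool:
        if epsilon > self.remaining:
            return False  # budget exhausted: refuse the query
        self.remaining -= epsilon
        return True

budget = PrivacyBudget(total_epsilon=1.0)
assert budget.spend(0.4)      # e.g. release a noisy average salary
assert budget.spend(0.4)      # e.g. release a noisy median salary
assert not budget.spend(0.4)  # a third query would exceed the quota
```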

darobin commented 2 years ago

Dear all,

I don't think that we can usefully "define privacy" for the context in which we work, but I think that we can reach consensus about how to manage privacy topics here. That's definitely worth discussing.

Privacy is the set of rules that govern information flows that are about people or that impact people. These rules are context specific and contexts can nest and overlap. (For the nerds out there, this is grounded in the GKC Privacy framework; it's well adapted to commons situations like ours.)

For the general question of privacy on the web, the TAG has a set of principles. They're not finalised, but they're usable. For the more specific context of advertising on the web, we should inherit and specialise them. To give one example, the TAG's principles have stern warnings about consent but don't rule it out entirely. The reason for that is that some web processing is meaningfully consentable (e.g. "[ ] Receive newsletter"). For advertising contexts, where the threat is the sharing of browsing data, we know that that's not consentable, and so we can have a specialised rule that excludes that approach.

I don't think that we can a priori come up with all the rules at once. We're going to find some corner cases because reality is complicated and that's fine. But we can agree to ground in TAG and to elaborate new rules as we progress. I think that that's actually better and more realistic than coming up with all the rules first.

Does that mean that we should be opinionated as to technologies? Yes, but we don't necessarily know how yet. One crucial aspect of commons thinking is that only the rules that are actually enforced count ("rules in use"). We're going to want rules that we can prove work, not pinky promises. (I think this is a better framing than the "mostly technical" requirement, even though it's often the same.) So being on the same page as to which technologies work for what strikes me as particularly useful. It's like a shared toolbox. I think of these techs as "ways to be opinionated." It doesn't mean that we put the cart before the ox.


marianapr commented 2 years ago

I agree with Erik's formulation of the problem and with the fact that many solutions will be leveraging Private Computing techniques (certainly all the designs we have discussed so far do). Understanding the different technologies that have been put forward, and the guarantees and properties that they offer, in my opinion does merit broad discussion in this forum. I expect that in this upcoming meeting we will only be able to open the discussion and, hopefully, come up with a plan for how to bring in all the necessary expertise to continue in following meetings.

exFalso commented 2 years ago

Slides of presentation: https://docs.google.com/presentation/d/1IEzYfZdzSDVP8pPAAzuD4GYNVbyX25iPXm6U7Equy2A/edit?usp=sharing

AramZS commented 2 years ago

Presentation 1: https://docs.google.com/presentation/d/1huChDQU_JsEk5TgcSqZXc7kRvlh2f4urqlVfY2WIjn4/edit#slide=id.g1231d745f33_0_189

AramZS commented 2 years ago

Thanks to all participants. I suspect we will be revisiting this question in the future, but for now we'll close this issue.