w3c / machine-learning-workshop

Site of W3C Workshop on Web & Machine Learning
https://www.w3.org/2020/06/machine-learning-workshop/

Designing privacy-preserving ML APIs #90

Open anssiko opened 4 years ago

anssiko commented 4 years ago

Branching from the permission model discussion https://github.com/w3c/machine-learning-workshop/issues/72, I feel we should also discuss what the considerations are in designing privacy-preserving ML APIs.

Various aspects of privacy have been discussed in the following workshop talks:

Speakers are invited to provide their perspectives in this issue.

I'll open this issue with one concern discussed in this space: fingerprinting. This is a broader issue that is not specific to ML APIs but touches a wide variety of Web APIs.

Specific to ML APIs, to give an example, in https://github.com/webmachinelearning/webnn/issues/85 @kpu reported a possible fingerprinting vector that may allow determining the capabilities of the underlying hardware that executes ML operations. The community working on ML APIs for the web is committed to carefully evaluating all reported issues and to working with privacy experts to develop privacy-preserving ML APIs.
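To make that class of attack concrete (this is an illustration of timing-based fingerprinting in general, not a reconstruction of the specific report above): a page could time a compute-heavy operation and use the latency distribution as a hardware signal. A minimal sketch, where `runLargeMatmul` is a hypothetical stand-in for any expensive ML-API call:

```typescript
// Hypothetical sketch: timing an expensive ML operation to infer hardware class.
// runLargeMatmul() stands in for any ML-API call; it is not a real WebNN method.
async function hardwareTimingSignal(
  runLargeMatmul: () => Promise<void>,
): Promise<number> {
  // Warm up so JIT/driver initialization does not dominate the measurement.
  await runLargeMatmul();

  const samples: number[] = [];
  for (let i = 0; i < 10; i++) {
    const start = performance.now();
    await runLargeMatmul();
    samples.push(performance.now() - start);
  }
  // The median latency clusters by CPU/GPU model, contributing bits of
  // entropy to a fingerprint even without any explicit hardware-query API.
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)];
}
```

Mitigations applied elsewhere on the platform, such as coarsening timer resolution, reduce but do not eliminate this signal.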

In connection with #72, a subset of fingerprinting mitigations would benefit from a permission model that provides technical means to mitigate these threats, for example through system-level policies and/or user-consent mechanisms. In addition to a permission model, a number of other mitigations have been defined and specified to be applied on a case-by-case basis, giving implementers flexibility depending on the mitigations already in place in the stack below the browser, including the underlying libraries, OS, and hardware.
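As an illustration of how a permission gate could sit in front of a hardware-revealing capability, here is a minimal sketch; both the permission name "ml-acceleration" and the `navigator.ml` shape below are hypothetical, not anything that has been specified:

```typescript
// Hypothetical sketch of a permission-gated ML capability.
// Neither the permission name nor navigator.ml is a specified API.
async function requestAcceleratedContext(): Promise<unknown> {
  // The UA consults system-level policy and/or user consent behind this gate.
  const status = await navigator.permissions.query({
    name: "ml-acceleration" as PermissionName, // hypothetical permission name
  });
  if (status.state !== "granted") {
    return null; // fall back to a generic, low-entropy code path
  }
  // Only past the gate does the page learn anything hardware-specific;
  // until then, all origins observe identical behavior.
  return (navigator as any).ml?.createContext?.({ devicePreference: "gpu" });
}
```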

In the context of ML APIs, I believe the initial area of focus should be on active fingerprinting protections. I'll loop in @npdoty, an expert in the field and a member of the W3C's Privacy Interest Group (PING), to share PING's perspectives on fingerprinting, in particular for web graphics APIs, an area of active privacy research that shares similar characteristics with web ML APIs.

kpu commented 4 years ago

I run the project behind privacy-focused machine translation in Firefox; @XapaJIaMnu works for me. Our goal is to run machine translation locally to preserve privacy, compared to using a cloud service. For our project, we want the fastest possible implementation, because that is what will enable us to run in the browser and therefore preserve the privacy of more users.

Translation is computationally expensive, and we've put a lot of research and code optimization into making native code run in 8-bit precision fast enough on desktop CPUs, including a collaboration with two on-site Intel employees.

We can run in reasonable time natively, but the installation experience is much better as a pure web extension, which would imply running on web APIs. We cannot afford a 24x slowdown, which has drawn my interest to getting the WebNN APIs implemented, particularly 8-bit GEMM. These need to be as fast as possible, because speed is a crucial feature of the project.
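For readers outside the project: 8-bit GEMM is singled out because quantized inference replaces floating-point multiplies with integer ones that modern CPUs execute several times faster via SIMD (e.g. VNNI instructions on recent Intel chips), provided products are accumulated at higher precision. A minimal, unvectorized sketch of the arithmetic, not of any actual WebNN operator:

```typescript
// Minimal reference sketch of an 8-bit GEMM: C = A (m x k) * B (k x n).
// Inputs are quantized to Int8; products are accumulated in 32 bits to
// avoid overflow, which is the pattern fast native kernels implement
// in hardware.
function gemmInt8(
  a: Int8Array, b: Int8Array, m: number, k: number, n: number,
): Int32Array {
  const c = new Int32Array(m * n);
  for (let i = 0; i < m; i++) {
    for (let j = 0; j < n; j++) {
      let acc = 0;
      for (let p = 0; p < k; p++) {
        acc += a[i * k + p] * b[p * n + j]; // int8 * int8 fits in 16 bits
      }
      c[i * n + j] = acc; // 32-bit accumulation avoids overflow
    }
  }
  return c;
}
```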

So we're in the somewhat unusual position of (hopefully) being trusted enough to be a browser extension that can read and edit web pages, at which point fingerprinting is moot. But I understand that fingerprinting is an issue for use by web pages.

Jutta-Inclusive commented 4 years ago

My concern is with people who are highly unique, and whose data is therefore highly unique. They face two threats. The first is that data privacy protections don't work for them, because they can easily be re-identified, especially when a bad actor gets access to aggregate data. The second is that the data removed to anonymize a data set is often exactly the important data regarding their requirements. This is the situation for many people with disabilities: if you have a disability, you frequently need to ask for special treatment and thereby barter your privacy for essential services.

Jutta-Inclusive commented 4 years ago

@HelloFillip, I wonder what you think about cooperative data trusts as another alternative to edge-device or offline ML. These are owned and governed by the data producers; examples include midata.coop.

toreini commented 4 years ago

Hi, I wonder what the strategy for key management would be in a federated learning approach? Can you trust the user's hardware to do the encryption and store the keys?

HelloFillip commented 4 years ago

@Jutta-Inclusive Cooperative data trusts aren't an alternative to edge or offline ML, as data is still being held elsewhere, beyond the scoped storage needed for training or inference.

That's not to say they're bad (I think some of them are very useful if built with a zero-trust model), but they add a level of risk, complexity in personal data governance, and the potential to commodify personal data.

If you can isolate the data into secure data pods for each individual, that helps, but there's no alternative to edge/local processing of data.

HelloFillip commented 4 years ago

@toreini Inherently you have to trust the user's hardware. That's their decision and discretion.
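To make that concrete on the web platform: the Web Crypto API can generate keys marked non-extractable, so script can use the key but never read its raw bytes, and the trust boundary is exactly the user's browser/OS/hardware. A minimal sketch of the client-side half only (AES-GCM is an illustrative choice; what a real federated learning protocol would encrypt, and for whom, is out of scope here):

```typescript
// Sketch: a non-extractable key generated and held by the browser, used to
// encrypt a locally computed model update before it leaves the device.
async function encryptUpdate(update: Uint8Array): Promise<ArrayBuffer> {
  // extractable: false — script can use the key but can never read its bytes;
  // the trust boundary is the user's browser/OS/hardware, as noted above.
  const key = await crypto.subtle.generateKey(
    { name: "AES-GCM", length: 256 },
    /* extractable */ false,
    ["encrypt", "decrypt"],
  );
  const iv = crypto.getRandomValues(new Uint8Array(12)); // 96-bit nonce
  return crypto.subtle.encrypt({ name: "AES-GCM", iv }, key, update);
}
```

A non-extractable `CryptoKey` can also be persisted across sessions (e.g. in IndexedDB) without its raw bytes ever being exposed to the page.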

Jutta-Inclusive commented 4 years ago

@HelloFillip, agreed regarding cooperative data trusts and local, on-device processing. I wasn't thinking that co-ops would replace strategies such as local, on-device processing of data, but that they would be used in combination with them to achieve the determinations missing from personal data alone. Cooperative data trusts are often established to address minority topics, such as rare illnesses, that are either overwhelmed or ignored by other data analytics efforts. They are also less likely to follow the extractive patterns of large enterprise efforts highlighted in the work of Shoshana Zuboff and others. The co-op members can determine the privacy and security strategy, ensure it addresses their interests rather than corporate interests, and better avoid data overreach. In some instances the data trust members may contribute local, on-device-determined answers to queries to a trust for aggregate analysis.