patcg-individual-drafts / topics

The Topics API
https://patcg-individual-drafts.github.io/topics/

Topics calculation input data should remain local to the user's machine and the classification model should run locally #193

Closed chrisvls closed 1 year ago

chrisvls commented 1 year ago

Users use the browser to access sensitive data, some of which they are under legal obligations not to disclose to third parties. Some regulatory regimes require explicit permission for third parties -- even when acting as a sub-contractor to an authorized party.

The spec invites browsers to use any and all information about the document to identify topics. If this is local to the user's device and not visible to the browser's company, these requirements are not triggered.

If the browser is sending topic calculation input data back to the browser's servers then things get very complicated. The browser could be sending HIPAA, GDPR, business confidential or other controlled data back to the browser's company. The browser's company may have to notify all browser users of any sub-processors who also handle the data. And so on...

The spec should require that the topic classification input data stay local to the device and that the classification model runs locally.

chrisvls commented 1 year ago

More simply, if people hear that the topics API sends the content of all of the pages you browse back to Google, they will object. Assuming that is not how this is going to work, the spec should say so.

michaelkleber commented 1 year ago

The Explainer does indeed say "For each week, the user’s top 5 topics are calculated using browsing information local to the browser." And Chrome's developer-facing documentation on the API says "This information is recorded on the user's device" and "Chrome's implementation of the Topics API downloads a TensorFlow Lite file representing the model so it can be used locally on the user's device."

But I think you're conflating "What Chrome does" with "What a spec requires". Not all browsers are implemented the same way, and when writing a spec, you should not expect them to be. A browser like Opera Mini, which runs partly in a server operated by the browser maker, will surely do many things differently as a result.

chrisvls commented 1 year ago

Respectfully, I think it is the Topics API spec and explainers that are conflating the spec and the Google implementation. This issue is not limited to this aspect of the spec, as my other issues point out. Indeed, at section 16, the spec's own privacy section asserts that the taxonomy is "human-curated" but section 6 of the spec makes it explicit that that is not part of the spec, just part of the Google implementation. That's conflation.

The project makes privacy promises, some in the spec, some in the explainers. The spec should require an implementation that fulfills those promises. Or the promises should be withdrawn or appropriately caveated.

Local processing is one of those promises.

Opera Mini's privacy policy seems pretty clear that 1) they identify topics, 2) they do so by looking at "categories of web sites", 3) they serve ads in their own applications, 4) they do not share the topics with the third parties, 5) the content of web pages is not disclosed as data they process or collect for this purpose. So I don't see anything where they disclose that they analyze clear text of the content of a business application or Gmail email on their servers. Edit: But it does seem that they may possess it to render and compress pre-rendered pages, though they are explicit that they do not log the content when doing so.

If they do, they shouldn't be considered compliant with the Topics API spec. Or the explainers should stop promising that local processing will be part of the Topics API privacy safeguards. If we think it is ok for the browser company to read every page on the server, then we should say so.

michaelkleber commented 1 year ago

Hey @chrisvls If you're going to suggest things that should change in the spec drafting, could you please (1) update your GitHub profile to include your name and affiliation, and (2) be sure you've joined the PATCG, so that your contributions are covered by the W3C's IP release policy?

Regarding local computation, it seems like your goal here is expanding the spec to spell out things that the browser cannot do with browsing data. But I feel like that kind of protection is not really part of this particular spec, because the data in question here (pages the user has visited) exists in browsers already today, entirely outside of this work's purview. In other words, the spec spells out the steps the browser takes, and of course none of those steps involve sending browsing data to a server, but that doesn't rule out data being sent in some other way.

Hey @jyasskin: It seems like this discussion is moving in the direction of your W3C Privacy Principles work. Is there something that individual specs ought to do to invoke those protections explicitly? Or is that unnecessary (now or in the future, since that work is still in draft status)?

chrisvls commented 1 year ago

Will do.

As for your response on local computation, I am quite confused. First, limiting analysis to local computation wasn't my idea; it is part of how the privacy features of this spec are being marketed.

Second, this particular spec outlines many limitations on browser behaviors/data that exist in browsers already today. The browser could send 100 topics, recorded over months, to sites that have not observed them, with no randomization. The spec explicitly states that is outside the bounds of how the API should be implemented.

So I still don't see why it is ok to have the local processing limitation in the explainers and marketing but not in the spec.
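The kinds of limits described above (only a few recent topics, only for callers that observed them, with some randomization) can be sketched roughly as follows. This is a minimal illustration and not the spec's algorithm: the `Epoch` shape, the function name, and the default values for `maxEpochs`, `noiseProbability`, and `taxonomySize` are all assumptions made for the example.

```typescript
// Hypothetical per-week record kept by the browser.
interface Epoch {
  topTopics: number[];                  // top topic IDs for that week
  observers: Record<number, string[]>;  // topic ID -> callers that observed it
}

// Returns the topics a given caller would be allowed to see.
// Simplification: a real implementation would pick the per-epoch topic
// in a stable way per site rather than drawing it fresh on each call.
function topicsForCaller(
  epochs: Epoch[],
  caller: string,
  rng: () => number = Math.random,
  maxEpochs = 3,            // assumed: only the most recent epochs count
  noiseProbability = 0.05,  // assumed: chance of returning a random topic
  taxonomySize = 100,       // assumed: placeholder taxonomy size
): number[] {
  const result = new Set<number>();
  for (const epoch of epochs.slice(-maxEpochs)) {
    if (epoch.topTopics.length === 0) continue;
    if (rng() < noiseProbability) {
      // Randomization: replace the real topic with a random one.
      result.add(1 + Math.floor(rng() * taxonomySize));
      continue;
    }
    const pick = epoch.topTopics[Math.floor(rng() * epoch.topTopics.length)];
    // Only reveal a topic to callers that have already observed it.
    const seenBy = epoch.observers[pick] ?? [];
    if (seenBy.includes(caller)) result.add(pick);
  }
  return Array.from(result);
}
```

Each of these limits constrains what the browser may do with data it already holds, which is the same kind of constraint a local-processing requirement would be.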

jyasskin commented 1 year ago

Some of the discussion here reminds me of the Privacy Principles section on user agent duties, which says that UAs have to be trustworthy and put their user's interest first. That's a ground rule for UAs, which we assume in all their operation and don't bother specifying anew in each spec. If we did want each spec to refer to it, I think we'd do so by modifying https://infra.spec.whatwg.org/#user-agent, which all web specs do incorporate by reference.

On the rest of the discussion, the marketing and user documentation for a feature necessarily go past what the specification requires, if only because we don't specify UI for features, but the user documentation has to talk about the UI. There's also always tension between requiring more in the spec, and allowing UAs to innovate in the details of how they implement the spec. Remember that web specifications are contracts between user agents and websites, and that drives what appears in the specs. They're generally not contracts between users and UAs, even though users are heavily affected by what's in the spec. Because of that, I like to hold to the fuzzy line that if UA variation will make websites write different code for different UAs, we should probably specify it, and if it won't, then we probably shouldn't.

chrisvls commented 1 year ago

Thank you, @jyasskin, very helpful. I was not thinking about the spec as only describing the contract between site and user agent.

For the site, it is very helpful to think of user agents as just, well, just the agent of the user. Maybe the user uses Opera Mini and trusts Opera to see their data. That endpoint is my customer's responsibility. What I do doesn't impact their choices and vice versa.

The server side is clear, too. I am responsible for any third parties receiving any data. So I choose them carefully. We do a bunch of compliance and contractual work before adding a new third party, even when they receive very little data.

The Topics API changes this subtly -- possibly in a way the user thinks is safe.

But, if the browser is sending the whole document to third-party servers, well, that changes it radically.

Now I am sending a command to the user agent that may send the document, well, really any data -- including PII ("Welcome back, Joe Smith!") -- to a third-party server for topic identification. Now I need to think of that browser maker as virtually (or maybe actually) acting as a sub-processor. Do I know what data they are sending? Do I know where they are sending it? What are their security practices? Who are their sub-processors?

Things are no longer clear. My customer could well hold me responsible for having sent the data to the browser maker, since the data was sent at my request. I have no relationship with the browser maker to fall back on if something goes wrong. My site sent the command that sent the data, but my user picked the browser maker. Not clear.

So, to get back to your fuzzy line, I think that, if some browser makers receive data on their servers at my command, I would make sure to write user-agent-specific code to prevent that.

Given the spec in its current state, before my site calls the API, I would be responsible for knowing 1) what data each browser uses as topic calculation input data, 2) where they do the topic calculation, and 3) whether those change.

None of these concerns arises if the user's data stays on the user's machine. Or, I suppose, if the inputs for topic calculation are sufficiently constrained.
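A minimal sketch of what that user-agent-specific gating might look like. `document.browsingTopics()` is the entry point Chrome ships; everything else here (the `TopicsDocument` interface, the function names, and the idea of keying an allowlist on browser brand strings) is hypothetical and only illustrates the extra burden on the caller.

```typescript
// Structural stand-in for the parts of `document` used here, so the
// logic can be exercised outside a browser.
interface TopicsDocument {
  browsingTopics?: (options?: { skipObservation?: boolean }) => Promise<unknown[]>;
}

// True only when the API exists AND the browser is one whose topic
// calculation has passed the site's own (hypothetical) compliance review.
function shouldCallTopics(
  doc: TopicsDocument,
  brands: string[],
  approvedBrands: string[],
): boolean {
  const apiPresent = typeof doc.browsingTopics === "function";
  const approved = brands.some((brand) => approvedBrands.includes(brand));
  return apiPresent && approved;
}

// Calls the API only for approved browsers; otherwise returns no topics.
async function getTopicsIfApproved(
  doc: TopicsDocument,
  brands: string[],
  approvedBrands: string[],
): Promise<unknown[]> {
  if (!shouldCallTopics(doc, brands, approvedBrands) || !doc.browsingTopics) {
    return [];
  }
  return doc.browsingTopics();
}
```

In a page, `doc` would be `document` and `brands` would come from whatever user-agent detection the site already trusts. The allowlist would have to be re-reviewed whenever a browser changes its inputs or where it computes topics, which is item 3 in the list above.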

dmarti commented 1 year ago

@chrisvls That's a good point. From a California point of view, the user is almost certainly a party to some kind of software license and/or server terms of service with their browser vendor, but when the browser vendor processes the user's data at your request, then they become a service provider to you. In addition to the responsibilities you mention, there is a set of requirements for that contractual relationship (as covered in a helpful article from the Association of National Advertisers)

michaelkleber commented 1 year ago

For the site, it is very helpful to think of user agents as just, well, just the agent of the user. Maybe the user uses Opera Mini and trusts Opera to see their data. That endpoint is my customer's responsibility. What I do doesn't impact their choices and vice versa.

This is indeed how I was thinking about the whole situation. Why don't you think that point of view applies to the rest of your question?

As far as I know, there is nothing in any browser spec that requires the browser to only exist on a single computer in the possession of the user. So I can't see how a spec could include any requirement about where particular data is stored. But it seems like it shouldn't matter to you, for the same reason as the Opera Mini example.

But it seems like your concern is different: you're worried that you might be asking the browser to do something that you wish you had not asked it to do. That is exactly what the spec does offer you: Surely as the person calling the API, you're asking the browser to do the things that the spec describes. So if some particular browser's implementation happens to do something else, then surely it wasn't what you were asking.

chrisvls commented 1 year ago

Given the spec, I won't be allowed to ask the browser to do anything. I don't even think I would ask permission from my compliance group. See how I think that meeting would go, below.

As far as I know, there is nothing in any browser spec that requires the browser to only exist on a single computer in the possession of the user. So I can't see how a spec could include any requirement about where particular data is stored. But it seems like it shouldn't matter to you, for the same reason as the Opera Mini example.

As the HTML Standard states, "origins are the fundamental currency of the web's security model." The current standard offers the promise the document is only sent to the user agent, and that local storage and cookies may only be returned to their origin. Does the browser maker get to use that data when you call the Topics API? No limitation appears in the spec. Does the Topics API modify the origin promise by exempting the browser maker's servers? Seems like it.

Today, there is no command a website can send that says to the user agent "please, on my command, send data to a third-party service that I did not choose". The Topics API introduces a novel behavior, one that is outside the function of a user agent (regardless of where the user agent resides). So the Topics API spec will need to have novel scope.

you're worried that you might be asking the browser to do something that you wish you had not asked it to do. That is exactly what the spec does offer you: Surely as the person calling the API, you're asking the browser to do the things that the spec describes.

If what the spec describes is that, when I call the API, the user agent sends an unknown amount of data to a third party with whom I have no contract and who resides in an unknown jurisdiction, then indeed the Topics API is doing something I don't want it to do. The spec as is would not pass a compliance review in my shop.

Imagine two compliance meetings...

Compliance meeting scenario one:

I'd like to use the Topics API. If the user has allowed the Topics API, it means that other sites may get to see that our user is interested in the topic of our site, but only if they handle the same topic for our user. None of our data leaves the user's device. The only payload disclosed to third parties is the Topic ID and that doesn't include any PII. And the Topic itself is really vague. So it seems safe. We will only use the API on certain pages, but even if there is some PII in the cache or on the page itself, there's no risk of it going anywhere because when we call the API, it doesn't cause any of our data to leave the user's device.

Might be safe, we'll look into it.

Compliance meeting scenario two:

I'd like to use the Topics API. If the user has allowed the Topics API, it means that other sites may get to see that our user is interested in the topic of our site, but only if they handle the same topic for our user. But, if the user has allowed the Topics API, it means that the browser maker will send data to its servers.

Will the browser maker send the data if we don't call the API?

No. We get to decide.

Ok, then it is our responsibility to determine if it is safe and compliant to send that data to the browser maker's servers. What data does it send?

Really any data it chooses.

Could it access local storage? Or cookie data? What about PII in the user menu?

Maybe on local storage and cookies? Could leak some PII, like the name in the user menu, etc., if it sends the whole page.

Have you ever analyzed the compliance implications of sending what you put in local storage or cookies to third parties?

No.

Do we have a contract with the browser maker for compliance purposes? Do we have their security practices? Can we audit them?

No.

Do you know which jurisdiction the browser maker will be sending the data to? Do we know where the servers reside?

No.

It looks like you don't know if it is safe to call the API, so don't.

chrisvls commented 1 year ago

Sorry for the long post, especially when @dmarti has the succinct model for what I'm trying to say.

Calling the Topics API is my choice. It is a choice that makes the browser maker a service provider to me. That means I need to know what data it will receive, how it will process it, and where. It means I need compliance arrangements.

None of the above applies to a user agent that simply renders pages, even in the Opera Mini model.

dmarti commented 1 year ago

@chrisvls See also Privacy Sandbox initiative and Ad Manager

The use of Privacy Sandbox APIs is subject to Google’s EU User Consent policy requirements (for example, obtaining users’ legally valid consent for the collection, sharing, and use of personal data for ads personalization).

It looks like you will need to have the appropriate contracts in place in order to be able to obtain consent on behalf of a Topics API caller. (The way this is worded it looks like it applies to both server-side and on-device processing.) So far this page covers EU but not California or other opt-out jurisdictions.

I agree with @michaelkleber that from a W3C principles POV, the browser vendor should do their server-side processing in the interest of the user, and that from a conceptual POV the vendor's server can be considered as a peripheral of the device that the user is interacting with directly. But it does seem like it would be a good idea to make sure that things work with real world consent/objection/opt-out workflows that sites will be required to support.

chrisvls commented 1 year ago

The compliance review doesn't apply to on-device processing. It is triggered by my site sending a command that sends data to a third party, in this case the browser maker. Perhaps the spec needs a simple statement that, consistent with the HTML Standard, when you call the Topics API, the user agent does not disclose the Document to any third party not authorized by the origin. If this use case is important, then the API can offer an affordance to authorize such disclosure.

michaelkleber commented 1 year ago

Thanks for the interesting discussion, folks.

Chris, I sincerely appreciate the wish for the spec to say that data "remains local", but I fundamentally cannot see how a spec could do that. Terms in a spec need to be fully defined, and local is only a sensible concept if some spec defines something about where the browser is running. And given the diversity of computing platforms that exist (personal devices, virtual machines, distributed cloud clusters, my account on a unix cluster running NCSA Mosaic and storing data on a disk shared by thousands of people, etc.), this seems like something we don't have any way to do.

While we don't plan to make any change to the spec, it is true that the Chrome implementation of Topics does not cause the browser to send the contents of the page to any server: the ML model that decides what topics a page is about runs entirely locally. So I hope your conversation with your compliance department remains easy at least until some other browser implements the API.

dmarti commented 1 year ago

@michaelkleber It seems like the term "local" is the issue and there could be an alternate way to express it. Yes, you could run Google Chrome for Microsoft Windows, storing Topics API data to a user home directory on NFS, then connect to it with a remote desktop client on a laptop, making "local" hard to define -- but if the NFS server, the MS-Windows machine, and the laptop all belong to the same user it doesn't appear to raise the kind of problems covered in this issue.

It's not a technical issue -- it's about whether Topics API data is stored where it is subject to a "user agent" relationship with the user ( https://w3ctag.github.io/privacy-principles/#user-agents ), or on a device operated by a separate party, which would require a basis for processing and/or be subject to objection/opt-out/Rtk/RtD.

michaelkleber commented 1 year ago

Don, I agree that your approach is different, but that one seems to get into questions like "Who is the sysadmin on the computer you're using?"

From the specification point of view, the spec doesn't say anything about the kind of sharing you're worried about. So you (the caller of the API) are surely not asking the browser to do some kind of sharing, but I don't see how to write a spec that circumscribes the implementer's behavior in the way you want. From the implementation point of view, we can talk a lot about what Chrome actually does, but that was never in question.

chrisvls commented 1 year ago

Yes, thanks to @dmarti's help, in my last post I realized that it is not local that is the trigger, it is my action granting access to data to a third party. Generally, I can only do that with assurances from the third party to me. Is there a Topics API terms of service that states Google's obligations to the Topics API caller?

I think there should be a way of writing this into the specification because the spec as written is relaxing the same-origin policy of the HTML standard. There is a lot in the spec that limits the data shared from one site origin to another. But nothing about the data shared between the site and the implementer / the implementer's origin.

dmarti commented 1 year ago

It seems like this is the kind of problem that IAB TCF is intended to solve -- enabling one party to capture a record of consent in a way that can be shared with other parties. ( https://iabtechlab.com/standards/gdpr-transparency-and-consent-framework/ ) In order to make the Topics API spec itself simpler, it could cite TCF and require that if an implementer chooses to process topics data in a way that would require consent, the implementer is responsible for obtaining an ID in the Global Vendor List and implementing TCF. If a browser does not do processing that would require consent it would not need to implement TCF.

chrisvls commented 1 year ago

@michaelkleber Your formulation of "Who is the sysadmin on the computer you're using?" isn't quite it. My user can make choices to have an untrustworthy computer or user agent, and they are responsible for those choices. But if I choose to send the data to a third party by calling the API, then I'm responsible for that.

With third-party cookies, I control this. I have a list of commitments I collect from all of my vendors.

As written, the spec requires me to trust any and all implementers with all data available. So, there's no way I could call the API generically. I would always have to write user-agent-specific code about whether to call the API.

Is there a terms of service for the Topics API caller that I could give to my legal team from Google that would let them know that Google is committing to local processing of just hostnames? Does it have a defined length of time that I know the commitments are good for?

michaelkleber commented 1 year ago

But if I choose to send the data to a third party by calling the API, then I'm responsible for that.

I understand that, and if we were discussing an API that said "This API sends data to a third party" then I would understand your hesitancy. But every browser implements thousands of APIs. Why do you feel that Topics presents a heightened risk, when (a) the spec doesn't tell the browser to do what you're worried about, (b) the one implementation that exists doesn't do what you're worried about, and (c) the issue that you yourself opened is the only place where anyone has ever even discussed doing what you're worried about?

The Topics spec very clearly spells out the algorithms used in the API. Those steps do not involve sending any data to any third party. Therefore, if you call the API, you are not choosing to send data to a third party. "There’s glory for you," as Humpty Dumpty might say.

As written, the spec requires me to trust any and all implementers with all data available.

Again, this seems like a statement about browsers in general. It has nothing to do with Chrome and nothing to do with Topics.

Is there a terms of service for the Topics API caller[...]?

No — there is no Terms of Service between browsers and web sites. Indeed, the lack of such a ToS has been a core part of the web for as long as it has existed. App stores have APIs w/ToS, not web browsers.

I'm sorry, I'm very aware that I'm coming across as somewhat argumentative here. That's not my intent — we've been working on these APIs for years, and my goal has absolutely been to give developers what they want, when it's possible to do so without compromising user privacy. I just don't see how it is possible for any spec for any feature to make the kind of guarantee you're asking for.

dmarti commented 1 year ago

@chrisvls, sites are required to obtain a D-U-N-S number from Dun & Bradstreet and to enroll with Google in order to use Topics API and certain other features: attestation/how-to-enroll.md at main · privacysandbox/attestation · GitHub.

chrisvls commented 1 year ago

@michaelkleber

No worries about tone! Totally appreciate the response.

the issue that you yourself opened is the only place where anyone has ever even discussed doing what you're worried about

People haven't commented on server-based processing of all data accessible to the browser vendor, because Google has marketed this feature as 1) just using hostname and 2) analysis being done locally to the browser. Also, the spec is relatively new and lightly commented upon, as far as I know. I'm commenting on the difference between the explainer and the spec.

Local processing didn't wind up in the explainer by accident, it did because it's important. So the author of the explainer certainly thought about this thing that I'm worried about. (As did the commenters on the predecessor proposals, I think? Am not sure.)

And, the TAG did note the risk of security breaches exposing sensitive data, so, in addition to the explainer author, I'm not sure that I'm the only commenter that has envisioned that the feature would allow this.

As written, the spec requires me to trust any and all implementers with all data available. Again, this seems like a statement about browsers in general. It has nothing to do with Chrome and nothing to do with Topics.

I would suggest you gather a focus group of chief privacy officers, especially for third-party cookie users, perhaps including some who aren't in the publishing business, and become more familiar with their work. Ask them "if your site calls an API that sends, or may send, PII to a third party to analyze on their servers in any jurisdiction, do you think this is different from what your site authorizes browsers to do today? would you approve it? would you need to review the policies of the third party? would you need a contract with them? what if it was just local processing of just hostname?" The meeting I discussed above is how that would go in my shop.

At the risk of repeating myself, when I call an API that sends PII or may send PII to a third party to analyze that data on my behalf for my business purposes, I'm responsible for that. If my customer's sensitive data is stolen from that third party, they can come to me and say I should not have sent that data to that third party. Statutory requirements for GDPR and CCPA may apply. These are all things I do in a vendor-specific way when I send data via a third party cookie, so it shouldn't be a surprise they apply here.

None of these concerns arise when I call the other browser APIs, which brings us to...

No — there is no Terms of Service between browsers and web sites. Indeed, the lack of such a ToS has been a core part of the web for as long as it has existed. App stores have APIs w/ToS, not web browsers.

So, I have been trying to explain to you that the Topics API as spec'd breaks the current web browser model from the site's perspective. In the normal user agent model, I can rely on the same-origin policy, so I can reasonably assume that the data is only going to the agent my user has selected. If the user is ok with that being on a server, that is up to them. When I call an API or use a cookie, I am responsible for the data flowing to the third party. I have statutory and other obligations that I can carefully fulfill. With the Topics API as spec'd, there is no protection against a different origin receiving PII from the page, cached data, local storage, etc. With the Topics API as explained, I can be reasonably certain that there is no PII involved and I can even make an argument that Google has promised not to possess other data, absent other user consent.

I know you think I'm out in left field here in saying the Topics API goes outside the normal user agent model. But, not to be mean, the W3C TAG itself said that this is so far outside of the user agent model that you should, basically, just stop.

chrisvls commented 1 year ago

@michaelkleber

The Topics spec very clearly spells out the algorithms used in the API.

To quote the spec's definition of the topics calculation input data (section 6):

The attributes could be the document’s URL, the URL’s domain, the document node’s descendant text content, etc, as determined by the browser vendor.

I'm not sure that qualifies as "very clearly spells out." It is a blanket permission for the browser vendor to analyze any data by any algorithm. You have asked what does the spec say the API will do that I don't want? It is this blanket grant, which makes it highly unlikely I could call the API without a contract with the browser vendor.
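To make the contrast concrete, here is a minimal sketch of what a bounded classifier input could look like, as opposed to the open-ended attribute list quoted above. The `ClassifierInput` type and the function name are hypothetical and do not appear in the spec; the only point is that a hostname-only input cannot carry paths, query strings, or page text.

```typescript
// Hypothetical bounded input: the classifier sees the hostname and
// nothing else from the document.
type ClassifierInput = { hostname: string };

function boundedClassifierInput(documentUrl: string): ClassifierInput {
  // The URL parser drops path, query, and fragment once only the
  // hostname is kept.
  const { hostname } = new URL(documentUrl);
  return { hostname };
}
```

For example, `boundedClassifierInput("https://mail.example.com/inbox?user=joe")` yields only `mail.example.com`, so the "Welcome back, Joe Smith!" class of data never reaches the classifier.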

michaelkleber commented 1 year ago

I think your position is clear. The Topics spec allows browser vendors wide latitude to experiment with different ways of assigning topics, and you find that unacceptable. You are welcome to make a judgement call based on what a particular browser actually does, but there is no way for the spec to help you.

It's worth noting, one last time, that browsers have wide latitude in many other parts of their operation as well. One noteworthy example is the spec https://datatracker.ietf.org/doc/html/rfc6265#section-7.1:

User agents vary widely in their third-party cookie policies. This document grants user agents wide latitude to experiment with third-party cookie policies that balance the privacy and compatibility needs of their users.

chrisvls commented 1 year ago

That paragraph doesn't apply to how compliance reviews work. I'm sorry I have apparently failed to communicate that.

So, my takeaway is that Google's position is: "The Topics API protects privacy by preventing sites from cross-origin tracking so sites don't have access to third-party cookie data. But -- while we ourselves wouldn't do it -- we think calling the Topics API should authorize the browser vendor to read all data from the document, even normally private data, like cookie and local storage data."

Is this right?

michaelkleber commented 1 year ago

No, I would instead say: "Calling the Topics API asks the browser to make some decision about what topics this page is about (unless called with the skipObservation:true flag). Every browser gets to make its own decision about how to do that. In Chrome's case it involves a locally-executed computation that only looks at the host name."
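For readers unfamiliar with the flag mentioned here, a minimal sketch of a call that reads topics without asking the browser to treat the current page as an observation. `document.browsingTopics()` and its `skipObservation` option are part of the API as Chrome documents it; the `TopicsDocument` interface and the wrapper name are assumptions for the example.

```typescript
// Structural stand-in for the parts of `document` used here.
interface TopicsDocument {
  browsingTopics?: (options?: { skipObservation?: boolean }) => Promise<unknown[]>;
}

// Reads topics for the caller without having this page visit counted
// toward topic calculation or the caller's observed topics.
async function readTopicsWithoutObserving(doc: TopicsDocument): Promise<unknown[]> {
  if (typeof doc.browsingTopics !== "function") return [];
  try {
    return await doc.browsingTopics({ skipObservation: true });
  } catch {
    // The call can reject, for example when the feature is disallowed.
    return [];
  }
}
```

In a page this would be called as `readTopicsWithoutObserving(document)`.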

chrisvls commented 1 year ago

ok, I don't see those two statements contradicting... other than changing "should authorize" in my version to "may authorize"

One technical question: Am I right that the browser could use local storage and cookie data? It could defeat @jkarlin's advice to only call the API on pages with non-sensitive data and/or no PII.
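For reference, the advice mentioned above (keeping the API off sensitive pages) might be implemented on the server roughly like this. Chrome documents a `browsing-topics` Permissions-Policy feature that a page can disable; the path prefixes and the function name are hypothetical. Note that this only controls whether the API can be called on a page. It says nothing about which inputs a browser uses once the API is called elsewhere on the site.

```typescript
// Returns the header that disables the Topics API for sensitive paths,
// or null when the page may use the API.
function topicsPolicyHeader(
  path: string,
  sensitivePrefixes: string[],
): [string, string] | null {
  const sensitive = sensitivePrefixes.some((prefix) => path.startsWith(prefix));
  return sensitive ? ["Permissions-Policy", "browsing-topics=()"] : null;
}
```

A server would attach the returned pair to responses for paths such as a hypothetical `/account/`, and leave other responses unchanged.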

michaelkleber commented 1 year ago

I can't think of how local storage or cookie data would help answer the question of what this particular page is about.

chrisvls commented 1 year ago

You could find a cross-site ad network id, allowing you to draw correlations about pages from their common users.

My question was more technical as to how to read section 6 of the spec... would it allow that or not?

michaelkleber commented 1 year ago

The spec doesn't say anything about what data can be used, but I can certainly imagine some wording that bounds what is a reasonable range of input data to consider. (But if you don't mind, please open a new issue for that, since it's a quite different topic than what the bulk of discussion on this issue has been about.)

chrisvls commented 1 year ago

Oh, indeed, will open a new one. And thanks for sticking with this epic...

chrisvls commented 1 year ago

Closing now, having opened https://github.com/patcg-individual-drafts/topics/issues/211