w3ctag / design-reviews

W3C specs and API reviews
Creative Commons Zero v1.0 Universal

Early design review for the Topics API #726

Closed · jkarlin closed this 6 months ago

jkarlin commented 2 years ago

Braw mornin' TAG!

I'm requesting a TAG review of the Topics API.

The intent of the Topics API is to provide callers (including third-party ad-tech or advertising providers on the page that run script) with coarse-grained advertising topics that the page visitor might currently be interested in. These topics will supplement the contextual signals from the current page and can be combined to help find an appropriate advertisement for the visitor.
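
As a rough sketch of the intended shape of the API (the `document.browsingTopics()` entry point follows the explainer; exact return shape and availability vary by browser and version, so treat the details as illustrative):

```javascript
// Sketch of a third-party ad script requesting coarse-grained topics.
// document.browsingTopics() is the entry point described in the
// explainer; field names in the result are illustrative.
async function getAdTopics() {
  // Feature-detect so browsers without the API (or non-browser
  // environments) fall back to an empty list.
  if (typeof document === 'undefined' || !('browsingTopics' in document)) {
    return [];
  }
  try {
    // Resolves to an array of entries along the lines of
    // { topic: 3, taxonomyVersion: "1", modelVersion: "2" }.
    return await document.browsingTopics();
  } catch (e) {
    // The call can reject, e.g. when blocked by permissions policy.
    return [];
  }
}
```

An empty array here is deliberately ambiguous: it can mean the API is unsupported, disabled, or simply has nothing to return for this caller.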

Further details:

You should also know that...

This API was developed in response to feedback that we (Chrome) received on our first interest-based advertising proposal, FLoC. That feedback came from the TAG, other browsers, advertisers, and our users. We appreciate this feedback, and look forward to your thoughts on this API.

At the bottom of this issue are both the security survey responses and responses to questions from the TAG about FLoC, answered here in terms of Topics.

We'd prefer the TAG provide feedback as:

☂️ open a single issue in our GitHub repo for the entire review

Self Review Questionnaire: Security & Privacy

2.1. What information might this feature expose to Web sites or other parties, and for what purposes is that exposure necessary?

2.2 Do features in your specification expose the minimum amount of information necessary to enable their intended uses?

Yes. The entire design of the API is to minimize the amount of information about the user that is exposed in order to provide for the use case. We have also provided a theoretical (and applied) analysis of the cross-site fingerprinting information that is revealed: https://github.com/jkarlin/topics/blob/main/topics_analysis.pdf

2.3. How do the features in your specification deal with personal information, personally-identifiable information (PII), or information derived from them?

The API intentionally provides some information about the user to the calling context. We’ve reduced the ability to use this information as a global identifier (cross site fingerprinting surface) as much as possible.

2.4. How do the features in your specification deal with sensitive information?

Sensitive information exposure is reduced by only allowing topics in the taxonomy that Chrome and the IAB have deemed not sensitive (the topics in the proposed initial taxonomy are derived from the two organizations’ respective advertising taxonomies).

This does not mean that topics in the taxonomy, or groups of topics learned about the user over time, cannot be correlated with sensitive topics. This may be possible.

2.5. Do the features in your specification introduce new state for an origin that persists across browsing sessions?

The API provides some information about the user’s browsing history, and this is stored in the browser. The filtering mechanism, which provides a topic to a calling context only if that context has observed the user on a page about that topic in the past, also stores data. This could be used to learn if the user has visited a specific site in the past (which third-party cookies can do quite easily today), and we’d like to make that hard. There may be interventions that the browser can take to detect and prevent such abuses.

2.6. Do the features in your specification expose information about the underlying platform to origins?

No.

2.7. Does this specification allow an origin to send data to the underlying platform?

The top-frame site’s domain is read to determine a topic for the site.

2.8. Do features in this specification enable access to device sensors?

No.

2.9. Do features in this specification enable new script execution/loading mechanisms?

No.

2.10. Do features in this specification allow an origin to access other devices?

No.

2.11. Do features in this specification allow an origin some measure of control over a user agent’s native UI?

No.

2.12. What temporary identifiers do the features in this specification create or expose to the web?

The topics that are returned by the API. They are per-epoch (week), per-user, and per-site, and they are cleared when the user clears state.

2.13. How does this specification distinguish between behavior in first-party and third-party contexts?

The topic is only returned to the caller if the calling context’s site has also called the API on a domain about that topic with that same user in the past three weeks. So whether the API returns anything or not depends on the calling context’s domain.
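
The observer-based filtering described above can be modeled roughly as follows (a simplified sketch with hypothetical data structures, not Chrome's implementation):

```javascript
// Simplified model of the per-caller filter: a topic is returned to a
// caller only if that caller observed the user on a site about that
// topic within the last three epochs (weeks).
// observations: Map from caller domain -> Map from topic -> last epoch seen.
function topicForCaller(observations, caller, candidateTopic, currentEpoch) {
  const seen = observations.get(caller);
  if (!seen || !seen.has(candidateTopic)) return null; // never observed
  const lastSeen = seen.get(candidateTopic);
  // Only observations within the past three epochs count.
  return currentEpoch - lastSeen <= 3 ? candidateTopic : null;
}

// Example: adtech.example observed "Fitness" two epochs ago, so it
// passes the filter; a caller with no observation history gets nothing.
const obs = new Map([['adtech.example', new Map([['Fitness', 10]])]]);
console.log(topicForCaller(obs, 'adtech.example', 'Fitness', 12)); // "Fitness"
console.log(topicForCaller(obs, 'other.example', 'Fitness', 12));  // null
```

This is why the answer to the API depends on the calling context's domain, not just on the user.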

2.14. How do the features in this specification work in the context of a browser’s Private Browsing or Incognito mode?

The API returns an empty list in incognito mode. We feel that this is safe because there are many reasons that an empty list might be returned. e.g., because the user is new, because the user is in incognito, because the site has not seen this user on relevant sites with the associated topics in the past three weeks, because the user has disabled the API via UX controls.

This is effectively the same behavior as the user being new, so this is basically the API working the same within incognito mode as in regular mode. We could have instead returned random topics in incognito (and for new users) but this has the deleterious effect of significantly polluting the API with noise. Plus, we don’t want to confuse users/developers by having the API return values when they expect it not to (e.g., after disabling the API).

2.15. Does this specification have both "Security Considerations" and "Privacy Considerations" sections?

There is no formal specification yet, but the explainer goes into detail on the privacy considerations. The primary security consideration is that the API reveals information beyond third-party cookies: learning a topic means that the topic is one of the user's top topics for the week.

2.16. Do features in your specification enable origins to downgrade default security protections?

No.

2.17. How does your feature handle non-"fully active" documents?

No special considerations.

Responses to questions from the FLoC TAG review, as they apply to Topics

Sensitive categories

The documentation of "sensitive categories" visible so far are on google ad policy pages. Categories that are considered "sensitive" are, as stated, not likely to be universal, and are also likely to change over time. I'd like to see:

  • an in-depth treatment of how sensitive categories will be determined (by a diverse set of stakeholders, so that the definition of "sensitive" is not biased by the backgrounds of implementors alone);
  • discussion of if it is possible - and desirable (it might not be) - for sensitive categories to differ based on external factors (eg. geographic region);
  • a persistent and authoritative means of documenting what they are that is not tied to a single implementor or company;
  • how such documentation can be updated and maintained in the long run;
  • and what the spec can do to ensure implementers actually abide by restrictions around sensitive categories. Language about erring on the side of user privacy and safety when the "sensitivity" of a category is unknown might be appropriate.

A key difference between Topics and cohorts is that the Topics taxonomy is human-curated, whereas cohorts were the result of a clustering algorithm and had no obvious meaning. The advantage of a topics-based approach is that we can help to clarify which topics are exposed. For instance, the initial taxonomy we intend to use includes topics that are in both the IAB’s content taxonomy and Google’s advertising taxonomy. This ensures that at least two separate entities have reviewed the topics for sensitive categories. Assuming that the API is successful, we would be happy to consider a third-party maintainer of the taxonomy that incorporates both relevant advertising interests as well as up-to-date sensitivities.

Browser support

I imagine not all browsers will actually want to implement this API. Is the result of this, from an advertisers point of view, that serving personalised ads is not possible in certain browsers? Does this create a risk of platform segmentation in that some websites could detect non-implementation of the API and refuse to serve content altogether (which would severely limit user choice and increase concentration of a smaller set of browsers)? A mitigation for this could be to specify explicitly 'not-implemented' return values for the API calls that are indistinguishable from a full implementation.

The description of the experimentation phase mentions refreshing cohort data every 7 days; is timing something that will be specified, or is that left to implementations? Is there anything about cohort data "expiry" if a browser is not used (or only used to browse opted-out sites) for a certain period?

As always, it is up to each browser to determine which use cases and APIs it wishes to support. Returning empty lists is completely reasonable. Though a caller could still use the UA to determine if the API is really supported or not. I’m not sure that there is a good solution here.

In regards to the duration of a topic, I think that is likely to be per-UA.

In the Topics API, we ensure that each topic has a minimum number of users, by returning responses uniformly at random 5% of the time.
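
That noise step can be sketched as follows (a hypothetical helper; the real taxonomy and RNG details live in the implementation):

```javascript
// Sketch of the plausible-deniability noise: with probability 0.05 the
// API returns a uniformly random topic from the taxonomy instead of the
// real one, giving every topic a minimum base rate of users.
const TAXONOMY_SIZE = 349; // size of the proposed initial taxonomy

function topicWithNoise(realTopic, random = Math.random) {
  if (random() < 0.05) {
    // Uniformly random topic id in [1, TAXONOMY_SIZE].
    return Math.floor(random() * TAXONOMY_SIZE) + 1;
  }
  return realTopic;
}
```

Because any returned topic has at least a 5% chance of being random, no caller can be certain a given topic reflects the user's actual browsing.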

Opting out

I note that "Whether the browser sends a real FLoC or a random one is user controllable" which is good. I would hope to see some further work on guaranteeing that the "random" FLoCs sent in this situation do not become a de-facto "user who has disabled FLoC" cohort. It's worth further thought about how sending a random "real" FLoC affects the personalised advertising the user sees, when it is essentially personalised to someone who isn't them. It might be better for disabling FLoC to behave the same as incognito mode, where a "null" value is sent, indicating to the advertiser that personalised advertising is not possible in this case.

I note that sites can opt out of being included in the input set. Good! I would be more comfortable if sites had to explicitly opt in though. Have you also thought about more granular controls for the end user which would allow them to see the list of sites included from their browsing history (and which features of the sites are used) and selectively exclude/include them?

If I am reading this correctly, sites that opt out of being included in the cohort input data cannot access the cohort information from the API themselves. Sites may have very legitimate reasons for opting out (eg. they serve sensitive content and wish to protect their visitors from any kind of tracking) yet be supported by ad revenue themselves. It is important to better explore the implications of this.

The current plan is for the Topics API to return an empty list in incognito mode.

Sites opt in via using the API. If the API is not used, the site will not be included. Sites can also prevent third parties from calling the API on their site via permission policy.
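
As a sketch, a site that wants to keep embedded third parties from calling the API could send a permissions policy response header along these lines (Chrome's policy name for this feature is `browsing-topics`; check current documentation before relying on it):

```
Permissions-Policy: browsing-topics=()
```

The empty allowlist `()` disallows the feature for all origins on the page, including the top-level site itself.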

In regards to granular controls, we feel that this is possible with Topics (less so with FLoC) and expect to expose via UX the topics that are being returned, and allowing users to opt out of the API completely or disable individual topics.

The API is designed to facilitate ecosystem participation - as calling the API is both the way to contribute and receive value from the API. We do not want sites to be able to get topics without also supporting the ecosystem.

Centralisation of ad targeting

Centralisation is a big concern here. This proposal makes it the responsibility of browser vendors (a small group) to determine what categories of user are of interest to advertisers for targeting. This may make it difficult for smaller organisations to compete or innovate in this space. What mitigations can we expect to see for this? How transparent / auditable are the algorithms used to generates the cohorts going to be? When some browser vendors are also advertising companies, how to separate concerns and ensure the privacy needs of users are always put first?

The Topics API helps to address broad, granular topics-based advertising. For more niche topics, we suggest the use of alternative sandbox APIs like FLEDGE. In terms of transparency, the API is implemented in open source code, the design is occurring on GitHub with an active community, and the ML model used to classify topics will be available for anyone to evaluate.

Accessing cohort information

I can't see any information about how cohorts are described to advertisers, other than their "short cohort name". How does an advertiser know what ads to serve to a cohort given the value "43A7"? Are the cohort descriptions/metadata served out of band to advertisers? I would like an idea of what this looks like.

With Topics, the Taxonomy name is its semantic meaning.

Security & privacy concerns

I would like to challenge the assertion that there are no security impacts.

  • A large set of potentially very sensitive personal data is being collected by the browser to enable cohort generation. The impact of a security vulnerability causing this data to be leaked could be great.

In Chrome, the renderer is only aware of the topic for the given site. The browser stores information about which callers were on each top-level site, and whether the API was called. This is significantly better than the data stored for third-party cookies.

  • The explainer acknowledges that sites that already know PII about the user can record their cohort - potentially gathering more data about the user than they could ever possibly have access to without explicit input from the user - but dismisses this risk by comparing it to the status quo, and does not mention this risk in the Security & Privacy self-check.

The Topics API, unlike FLoC, only allows a site to learn topics if the caller has observed the user on a site about that topic. So it is no longer easy to learn more about the user than they could have without explicit input from the user.

  • Sites which log cohort data for their visitors (with or without supplementary PII) will be able to log changes in this data over time, which may turn into a fingerprinting vector or allow them to infer other information about the user.

Topics is more difficult to use as a cross-site fingerprinting vector due to the fact that different sites receive different topics during the same week. We have a white paper studying the impact of this: https://github.com/jkarlin/topics/blob/main/topics_analysis.pdf Logging data over time does still increase knowledge about the user however. We’ve limited this as much as we think is possible.

  • We have seen over past years the tendency for sites to gather and hoard data that they don't actually need for anything specific, just because they can. The temptation to track cohort data alongside any other user data they have with such a straightforward API may be great. This in turn increases the risk to users when data breaches inevitably occur, and correlations can be made between known PII and cohorts.

The filtering mentioned above (not returning the topic if it was observed by the calling context for that user on a site about that topic) significantly cuts down on this hoarding. It’s no longer possible for any arbitrary caller on a page to learn the user’s browsing topics.

  • How many cohorts can one user be in? When a user is in multiple cohorts, what are the correlation risks related to the intersection of multiple cohorts? "Thousands" of users per cohort is not really that many. Membership to a hundred cohorts could quickly become identifying.

There are only 349 topics in the proposed Topics API, and 5% of the time a uniformly random topic is returned. We expect there to be significantly more users per topic than there were in FLoC.

lknik commented 2 years ago

Is it possible to conduct a more formal leak-analysis?

We’ve reduced the ability to use this information as a global identifier (cross site fingerprinting surface) as much as possible.

jkarlin commented 2 years ago

Please see https://github.com/patcg-individual-drafts/topics/blob/main/topics_analysis.pdf for a more formal analysis.

jkarlin commented 2 years ago

Also, I'd appreciate your thoughts on whether this API belongs on document, navigator, or somewhere else. We chose document.browsingTopics() because the topics are filtered by calling context. But perhaps it should be on navigator since it's more about the state of the user's browsing history?

hadleybeeman commented 2 years ago

Hello! We discussed this at our W3C TAG breakout.

We are adding this to our agenda for our upcoming face-to-face in London, and we'll come back to this in more detail then.

jkarlin commented 2 years ago

Great, thanks for the update. Would it be useful for me to be present/available during that time?

cynthia commented 2 years ago

Retroactively: Yes.

jkarlin commented 1 year ago

It's been a while since our last presentation. I just wanted to bring to your attention two changes since then. Namely:

  1. Option to retrieve topics without modifying the state of which topics have been observed
  2. Option to send topics as part of fetch request headers. That pull request also mentions document requests, but that's still being debated.

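
The two changes can be sketched as follows (the option names `skipObservation` and `browsingTopics`, and the `Sec-Browsing-Topics` header name, follow the explainer and linked pull request; treat exact spellings as subject to change):

```javascript
// Sketch of the two changes above, as a caller might use them.
async function fetchAdWithTopics(adServerUrl) {
  // Change 1: read topics without recording this call as an
  // "observation" of the user for future epochs.
  const topics = await document.browsingTopics({ skipObservation: true });

  // Change 2: ask the browser to attach topics to the request as a
  // Sec-Browsing-Topics header, so the server can read them without
  // script access to the values.
  const response = await fetch(adServerUrl, { browsingTopics: true });
  return { topics, response };
}
```
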
torgo commented 1 year ago

Hi @jkarlin thanks for this. Apologies for the delay in giving further feedback. We discussed this in a breakout today. One concern we have is the risk that publishers might try to detect whether the user is using a browser with the Topics API loaded/enabled and attempt to deny service if the API is not implemented. This is similar to the issue discussed in your response to the security & privacy questionnaire about the API's behaviour in incognito mode. In both cases it feels like the result should be that the publisher should not be able to tell whether the Topics API is disabled or not implemented. Is this the case?

jkarlin commented 1 year ago

Thanks for the question. The API will return an empty response if: the user opts out, the user cleared relevant history, the user is in incognito mode, the user is new, the user is signed into Chrome with a child account, etc. As you can see, there are many reasons for which the user may have an empty response, meaning that it is not a clear signal of the user's state. Requiring that a user have topics would negatively impact a significant fraction of the site's traffic.

There are two reasons for returning an empty list instead of a random value. One is that it's much more understandable to the user that when the API is disabled/incognito/etc that the API is not sending anything about them (random or not). Second, the number of cases in which no topics are sent is not small. If we were to send random responses in those cases the signal to noise ratio would be significantly impacted.

torgo commented 1 year ago

Thanks @jkarlin for that clarification and thanks @annevk for the further info from the Webkit position. We're going to put this on the agenda again for early in the new year to discuss further

rhiaro commented 1 year ago

The intention of the Topics API is to enable high level interests of web users to be shared with third parties in a privacy-preserving way in order to enable targeted advertising, while also protecting users from unwanted tracking and profiling. The TAG's initial view is that this API does not achieve these goals as specified.

The Topics API as proposed puts the browser in a position of sharing information about the user, derived from their browsing history, with any site that can call the API. This is done in such a way that the user has no fine-grained control over what is revealed, and in what context, or to which parties. It also seems likely that a user would struggle to understand what is even happening; data is gathered and sent behind the scenes, quite opaquely. This goes against the principle of enhancing the user's control, and we believe is not appropriate behaviour for any software purporting to be an agent of a web user.

The responses to the proposal from Webkit and Mozilla highlight the tradeoffs between serving a diverse global population, and adequately protecting the identities of individuals in a given population. Shortcomings on neither side of these tradeoffs are acceptable for web platform technologies.

It's also clear from the positions shared by Mozilla and Webkit that there is a lack of multi-stakeholder support. We remain concerned about fragmentation of the user experience if the Topics API is implemented in a limited number of browsers, and sites that wish to use it prevent access to users of browsers without it (a different scenario from the user having disabled it in settings).

We are particularly concerned by the opportunities for sites to use additional data gathered over time by the Topics API in conjunction with other data gathered about a site visitor, either via other APIs, via out of band means, and/or via existing tracking technologies in place at the same time, such as fingerprinting.

We appreciate the in-depth privacy analyses of the API that have been done so far by Google and by Mozilla. If work on this API is to proceed, it would benefit from further analysis by one or more independent (non-browser-engine or adtech) parties.

Further, if the API were both effective and privacy-preserving, it could nonetheless be used to customise content in a discriminatory manner, using stereotypes, inferences or assumptions based on the topics revealed (eg. a topic could be used - accurately or not - to infer a protected characteristic, which is thereby used in selecting an advert to show). Relatedly, there is no binary assessment that can be made over whether a topic is "sensitive" or not. This can vary depending on context, the circumstances of the person it relates to, as well as change over time for the same person.

Giving the web user access to browser settings to configure which topics can be observed and sent, and from/to which parties, would be a necessary addition to an API such as this, and go some way towards restoring agency of the user, but is by no means sufficient. People can become vulnerable in ways they do not expect, and without notice. People cannot be expected to have a full understanding of every possible topic in the taxonomy as it relates to their personal circumstances, nor of the immediate or knock-on effects of sharing this data with sites and advertisers, and nor can they be expected to continually revise their browser settings as their personal or global circumstances change.

A portion of topics returned by the API are proposed to be randomised, in part to enable plausible deniability of the results. The usefulness of this mitigation may be limited in practice; an individual who wants to explain away an inappropriate ad served on a shared computer cannot be expected to understand the low level workings of a specific browser API in a contentious, dangerous or embarrassing situation (assuming a general cultural awareness of the idea of targeted ads being served based on your online activities or even being "listened to" by your devices, which does not exist everywhere, but is certainly pervasive in some places/communities).

While we appreciate the efforts that have gone into this proposal aiming to iteratively improve the privacy-preserving possibilities of targeted advertising, ultimately it falls short. In summary, the proposed API appears to maintain the status quo of inappropriate surveillance on the web, and we do not want to see it proceed further.

darobin commented 1 year ago

Quick question: according to this statement and to the following thread, Google believes it has further arguments relating to Topics that have not been taken into consideration by the TAG. Is the expectation that these arguments will be brought here so that the TAG may review them? What's more, if the TAG still finds Topics problematic after these arguments are brought forth, is the plan to withdraw the proposal, or to ship no matter what?

torgo commented 1 year ago

For avoidance of doubt – the TAG has not closed this review. We fully expect to continue the discussion and review further changes with the hope that the issues we've raised above will be addressed.

michaelkleber commented 1 year ago

Hello TAG, thank you for continuing to review and we appreciate the dialogue. Your response to this proposal, and the related responses from WebKit and Mozilla, make it clear that without changes or new information, Topics in its current form is not likely to gain multi-browser support or progress along the W3C standards track. In the long term, the Privacy Sandbox effort aims to converge with other browsers on APIs that we all agree are appropriate for the Web and useful for online advertising without cross-site tracking.

In the near term, however, Chrome is unable to remove third-party cookies (expected in 2024) without making some privacy-improving replacement technologies available. The Topics API will remain part of the collection of APIs that we expect the ads ecosystem to test during 2023 — and we hope the testing feedback we hear and the implementer experience we gain will be valuable contributions in future work towards cross-browser standards work in this space, however long that takes.

Regarding your comments on the current state of the API, we have some responses and some disagreements, and we would be happy to continue the conversation. Your ideas for iteratively improving the current API will be welcome, if you wish to provide them, even if our need to balance multiple interests prevents us from adopting your recommendations wholesale. We do appreciate the feedback and will always look to it for elements we could incorporate in the meantime. We would be particularly interested in any thoughts on how to modify our API now to ease a transition to a standards approach for interest-based advertising in the future, although it may be too early for this to be a relevant design question.

We understand the TAG is busy, and perhaps you are not interested in further review or discussion on this proposal until its status materially changes (e.g. we gain multi-implementer support, we get new data on utility from ad ecosystem testing, adoption of TEE processing changes the privacy infrastructure landscape, etc). If that is the case, we look forward to picking up a version of this discussion again in the future, likely sometime after Chrome has removed 3rd-party cookies.

Alternatively, if you want to continue discussion on the details of your review, we are happy to make the case for why we continue to feel that replacing 3rd-party cookies with Topics is a tremendous step forward in web privacy despite its trade-offs.

torgo commented 1 year ago

Thanks @michaelkleber just briefly: we're definitely interested in further discussion on the details which we hope can lead to improvements in the areas we've outlined.

michaelkleber commented 1 year ago

Great, thank you @torgo, we will get back to you — may take us a few weeks.

lknik commented 1 year ago

@michaelkleber

In the near term, however, Chrome is unable to remove third-party cookies (expected in 2024) without making some privacy-improving replacement technologies available. The Topics API will remain part of the collection of APIs that we expect the ads ecosystem to test during 2023

Interesting. Could you shed some light on why that is? Is it due to the competition proceedings or similar? If so, those arguments would be non-technical?

jkarlin commented 1 year ago

Hey folks. Thanks for the discussion so far. Specific responses inline:

The Topics API as proposed puts the browser in a position of sharing information about the user, derived from their browsing history, with any site that can call the API. This is done in such a way that the user has no fine-grained control over what is revealed, and in what context, or to which parties. It also seems likely that a user would struggle to understand what is even happening; data is gathered and sent behind the scenes, quite opaquely.

Note that the number of sites that can both call the API, and receive an unfiltered response, is quite small. This is because the caller would have to have observed the user on a site about that topic in the past to get through the filter. The vast majority of sites that can call the API will actually receive an empty list. For more details about this observer-based filtering, see this part of the explainer.

Both users and websites can opt out of the Topics API. Clearing any browsing history prevents those sites from affecting the user’s generated Topics. Generally speaking, UX is not part of specification discussion. That said, there is UX provided within Chrome settings to opt out of individual Topics that have been selected, and we’re looking into UX to opt out of any given topic preemptively. Your criticisms all apply to third-party cookies, but in each case Topics offers a very large step forward in understanding and control.

The responses to the proposal from Webkit and Mozilla highlight the tradeoffs between serving a diverse global population, and adequately protecting the identities of individuals in a given population. Shortcomings on neither side of these tradeoffs are acceptable for web platform technologies.

It is important to point out the underlying physics that we all must adhere to. Any proposal in this space (by any company) has some notion of a data leakage rate built in. This is true regardless of the choice of privacy mechanism. As time passes, the leakage is additive, and eventually a cross-site identifier can be derived. It’s a matter of how long it takes to get there. This point of view applies to WebKit's PCM and Mozilla + Meta's IPA proposals as well: every API here is about tradeoffs.

For the Topics API, our study suggests that it would take tens of weeks of revisiting the same two pages to re-identify the vast majority of users across those pages using only the data from the API. We consider that a substantial win in privacy compared to third-party cookies, where cross-site re-identification takes a single visit. We could use worst-case instead of average-case analysis (and crank up the random noise), but at a trade-off with utility. These types of analyses and trade-offs are what we expect to continue tuning going forward.

It's also clear from the positions shared by Mozilla and Webkit that there is a lack of multi-stakeholder support. We remain concerned about fragmentation of the user experience if the Topics API is implemented in a limited number of browsers, and sites that wish to use it prevent access to users of browsers without it (a different scenario from the user having disabled it in settings).

We’re interested in finding solutions to the use case, especially those that garner multi-stakeholder support. That said, the concerns you mention about browser fragmentation do not seem to have prevented similar privacy-related launches in Mozilla or WebKit that increased fragmentation. And a Chrome migration from third-party cookies to an API like Topics will bring browser behavior much closer together, not drive it further apart.

We are particularly concerned by the opportunities for sites to use additional data gathered over time by the Topics API in conjunction with other data gathered about a site visitor, either via other APIs, via out of band means, and/or via existing tracking technologies in place at the same time, such as fingerprinting.

If these sorts of covert tracking practices are in use, then the Topics API will not provide any new information at all — recall that any party that can recognize a person across the various sites in which the party is embedded already has a large superset of the information available to the Topics algorithm.

While extra correlations might be inferred beyond what the taxonomy provides, Topics has significantly better protections against inferring sensitive correlations than third-party cookies or alternative tracking technologies, like the fingerprinting possible across all browsers.

Further, even if the API were both effective and privacy-preserving, it could nonetheless be used to customise content in a discriminatory manner, using stereotypes, inferences or assumptions based on the topics revealed (e.g. a topic could be used, accurately or not, to infer a protected characteristic, which is then used in selecting an advert to show). Relatedly, there is no binary assessment that can be made of whether a topic is "sensitive" or not. This can vary depending on context and the circumstances of the person it relates to, as well as change over time for the same person.

These concerns are also discussed in our explainer. In the end, what can be learned from these human-curated topics derived from pages that the user visits is probabilistic, and far less detailed than what cookies can provide with precise cross-site identifiers. While imperfect, this is clearly better for user privacy than cookies. We understand that each user cares about different things, which is why we give users controls, including the ability to turn off certain topics or to turn off Topics entirely.

Giving the web user access to browser settings to configure which topics can be observed and sent, and from/to which parties, would be a necessary addition to an API such as this, and would go some way towards restoring user agency, but it is by no means sufficient. People can become vulnerable in ways they do not expect, and without notice. People cannot be expected to have a full understanding of every possible topic in the taxonomy as it relates to their personal circumstances, nor of the immediate or knock-on effects of sharing this data with sites and advertisers, nor can they be expected to continually revise their browser settings as their personal or global circumstances change.

The UX is still evolving here but we already have the ability for users to opt out of the API, and to opt out of individual topics. I generally expect users that have sensitivities to Topics to disable the API as a whole, rather than ferret out individual concerns. You seem to be taking this discussion from the perspective that third-party cookies simply do not exist on the web and that Topics is introducing these behaviors, whereas we’re considering the substantial gain in privacy from where we are with third-party cookies.

A portion of topics returned by the API are proposed to be randomised, in part to enable plausible deniability of the results. The usefulness of this mitigation may be limited in practice; an individual who wants to explain away an inappropriate ad served on a shared computer cannot be expected to understand the low level workings of a specific browser API in a contentious, dangerous or embarrassing situation (assuming a general cultural awareness of the idea of targeted ads being served based on your online activities or even being "listened to" by your devices, which does not exist everywhere, but is certainly pervasive in some places/communities).

I wouldn’t expect users to understand that probabilistic deniability is built into the privacy technology that we use today. That said, you seem to be suggesting that personalized advertising in general is bad because someone might look over the user’s shoulder or use their computer and the user might be embarrassed. I’d note that 1) sharing a computer has far greater embarrassment potential, 2) personalized advertising comes about in many ways (1p data, contextual data, inferences, geo ip, etc) and 3) personalized advertising is often wrong today even with the much more powerful third-party cookies.

I appreciate your feedback and remain open to suggestions you might have on how the API might improve.

Edit: I meant to attribute IPA to Meta + Mozilla but accidentally omitted Meta. Fixed.

martinthomson commented 1 year ago

These types of analysis and trade-offs are what we expect to continue tuning going forward.

It seems to me like this is where there is a large disconnect. Implicit in the Topics design is an assumption that this sort of trade-off has been agreed. PATCG has talked at some length about this, but that is a very narrow slice of the larger community, and very far from representative.

Reaching this conclusion is not natural. Assuming that we all agree that a trade-off is necessary is presumptuous. In part, addressing this assumption is why PATCG is chartered to produce a document that lays out the principles by which it guided its work. The question of whether or not to entertain trade-offs is the primary reason for that work item - at least from my perspective - because none of this work to support advertising can proceed without that foundation.

It might be that specific proposals (like this one) fail on grounds that are less about the principle, but more about execution. For instance, Topics proposes a weekly release of a small amount of data, with some amount of probabilistic protection. You can model that protection as $(\epsilon, \delta)$-differential privacy with $\epsilon\approx10.4$. Maybe the disagreement is about the amount of data release or the value of $\epsilon$. But that doesn't mean that we have consensus about whether there is a trade-off in the first place.
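For concreteness, here is one common way such an $\epsilon$ estimate can arise. This sketch is not from the thread: the 5% noise rate and 349-topic taxonomy are assumed parameters drawn from the explainer's initial design, and figures like $\epsilon\approx10.4$ depend on additional modeling choices (epochs, topics per epoch, the $\delta$ term) not reproduced here.

```latex
% Per-topic, per-epoch randomized-response bound. Assume the API returns
% the true topic with probability 1-p, and a uniformly random topic from
% a taxonomy of k topics with probability p. For any two topics t, t':
%
%   Pr[output = t | true = t]     (1-p) + p/k
%   --------------------------- = ------------- = 1 + (1-p)k/p = e^{epsilon}
%   Pr[output = t | true = t']        p/k
%
\varepsilon = \ln\!\left(1 + \frac{(1-p)\,k}{p}\right)
% With the assumed p = 0.05 and k = 349:
% epsilon = ln(1 + 0.95 * 349 / 0.05) = ln(6632) ~ 8.8
```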

This sounds like a subject where the TAG might add value. Not in terms of determining values for $\epsilon$ - no one can solve that problem - but in terms of convening a discussion in the community about this problem. Then perhaps we might have some better agreement about what those trade-offs might need to look like - if they exist at all.


p.s., Please don't forget Meta when it comes to IPA.

dmarti commented 1 year ago

It might be an oversimplification to consider Topics API purely in the advertising context. It could turn out to be applied more often in personalized pricing, learning management systems, law enforcement, or other areas where site operators want to categorize users by web history and can accept some noise.

michaelkleber commented 1 year ago

Hi Martin: I think we're largely in agreement here. The W3C clearly has not come to anything like agreement regarding the kind of trade-offs made in the Topics design. And indeed Privacy Sandbox includes an entirely separate proposal, FLEDGE, which makes a different set of design decisions here, for just that reason.

As I said above, "we hope the testing feedback we hear and the implementer experience we gain will be valuable contributions in future work" — and that work absolutely includes the sorts of principles discussions you're asking for.

If the TAG wants to pause this review until they / the PATCG / etc. come to some consensus on a higher-level position that would inform it, that's entirely reasonable. I feel like I already made a similar offer, and @torgo indicated a preference to keep the review going.

As @jkarlin's response indicates, we think the Jan 12 "initial view" reflects a variety of misunderstandings on specific implementation choices, and we strongly disagree with the conclusion that it "appears to maintain the status quo". That seems like worthwhile clarification, irrespective of what overarching principles we all agree to use to measure these sorts of proposals in the future.

michaelkleber commented 1 year ago

Hi @lknik, you asked about our need to offer some privacy-improving replacement technologies.

For a long discussion of this, please take a look at the blog post here: https://privacysandbox.com/news/working-together-to-build-a-more-private-internet.

From a POV more focused on web standards: as with any other non-backwards-compatible change to the web platform, we can only proceed with a deprecation and removal after considering the potential breakage. See https://www.chromium.org/blink/launching-features/#feature-deprecations for full details of the Blink process, but note for example that the guidelines ask "What is the cost of removing this feature?" and "What is the suggested alternative? There should be another way for developers to achieve the same functionality." A change that would cause most web sites to lose half of their revenue, without any privacy-improving alternative, is not compatible with our removal process.

You mention "the competition/etc proceedings", and certainly the Commitments that Google made to the UK's Competition and Markets Authority are part of our overall considerations. But our stance here always mirrored theirs: disrupting the advertising ecosystem without a reasonable privacy-improving replacement would harm too many parties — publishers, advertisers, technology providers, and people.

jkarlin commented 1 year ago

FYI a draft spec for the Topics API is available. Please let me know if you'd like me to create a separate spec review thread for it.

nightpool commented 1 year ago

Hi all, Google has publicly committed to bringing the Topics API live in July 2023.[0][1] Where does that leave this review? There has been (to my knowledge) no multi-stakeholder browser community agreement or consensus related to the Topics API. How does Chrome see this launch as fitting into their commitment to an open and standards-based web platform? Why hasn't there even been at least a public I2S thread about this feature, if Chrome has already publicly committed to shipping it in July?

cynthia commented 1 year ago

There will be an update on this soon.

As for how Chrome is treating this with regard to shipping in the broader web, this is the wrong venue to ask. I would suggest asking either the chromium-dev or blink-dev mailing list about that.

RByers commented 1 year ago

There is now an I2S on blink-dev. Input on the interop risk is welcome there.

atanassov commented 1 year ago

Reading through the i2s, the discussion here, and the minutes of our last call with Michael Kleber, one question that didn't come through clearly answered for me is: how can we prevent high-volume libraries/platforms on the web from causing broken interop? Feature detection will enable singling out UAs missing the feature, leading to a broken or, at worst, disabled experience. I'm sure this is covered somewhere but I couldn't find it.

michaelkleber commented 1 year ago

Any UA that wants to pass feature detection but not give out information could implement the API to return an empty set of topics every time. That's the intended behavior in Chrome when the user hasn't browsed much in recent epochs, or is in incognito mode, or has disabled the API, or when the current API caller happens to be excluded from seeing this page's actual topics because of per-caller topic filtering, etc. So it seems like a safe way to avoid the interop breakage risk.
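To make the "empty set" behavior concrete, here is an illustrative sketch (not from the thread) of how a caller could treat "API missing" and "API present but returning no topics" identically. The wrapper takes a Document-like object so the fallback paths are visible; in a real page you would pass the global `document`, whose `browsingTopics()` method is the page-facing entry point in the draft spec.

```javascript
// Hypothetical caller-side helper: resolve to a (possibly empty) topics
// array no matter which UA, mode, or policy the page is running under.
async function topicsOrEmpty(doc) {
  if (typeof doc.browsingTopics !== 'function') {
    return []; // UA does not implement the Topics API at all
  }
  try {
    // May legitimately resolve to []: incognito mode, user opt-out,
    // little recent browsing, or per-caller topic filtering.
    return await doc.browsingTopics();
  } catch (err) {
    return []; // e.g. the call is disallowed by Permissions-Policy
  }
}
```

Under this pattern, a UA that implements the API but always returns an empty set is indistinguishable, to the caller, from one whose user has simply opted out.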

martinthomson commented 1 year ago

More information regarding the privacy aspects: Interest-disclosing Mechanisms for Advertising are Privacy-Exposing (not Preserving) and On the Robustness of Topics API to a Re-Identification Attack. My initial read of both suggests that these could be construed either as vindicating Google's position or as repudiating it, depending on your perspective.

I'll reiterate points I've made before: the effect of platform changes on aggregate metrics only makes sense to the extent that the privacy impact is uniform or approximately uniform. Topics is very specifically individualized and so appears to be far from uniform (a position supported by both papers). Consequently, though the privacy impact of the API for most people might be modest, there are some for whom the effect is significant.

jkarlin commented 1 year ago

We are actively involved with the research community, presenting our research on the privacy properties of the Topics API in papers, reports, and workshop presentations. We are happy to see more external members of the research community engaging with this area.

We're protecting users against general tracking on the web by making it difficult or expensive to track users at scale. These papers show that we're successfully doing so with the Topics API. While the information gained per topic is not uniform, it is so much better than where we are today with third party cookies that we feel that it's a great step forward in protecting users while also funding the sites that they enjoy visiting.

plinss commented 1 year ago

The following comment has come out of TAG discussions this week:

First of all, thanks to @martinthomson for those pointers to two relevant papers.

We've continued to discuss this API across several calls this week. @cynthia also demonstrated the current implementation.

We remain concerned about the points recently raised about interop. Especially given the lack of multi-stakeholder buy-in for this API, how can we really protect against a future where advertising-based sites tell users they must switch to a browser that implements Topics? @michaelkleber you've said "Any UA that wants to pass feature detection but not give out information could implement the API to return an empty set of topics every time"; however, that still implies other UAs would be required to implement the API (at least minimally) when they might not otherwise do so, in order to mitigate privacy harms for their users - so there is a risk here.

We remain concerned about the ability of users to give meaningful consent for their interests to be calculated and tracked from their browsing activity. The spec says:

suggestion that user agents provide UX to give users choice in which Topics are returned

and refers to a "user preference setting" in several places.

We have inferred from this that users are able to disable particular topics in the settings, or the API as a whole, but we don't think that either of these potential configuration options are good enough to protect against potential privacy harms, particularly for marginalised groups. A person's status as vulnerable, at-risk, or marginalised can change over time, and this isn't something most people are necessarily aware of or paying attention to in their day-to-day web use, and nor is it reasonable to expect people to regularly review their browser settings with this in mind. Thus, "opt out of individual topics" is not sufficient to offer meaningful consent to being tracked in this way. Further, from what we have seen of the API as implemented so far, there are no user preference settings relating to specific individual topics. We raised this in our initial review, and don't feel it has yet been considered with the depth warranted.

This issue intersects with others; for example, the Webkit review points out that the topics list represents a western cultural context, and that the mechanism for classifying sites according to these categories is unclear. We understand from the spec that site classification is automated, based on the domain, but the mechanism for doing this remains opaque, and it is not clear there is any recourse for sites which are misclassified.

We saw in the current implementation that sites in a user's browsing history which do not call the Topics API were being classified under particular topics. We had been led to believe that sites opt-in to being classified by calling the API ("Sites opt in via using the API. If the API is not used, the site will not be included." in the initial review request), but perhaps we misunderstood, or this has changed. The spec refers to "site opt outs", although we weren't able to find how they do this in the spec (please could you point us to the right place if we missed it?).

Questions:

cynthia commented 1 year ago

As API-surface feedback was also promised on "document, navigator, or somewhere else", adding that to the review comment above. We briefly discussed this, and the current thoughts on where the API belongs are somewhat inconclusive.

While navigator might sound logical given that it will be exposing a lossy representation of the browsing history, this also implies it is global to the user agent - I'm not sure how that would hold in the long term. If there is a necessity to change the behavior so that the API is contextual (e.g. different topics based on the caller's origin), it would definitely be out of place. Also, there are a lot of things somewhat unnecessarily hanging off of navigator, so bloat would be another reason.

This leaves document as the natural location for access via the browsing context. One question on the API surface would be whether there would be a reason to access topics from a worker (e.g. for background/off-thread/SW-based bidding), in which case you would probably want to expose it to WorkerGlobalScope as well. We don't know if it would be a critical use case, but if the ad tax in the main thread can go down as a side effect of this, it would be worth considering.

hadleybeeman commented 1 year ago

Hi all. We've looked at this during our W3CTAG f2f. We are still hoping for replies to our previous two comments from @plinss and @cynthia. Any thoughts?

siliconvoodoo commented 1 year ago

Let's sum this up in very layman's terms: Topics = Google money. It's not in users' interest, nor should it be on the agenda of a moral society. We, the people, want an integrally anonymized internet. If your business model can't survive because you can't monetize the data of your visitors, go do something more useful for society. Stochastic plausible deniability is whitewashing of an otherwise dystopian behavior. The pretense that "studies" demonstrated a desire from users for targeted ads is built on the backs of respondents uneducated about the risks of identifiability, and about freedom of the web in general. And "improvement over cookies" is just sophistry, as explained by Brave's devs on their blog, I quote:

Google claims that these systems, [...], improve privacy because they’re designed to replace third-party cookies. The plain truth is that privacy-respecting browsers [...] have been protecting users against third-party tracking (via cookies or otherwise) for years now.

Google’s proposals are privacy-improving only from the cynical, self-serving baseline of “better than Google today.” Chrome is still the most privacy-harming popular browser on the market, and Google is trying to solve a problem they introduced by taking minor steps meant to consolidate their dominance on the ad tech landscape. Topics does not solve the core problem of Google broadcasting user data to sites, including potentially sensitive information.

jkarlin commented 1 year ago

Thanks for the feedback. I’ve added responses to both plinss and cynthia below:

Do you have a response to the points raised in Webkit's review?

They are similar in nature to what has already been brought up by TAG and discussed in this thread. If there are particular questions I’d be happy to respond.

Do you have any analysis or response to the papers that Martin pointed to?

Yes, please see my previous comment. To add to that, I think it’s important to understand that all of the papers are using different data sets with different modeling assumptions on the evolution of user interests, the number of users present, etc. Our own research utilized real user data, while the others understandably had to generate synthetic web traces and interests, which Jha et al. note may not be representative of the general population. Nonetheless, they all found that it took a large number of epochs to reidentify the majority of users across sites.

Please could you elaborate if it is in fact the case that all sites browsed by a user are included by default as input data for generating a user's topics list? If this is the case, what recourse is there for sites which are misclassified?

This is not the case. Only sites that call the API are included as input to generating the user’s topics list.

Can you clarify the situation with regard to definition of user preference / opt out?

Users can opt out of the API wholesale within Chrome's privacy preferences. They can also disable topics that have been selected. In the future, they will be able to preemptively remove topics.

Sites can choose not to use the API, in which case user visits to their site will not be included in topics calculation. Sites can further ensure that nobody on their site calls the API via permission policy.
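For concreteness, the site-level opt-out described here is the standard Permissions-Policy mechanism; the `browsing-topics` feature name follows the draft spec, though the exact syntax should be checked against it. A response header such as:

```http
Permissions-Policy: browsing-topics=()
```

disables the API for the page and every embedded frame, while `browsing-topics=(self)` would permit only same-origin callers.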

Have you considered dropping the part where topics are calculated from browsing history, and instead entirely configured by the user in their browser settings? This would be much closer to people being able to meaningfully opt in to targeted advertising, and would make several of the other concerns raised moot.

It’s been raised in our public meetings. Folks have raised multiple issues with such an approach. One is that user interests are dynamic, whereas settings are generally quite static. A second is that it seems like many users might not bother to configure this, even if doing so would improve their ads and the revenue of the sites they visit.

This leaves document as the natural location for access via the browsing context. One question on the API surface would be whether there would be a reason to access topics from a worker (e.g. for background/off-thread/SW-based bidding), in which case you would probably want to expose it to WorkerGlobalScope as well. We don't know if it would be a critical use case, but if the ad tax in the main thread can go down as a side effect of this, it would be worth considering.

Excellent, thanks for that guidance. It seems reasonable to expose the API to WorkerGlobalScope, but I don’t think it would alleviate any main-thread costs, as the browsingTopics call itself is asynchronous and efficient. If developers start to ask for it, then we can consider adding it more seriously.

siliconvoodoo commented 1 year ago

What happens when you visit the Chinese embassy's website and they decide they don't like your topics, making your visa difficult or impossible to obtain? Or the USA, for that matter; it happens regularly: https://techcrunch.com/2019/09/02/denied-entry-united-states-whatsapp/.

jkarlin commented 1 year ago

@siliconvoodoo your hypothetical doesn't make sense. If the authorities were looking at your browser, surely they would be far more interested in your actual browsing history (readily available in the browser) than your topics? And if you cleared your history, then your topics would be cleared too.

Edit: Ah, I was looking at the article you linked to about phones being scanned and missed the first part about the website. In the website case, said website would a) have to have a third-party on it that observed you on such a site and is willing to share that information, b) that topic could very well be noise, c) the taxonomy is coarse grained with highly sensitive topics removed, and finally, compared to third-party cookies (what Chrome is trying to deprecate), topics conveys tiny amounts of information.

dmarti commented 1 year ago

@jkarlin Governments have a limited number of secret police hours to work with. Not all citizens and visitors can be fully observed at all times. Governments will be able to use a lightweight remote screening system like Topics API to identify people for further, more resource-consuming attention like a full device search. Clearing Topics API data or using a browser without Topics API turned on could also be a factor in selection. And the set of possible callers is big enough that we don't know in advance which callers will be owned by, or have a data sharing agreement with, which governments.

The Topics API taxonomy is free of obviously sensitive topics, but it can still encode sensitive information (such as people who like music A and food B in country C).

siliconvoodoo commented 1 year ago

@jkarlin Your argument is trying to justify gas burning because coal is worse, when I'm telling you to go nuclear. It's a sort of tu quoque fallacy. The problem is systemic; don't compartmentalize it into pieces to find ad-hoc ways to give incompatible whataboutisms in each case. Surely you must understand that an authority directly looking at your device is one situation, which must be fought, a la Apple versus FBI. But that is not the one I'm worried about with Topics; that would be remote mass profiling. The surface of attack against individuals just keeps being magnified; third-party cookies are not a standard of reference, as the Brave blog explained. There are enough NGOs alerting us to our predicament: Big Brother Watch, La Quadrature du Net, Snowden; fictions: Black Mirror, Brave New World... I can't understand how you can willingly participate in implementing pathways that enable dystopias, instead of pushing for a society with more safety nets against what's coming. Why are you not aiming at Tor-like anonymity for all? No cookies, no Topics, fingerprinting jamming, IP spoofing... Surely you've noticed alt-right horrors becoming mainstream; you must be able to picture what fascist powerhouses a la 1984 are becoming enabled to do with all the technology that we provide them. Immigration officers are not motivated to take those jobs because they have nothing else to do; it's because they love the power to be nationalist right-wingers and deny brown-skinned people entry on fake excuses. In Russia it will be because you have gay topics. In China, because you visited Uyghur activists' sites. In Iran, because you have feminist interests... They don't have to access any device; they will have your profile in the database, gathered and refined any time you visit an endpoint controlled by the agencies. The more instruments you provide, the more fascist society veers, and the more risk you expose us to: citizen scores, unjust incarceration, visa denials, lynching, executions or worse.

jyasskin commented 1 year ago

There are two sides of the Topics API: the interface it exposes to pages to tell them what topics a user is probably interested in, and the interface it exposes to users to figure out or guess what topics they're actually interested in. The interface with pages is the traditional realm of web standards and involves a bunch of tradeoffs around the rate that pages can identify users, which @martinthomson has focused on above.

On the other hand, the interface with users is not generally something that we standardize or specify, instead giving user agents wide freedom to do what's best for their users, even if that's very different from what other UAs do. There are some limits here—if pages need to adapt to particular UI, it may be worth constraining the variation—but I don't think Topics falls into that category, and I suspect that the Topics spec actually has too much normative text specifying the user-facing part of its behavior.

Unfortunately, a large fraction of the TAG's review that @plinss recounted focuses on the particular UI that Chrome plans to ship, rather than the question of whether UAs have the freedom to do the right thing for their users. The TAG suggests that many users would appreciate if their interests were "entirely configured by the user in their browser settings", and I agree. As far as I can see, this UI is completely supported by the Topics API and would require no changes to the page-facing API or page behavior. Whether or not Chrome initially ships that UI, other browsers can do so, and Chrome could switch to it in the future. If I'm wrong, and that UI would require changes to the page-facing API, that would be a really good thing to point out soon, so that Chrome can ship a more-compatible API instead.

chrisvls commented 1 year ago

There are a few places where some of the assurances described in the beginning of this TAG discussion (quite a while ago now!), and even some more recently, don't quite track what is in the spec.

To return for a moment to the assertion that a billion topics would mean no privacy loss because only five may be eligible for reporting out to sites.

Finally one question one might ask: why comment on a spec when there seems so small a chance of cross-browser implementation?

As an enterprise customer of Google Workspace and Chrome, I am already subjected to small, creeping changes to the interpretation of the terms of service – and updates to those terms that are difficult to opt out of. So, even if Chrome is the only full implementer, I would rather see the critical privacy promises in a draft spec so that they stick for longer.

Also, it is really important that implementations match their marketing. There are big implications for the web as a whole if the most popular browser can market a feature as "only local calculation of coarse-grained topics" when we decide to opt in, but then, since they don't think it is a big deal, change that over time.

shivanigithub commented 1 year ago

FYI, Chrome plans to start gating topics API invocation behind the enrollment and attestation mechanism. (explainer, spec PR)

plinss commented 6 months ago

To summarize and close this review, we note that there are some disagreements about goals here that underpin the disconnect.

The goals you have set out in the explainer are:

  • It must be difficult to reidentify significant numbers of users across sites using just the API.
  • The API should provide a subset of the capabilities of third-party cookies.
  • The topics revealed by the API should be less personally sensitive about a user than what could be derived using today’s tracking methods.
  • Users should be able to understand the API, recognize what is being communicated about them, and have clear controls. This is largely a UX responsibility but it does require that the API be designed in a way such that the UX is feasible.

The set of goals also implicitly compares the privacy characteristics of this API to the web with third-party cookies (and tracking). In the spirit of "leaving the web better than you found it," we would like to see the design goals achieved whilst also preserving the privacy characteristics of the web without third-party cookies.

We do acknowledge that you have arguably achieved the 4th goal, with an API that does not actively prevent the user from understanding and recognizing what is being communicated about them. However, the implicit privacy labour that would be required to manage this set of topics on an ongoing basis remains a key question.

Finally, we challenge the assertion that reidentification in the absence of other information is the right benchmark to apply. As we previously noted, the potential for this to affect privacy unevenly across different web users is a risk that is not adequately mitigated.