patcg-individual-drafts / topics

The Topics API
https://patcg-individual-drafts.github.io/topics/
Other
625 stars 235 forks source link

The Topics API

This document is an individual draft proposal. It has not been adopted by the Private Advertising Technology Community Group.


With the upcoming removal of third-party cookies on the web, key use cases that browsers want to support will need to be addressed with new APIs. One of those use cases is interest-based advertising.

Interest-based advertising (IBA) is a form of personalized advertising in which an ad is selected for the user based on interests derived from the sites that they’ve visited in the past. This is different from contextual advertising, which is based solely on the interests derived from the current site being viewed (and advertised on). One of IBA’s benefits is that it allows sites that are useful to the user, but perhaps could not be easily monetized via contextual advertising, to display more relevant ads to the user than they otherwise could, helping to fund the sites that the user visits.

Specification

See the draft specification.

The API and how it works

The intent of the Topics API is to provide callers (including third-party ad-tech or advertising providers on the page that run script) with coarse-grained advertising topics that the page visitor might currently be interested in. These topics will supplement the contextual signals from the current page and can be combined to help find an appropriate advertisement for the visitor.

Example usage to fetch an interest-based ad:

// document.browsingTopics() returns an array of up to three topic objects in random order.
const topics = await document.browsingTopics();

// The returned array looks like: [{'configVersion': String, 'modelVersion': String, 'taxonomyVersion': String, 'topic': Number, 'version': String}]

// Get data for an ad creative.
const response = await fetch('https://ads.example/get-creative', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify(topics)
});
// Get the JSON from the response.
const creative = await response.json();

// Display ad.

The topics are selected from an advertising taxonomy. The initial taxonomy (proposed for experimentation) will include somewhere between a few hundred and a few thousand topics (our initial design includes ~350 topics; as a point of reference the IAB Audience Taxonomy contains ~1,500) and will attempt to exclude sensitive topics (we’re planning to engage with external partners to help define this). The eventual goal is for the taxonomy to be sourced from an external party that incorporates feedback and ideas from across the industry.

The topics will be inferred by the browser. The browser will leverage a classifier model to map site hostnames to topics. The classifier weights will be public, perhaps built by an external partner, and will improve over time. It may make sense for sites to provide their own topics (e.g., via meta tags, headers, or JavaScript) but that remains an open question discussed later.

Specific details

Privacy goals

The Topics API’s mission is to take a step forward in user privacy, while still providing enough relevant information to advertisers that websites can continue to thrive, but without the need for invasive tracking enabled via existing tracking methods.

What does it mean to take a step forward in privacy for Interest Based Advertising?

  1. It must be difficult to reidentify significant numbers of users across sites using just the API.
  2. The API should provide a subset of the capabilities of third-party cookies.
  3. The topics revealed by the API should be less personally sensitive about a user than what could be derived using today’s tracking methods.
  4. Users should be able to understand the API, recognize what is being communicated about them, and have clear controls. This is largely a UX responsibility but it does require that the API be designed in a way such that the UX is feasible.

Meeting the privacy goals

We expect that this proposal will evolve over time, but below we outline our initial interpretation of how we think this proposal relates to our stated privacy goals.

  1. It must be difficult to reidentify significant numbers of users across sites using the API alone.
    1. In order to meet this goal, it is important that only a trace amount of information that could be used for cross-site reidentification is returned. In other words, the data must be rate-limited. The Topics API uses three mechanisms to slow the rate of leakage:
      1. Different sites will receive distinct topics for the same user in the same week. Since someone’s topic on site A usually won’t match their topic on site B, it becomes harder to determine that they’re the same user.
      2. The topics are updated on a weekly basis, which limits the rate of information dissemination.
      3. And finally, some fraction of the time, a random topic will be returned for a given site for that week.
    2. Our initial analysis shows that the above mechanisms are effective. We expanded on this initial analysis in a peer-reviewed research paper appearing at SIGMOD 2023 where we formally study the risk of cross-site tracking in the Topics API.
  2. The API must not only significantly reduce the amount of information provided in comparison to cookies, it would also be better to ensure that it doesn’t reveal the information to more interested parties than third-party cookies would.
    1. In order to be a privacy improvement over third party cookies, the Topics API caller should learn no more than it could have using third-party cookies. This means the API shouldn’t inform callers about topics that the caller couldn’t have learned for itself using cookies. The topics that a caller could have learned about using cookies, are the topics of the pages that the caller was present on with that same user. This is why the Topics API restricts learning about topics to those callers that have observed the user on pages about those topics.
    2. Note that this means that callers with more third-party presence on sites the user visited will be more likely to have topics returned by document.browsingTopics().
    3. Also note that this implies that the API will generally be invoked from inside a third-party iframe, rather than inside the main frame of a page.
  3. The topics revealed by the API should be significantly less personally sensitive for a user than what could be derived using existing tracking methods.
    1. Third party cookies can be used to track anything about a user, from the exact urls they visited, to the precise page content on those pages. This could include limitless sensitive material. The Topics API, on the other hand, is restricted to a human curated taxonomy of topics. That’s not to say that other things couldn’t be statistically correlated with the topics in that taxonomy. That is possible. But when comparing the two, Topics seems like a clear improvement over cookies.
  4. Users should be able to understand the API, recognize what is being said about them, know when it’s in use, and be able to enable or disable it.
    1. By leveraging a human-readable taxonomy of topics, people can learn what is being said about them. They can remove topics they specifically don’t wish to see ads for, and there can be UX for informing the user about the API and how to enable or disable it. Further, topics are cleared when history is cleared.

Evolution from FLoC

FLoC ended its experiment in July of 2021. We’ve received valuable feedback from the community and integrated it into the Topics API design. A highlight of the changes, and why they were made, are listed below:

Privacy and security considerations

We consider the API to be a step toward improved user privacy on the web. It is, however, not perfect:

Open questions

This proposal benefited greatly from feedback from the community, and there are many ways to provide feedback on this proposal and participate in ongoing discussions, including responding on the linked issue in the repository or in a W3C group such as the Improving Web Advertising Business Group or the Private Advertising Technology Community Group. Some issues that we’d like to discuss:

  1. Should sites be able to set their topics, or should topics be determined by the browser or some third-party entity?
    1. If the client does it, where does the ML model come from? What data is it trained on?
    2. If the sites do it, might they pollute the algorithm by setting the topic to the most valuable?
  2. What should happen if a site disagrees with the topics assigned to it by the browser?
    1. Should there be a way to alter the assignment?
    2. Does mislabeling cause harm?
  3. What topic taxonomy should be used? Who should create and maintain it? How many topics should the taxonomy contain?
    1. Eventually it would be good if this was produced externally to the browser and became an industry standard.
    2. The taxonomy should be publicly available for transparency.
    3. If the number of topics increase, we’ll need to balance that with the ability of sites to observe topics (e.g., if there are more topics, there is less of a chance that an ad-tech has seen the chosen topic in the past).
  4. What standard might be used for determining which topics are sensitive?
    1. Should they be regional?
  5. How might the browser detect abusive usage of the API to keep the topic dissemination rate in line with expectations?
  6. Should sites receive historic topics every visit, or first visit only?
    1. For sites that users frequently visit there is no difference in privacy. For infrequently visited sites, this becomes a trade-off between topic dissemination rate and utility.
    2. How might one define “first visit”?
      1. It could be: does the site have any cookies or other storage for the user? If so, it’s not first visit.

This document is an individual draft proposal. It has not been adopted by the Private Advertising Technology Community Group.