[Spec] Provide reasonable limits on the data that can be included in topics classification input

chrisvls commented 1 year ago

Site developers need to decide when to call the Topics API. They face differing statutory and business requirements when their site grants third parties access to data that may contain PII, indicators of sensitive topics, or cross-site or other identifiers.

To address these concerns today, site developers use technical means to ensure that the use of third-party cookies does not expose such information to third parties. Alternatively, when they choose to have the site send sensitive data to third parties, they face statutory and other requirements, including signing a contract with the third party and disclosing the identity of the third party to their users.

As currently spec'd, the Topics API makes it difficult for the site developer to fulfill these obligations. Section 6 of the spec places no specific limits on the Topics calculation input data. This makes it very difficult for the site developer to use technical means to control the data exposed to the browser vendor. The spec could also be read as exposing data that developers have always assumed would be private to authorized origins, like cookie or local storage data. This would require site developers to re-evaluate past choices about where data resides. Also, it is not practical to use the normal alternative, contractual arrangements, to cover these requirements, as the site would need an agreement with every implementing browser.

Ideally, the spec adds reasonable limits on the data exposed to the browser vendors, so that the topics calculation input data:

- is easily understood by the site developer and, in turn, the user
- has a limited default scope, such as host names and page metadata
- can optionally be expanded or limited by the developer
- does not contain data from local storage or cookies unless specifically allowed
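To make the four requested properties concrete, here is a minimal sketch of how a browser might assemble the classification input. It is not taken from the spec: `PageData`, `InputField`, `DeveloperPolicy`, `DEFAULT_FIELDS`, and `classificationInput` are all hypothetical names, and the rule that a deny entry wins over an allow entry is an assumption.

```typescript
// Hypothetical shapes for illustration; none of these names are in the Topics spec.
interface PageData {
  url: string;
  title: string;
  metaDescription: string;
  cookies: string;
  localStorage: { [key: string]: string };
}

type InputField =
  | "host" | "path" | "title" | "metaDescription" | "cookies" | "localStorage";

// Limited default scope: host name and page metadata only.
const DEFAULT_FIELDS: InputField[] = ["host", "title", "metaDescription"];

interface DeveloperPolicy {
  allow?: InputField[]; // fields the developer explicitly adds
  deny?: InputField[];  // fields the developer explicitly removes
}

function classificationInput(
  page: PageData,
  policy: DeveloperPolicy = {}
): Partial<Record<InputField, string>> {
  const parsed = new URL(page.url);
  const values: Record<InputField, string> = {
    host: parsed.host,
    path: parsed.pathname,
    title: page.title,
    metaDescription: page.metaDescription,
    cookies: page.cookies,
    localStorage: JSON.stringify(page.localStorage),
  };

  const requested = DEFAULT_FIELDS.concat(policy.allow ?? []);
  const denied = policy.deny ?? [];

  // Cookies and local storage are never in the default set, so they only
  // appear when the developer lists them in `allow` (assumed: deny wins).
  const input: Partial<Record<InputField, string>> = {};
  for (const field of requested) {
    if (denied.indexOf(field) === -1) {
      input[field] = values[field];
    }
  }
  return input;
}
```

Called with no policy, the function returns only the host, title, and meta description. A developer could narrow that further with `{ deny: ["title"] }` or widen it with `{ allow: ["path"] }`.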

chrisvls commented 1 year ago

@michaelkleber Here you go... thanks.

michaelkleber commented 1 year ago

Thanks @xyaoinum, the addition in #212 looks good to me. @chrisvls does this meet your goals?

dmarti commented 1 year ago

Why the document's [=Document/URL=] and not the [=host=] of the {{Document}}'s [=Document/URL=]?

Full URLs can contain just as many informative and possibly sensitive keywords as page titles, among other reasons because it's an SEO best practice to put keywords in the URL path.
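As a small illustration of the difference, the sketch below compares what a classifier would see from the host alone and from the full URL. The URL is made up for the example, and `pathKeywords` is a hypothetical helper, not part of any spec or browser API.

```typescript
// Hypothetical URL, chosen only to illustrate the point.
const pageUrl = new URL(
  "https://health.example/conditions/diabetes-treatment-options?ref=newsletter"
);

// Host-only input: identifies the site, not the specific page.
const hostOnly: string = pageUrl.host;

// Full-URL input: path segments often carry SEO keywords.
function pathKeywords(url: URL): string[] {
  return url.pathname
    .split(/[\/\-_]+/)                  // split on slashes, hyphens, underscores
    .filter((part) => part.length > 0); // drop the empty leading segment
}

console.log(hostOnly);              // the host alone
console.log(pathKeywords(pageUrl)); // the extra keywords a full URL exposes
```

The path yields page-level keywords that the host never reveals, which is the same kind of signal a page title would carry.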

Related: https://github.com/patcg-individual-drafts/topics/issues/118 (closed, decided not to use page title)

michaelkleber commented 1 year ago

The question of what information should be used in figuring out a page's topics is a very interesting one, where we don't feel like there is only one "right answer". Rather, it's a case appropriate for experimentation by different implementers with different algorithms. Chrome's current behavior is to just use the host, and even though we decided not to use page title by default right now, it's a good subject of ongoing thinking and discussion.

But Chris was looking for some outer-limit constraints that we can put in the spec and that we think should apply to all implementations, particularly to address the fear that some implementation might use information that seems unreasonable, like stuff beyond the current page. I think Yao's change does a good job of making clear what space of ideas seems worth exploring vs. what seems out of bounds.

chrisvls commented 1 year ago

Thanks @xyaoinum and @michaelkleber ... I do think that scoping the input data to the URL and metadata is reasonable.

As raised in #188, these may include sensitive data in some cases. But they are under the control of the site developer, which would allow the developer to choose not to call the API on a page where that would be bad. From a general PII perspective, this is very helpful.
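One way a site could exercise that control is a guard that skips the API call on pages it considers sensitive. This is only a sketch: `SENSITIVE_PREFIXES`, `mayCallTopics`, and `withTopicsGuard` are hypothetical names, and the prefix list is an assumption about how a site might classify its own pages.

```typescript
// Hypothetical list of path prefixes this site treats as sensitive.
const SENSITIVE_PREFIXES: string[] = ["/account", "/health", "/checkout"];

// Returns true when the page's path is outside every sensitive prefix.
function mayCallTopics(
  pageUrl: string,
  sensitivePrefixes: string[] = SENSITIVE_PREFIXES
): boolean {
  const path = new URL(pageUrl).pathname;
  return !sensitivePrefixes.some(
    (prefix) => path === prefix || path.indexOf(prefix + "/") === 0
  );
}

// The actual call is injected so the sketch runs outside a browser.
// In Chrome the callback could wrap `document.browsingTopics()`.
function withTopicsGuard<T>(pageUrl: string, callTopics: () => T, fallback: T): T {
  return mayCallTopics(pageUrl) ? callTopics() : fallback;
}
```

Matching on whole path segments means `/healthy-recipes` is not treated as part of `/health`. A real site would likely drive this from its own content classification rather than a hard-coded list.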

That said, I don't know the general publisher use case that well. Do publishers need to be able to broadcast sensitive keywords for SEO but suppress exposing them to the Topics API? Perhaps @dmarti or @jkarlin can weigh in, per the discussion in #118 as to whether this should go further to allow the site to explicitly grant or deny data beyond hostname, etc.

The change has two interesting qualifiers... "by default" and "unless specifically allowed"... in the parlance of the spec, I am not sure what would override the default or who would be doing the allowing. I think for the general privacy case, the proposed change is fine as-is. If the user chooses to override the default or grant a specific permission, that is their business. If the site does not wish to broaden access, then that is under the site's control.

Thanks!

patmmccann commented 1 year ago

I think #118 was closed because, for example, a cross-origin frame typically does not have access to any of the document metadata unless the publisher provides it to the third-party origin, e.g. via a post message into the frame or via query string parameters.

By extension, it seems the concerns that led to the closing of #118 can be addressed by simply ensuring publishers opt into the extra information being passed to the topics caller, so that callers receive no net new information compared with the current state of the internet, as it appears @chrisvls suggests above.

If a permission policy governs access to the incremental information, then we don't have to worry about what information is leaked to callers, because we know every caller would be explicitly green-lighted, and no caller would get access to information it could not already get today through a similar green-lighting of information sharing by the publisher.
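A rough sketch of what such an opt-in could look like as a `Permissions-Policy` response header follows. The `browsing-topics` feature name is the one Chrome documents for the Topics API; `browsing-topics-page-metadata` is purely hypothetical, standing in for a feature that would gate the incremental information. `permissionsPolicyHeader` is likewise a made-up helper.

```typescript
// Maps a feature name to the origins allowed to use it.
// An empty list disables the feature for every origin.
type Allowlist = { [feature: string]: string[] };

// Serializes the allowlist into a Permissions-Policy header value.
function permissionsPolicyHeader(allowlist: Allowlist): string {
  return Object.keys(allowlist)
    .map((feature) => {
      const members = allowlist[feature].map((origin) =>
        origin === "self" ? "self" : `"${origin}"` // origins are quoted, self is not
      );
      return `${feature}=(${members.join(" ")})`;
    })
    .join(", ");
}

const headerValue = permissionsPolicyHeader({
  "browsing-topics": ["self", "https://adtech.example"],
  // Hypothetical feature: not defined by any spec or browser.
  "browsing-topics-page-metadata": ["https://adtech.example"],
});
console.log(`Permissions-Policy: ${headerValue}`);
```

Under this assumed design, a caller missing from the second allowlist would still be classified by host only, so it would learn nothing beyond what the publisher already shares with it.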

[Spec] Provide reasonable limits on the data that can be included in topics classification input #211