
The Topics API
https://patcg-individual-drafts.github.io/topics/

Site-seeded topics #50

Open igrigorik opened 2 years ago

igrigorik commented 2 years ago

The topics will be inferred by the browser. The browser will leverage a classifier model to map site hostnames to topics. The classifier weights will be public, perhaps built by an external partner, and will improve over time.
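For concreteness, here's a minimal sketch of that hostname-to-topics step; the lookup table and `classifyHostname` below are invented stand-ins for the real on-device model, whose interface isn't specified:

```ts
// Invented stand-in for the classifier model: in reality this is an
// on-device ML model shipped with the browser, not a lookup table.
type Topic = { id: number; name: string };

const MODEL = new Map<string, Array<{ topic: Topic; score: number }>>([
  ["sports.pub.com", [{ topic: { id: 21, name: "/Sports" }, score: 0.9 }]],
  [
    "pub.com",
    [
      { topic: { id: 21, name: "/Sports" }, score: 0.4 },
      { topic: { id: 5, name: "/Business" }, score: 0.4 },
    ],
  ],
]);

// The proposal classifies the full hostname, not just the registrable
// domain, which is what creates the subdomain pressure discussed below.
function classifyHostname(hostname: string): Topic[] {
  return (MODEL.get(hostname) ?? []).map((e) => e.topic);
}
```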

As others have already pointed out, this poses a challenge for sites that may not have descriptive hostnames, or that span a wide array of topics under the same hostname: a publisher covering sports, business, entertainment, etc.; a merchant with a large catalog of items (home goods, clothing, etc.); and so on. Given that the current proposal considers the hostname, not just the domain name, this might create pressure for sites to adopt more subdomains to help with classification (e.g. sports.pub.com, homeware.shop.com, ...), but that's a costly undertaking with its own side effects.

Separately, there are open questions on misclassification (https://github.com/jkarlin/topics/issues/2) and the ability to set (https://github.com/jkarlin/topics/issues/1) topics.

My hunch is that they're all semi-related and that we could, perhaps, address them by enabling sites to "seed" a set of suggested topics. Going down this route would effectively turn the current proposal into a weakly-supervised classifier model: it makes no strict guarantees about the outcome of the classification, but allows the site to influence it by providing input signals.

More concretely, the rough model here could be...

  1. Site suggested topics MUST be from a set of valid topics
  2. Site suggested topics MAY differ across pages of the hostname

By restricting suggested topics to the predefined list, we're not introducing any new labels/segments. At the same time, enabling sites to provide page-level scoped topics would, I think, address the challenge for multi-topic sites. For example, a publisher or merchant could advertise relevant topics for each section of their site (which paths and pages get which topics is controlled by the site owner). Downstream, the browser can introspect the page-level browsing history of the visitor, build an aggregate count of topics observed for that visitor, apply its own filters/validation, and feed the resulting set as input into the classifier model.

As noted above, this makes no strict guarantees about the final output of the classification, but it enables the site to make suggestions, the browser to audit and filter them, and the classifier to act on them. The net result is that a reader who spends most of their time on the sports section of pub.com, or a buyer on the homewares section of a large merchant, might then receive a relevant classification for the {site, user} tuple.
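To sketch that flow in code (everything here is illustrative: the `<meta name="topics">` carrier, the taxonomy IDs, and the `minPages` threshold are made up, not proposed API surface):

```ts
// Hypothetical page-level declaration, scoped to the current page:
//   <meta name="topics" content="21,5">  // IDs from the predefined taxonomy
const TAXONOMY = new Set([3, 5, 21, 44, 136, 137]); // the valid topic IDs

// Rule 1: suggestions MUST come from the set of valid topics.
function readSeededTopics(metaContent: string): number[] {
  return metaContent
    .split(",")
    .map((s) => Number(s.trim()))
    .filter((id) => TAXONOMY.has(id)); // silently drop off-list IDs
}

// Rule 2: suggestions MAY differ across pages, so the browser aggregates
// per-page seeds over the visitor's browsing history and keeps only the
// topics observed often enough to be credible input signals.
function aggregateSeeds(perPageSeeds: number[][], minPages = 3): Set<number> {
  const counts = new Map<number, number>();
  for (const seeds of perPageSeeds)
    for (const id of new Set(seeds)) counts.set(id, (counts.get(id) ?? 0) + 1);
  return new Set([...counts].filter(([, n]) => n >= minPages).map(([id]) => id));
}
// The resulting set is an *input signal* to the classifier, not its
// output: the model can still discount or override it.
```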

jkarlin commented 2 years ago

Hi Ilya. To be clear, I'd like to find a way for sites/ad-tech to provide better labeling than the browser can automatically determine. The primary concern is that the site/ad-tech might misuse the labels (e.g., give them a different meaning than what is intended, which would limit user controls) or attempt to increase the value of their users by adding unrelated topics.

To be fair, it's possible to add unrelated topics by adding a carefully crafted subdomain today, but that seems relatively unlikely to happen on most sites, whereas adding a line of JavaScript is something a publisher is much more likely to do.

The greater the feature-set for the classifier to observe (e.g., domain, path, query params, page content), the better a job it can do. But we run into privacy concerns.

One option is to listen for a signal from the page indicating that it's okay to read the full path, or perhaps some of the page's content, when determining topics; that is, a declaration that the content of the page isn't user-sensitive. In that case the browser could simply do a better job of classifying that page on its own, or it could take topics as input from the page directly (and perhaps ensure that they're similar to the topics it determines itself before accepting them).
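Roughly, and with everything invented for illustration (the opt-in meta tag, the classifier stubs, the similarity check), that could look like:

```ts
type Topic = number;

// Stand-ins for the browser's on-device classifiers.
const classifyByHostname = (hostname: string): Topic[] =>
  hostname.startsWith("sports.") ? [21] : [];
const classifyByContent = (url: URL, body: string): Topic[] =>
  /opera/i.test(body) ? [44] : classifyByHostname(url.hostname);

// Hypothetical opt-in, e.g. <meta name="topics-source" content="page">,
// declaring that this page's path and content aren't user-sensitive.
function topicsForPage(
  url: URL,
  body: string,
  pageOptedIn: boolean,
  pageAsserted: Topic[] | null
): Topic[] {
  if (!pageOptedIn) return classifyByHostname(url.hostname);

  // With the opt-in, the browser can classify the full page itself...
  const own = classifyByContent(url, body);
  if (!pageAsserted) return own;

  // ...or take topics from the page directly, keeping only those that
  // are consistent with its own determination before accepting them.
  const plausible = new Set(own);
  const accepted = pageAsserted.filter((t) => plausible.has(t));
  return accepted.length > 0 ? accepted : own;
}
```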

But what is the incentive for a given publisher to provide more data than they absolutely have to? I can see the ad-tech on a page wanting to do so, but not the publisher. Large first parties that run their own ads would be unlikely to add this data.

martinthomson commented 2 years ago

> it's possible to add unrelated topics by adding a carefully crafted subdomain

Isn't that a security problem? My understanding was that prohibiting sites from choosing topics was necessary because it's challenging to ensure that sites are both honest and consistent in their choices. As you note, there is no value to a site in providing accurate topics; that benefit is realized by every site except the current one.

My thought was that you would have to limit the model input to the registrable domain for those reasons.

jkarlin commented 2 years ago

> My thought was that you would have to limit the model input to the registrable domain for those reasons.

Given that there isn't a large incentive, I don't see publishers changing their subdomains in this way; there's too much friction to be worth it. If it comes to it, we could consider restricting the input to eTLD+1.

igrigorik commented 2 years ago

> One option is to listen for a signal from the page indicating that it's okay to read the full path, or perhaps some of the page's content, when determining topics; that is, a declaration that the content of the page isn't user-sensitive. In that case the browser could simply do a better job of classifying that page on its own, or it could take topics as input from the page directly (and perhaps ensure that they're similar to the topics it determines itself before accepting them).

👍🏻

michaelkleber commented 2 years ago

If we have an on-device model that uses path, query params, or page content, then it's not hard to imagine a page modifying those to result in any particular set of topics. I don't see a reason to make pages jump through such hoops to push the model to some topic assignment; a simple API to express what topic it claims it's about gets us to the same place with much less work.

I like the thinking behind both Ilya's "allows the site to influence" and Josh's "perhaps ensure that they're similar to the topics it itself determines". It seems like this is leaning in the direction of two outputs from the on-device classifier: one set of "best guess topics" that could be the domain classification, like today, and another larger set of "plausible topics" that could supplant the best-guess ones if the page asserted them.

This would still leave the browser in a position where we can expect the topics to mean what they say — an ad tech wouldn't be able to use "44 /Arts & Entertainment/Opera" as secret code for "Porn" because the "plausible topics" filter would usually reject it.
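Sketched with made-up thresholds (one classifier pass, two cutoffs):

```ts
type Scored = { topic: number; score: number };

// A tight cutoff for the browser's best guess, a loose one for what a
// page may plausibly claim; both thresholds are invented for illustration.
function classify(scores: Scored[]) {
  return {
    bestGuess: scores.filter((s) => s.score >= 0.7).map((s) => s.topic),
    plausible: new Set(
      scores.filter((s) => s.score >= 0.1).map((s) => s.topic)
    ),
  };
}

// Page-asserted topics supplant the best guess only if plausible, so an
// off-topic assertion is rejected and the browser's own guess stands.
function finalTopics(scores: Scored[], asserted: number[]): number[] {
  const { bestGuess, plausible } = classify(scores);
  const accepted = asserted.filter((t) => plausible.has(t));
  return accepted.length > 0 ? accepted : bestGuess;
}
```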

It would leave us vulnerable to a more subtle cross-site communication attack: If the plausibility of "137 /Computers & Electronics/Networking" and "136 /Computers & Electronics/Network Security" are highly correlated, then during a particular week, an ad tech could have the pages on some site assert either 136 or 137 depending on bit 1 of a user's ID, and then hope to read that bit on other sites the next week. This would be slow and cumbersome, but I think that is the type of risk that comes with any scheme where pages can influence the topics derived from page visits.
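To make the channel concrete (illustrative only; it presumes both topics pass the plausibility filter on the colluding site's pages):

```ts
// Week N, on the colluding site: encode one bit of a user identifier by
// choosing which of two highly correlated, both-plausible topics to assert.
function assertedTopic(userIdBit: 0 | 1): number {
  return userIdBit === 1
    ? 136 // /Computers & Electronics/Network Security
    : 137; // /Computers & Electronics/Networking
}
// Week N+1, on some other site: the same ad tech calls
// document.browsingTopics() and infers the bit from which topic surfaced.
// One bit per epoch per topic pair: slow, but a real cross-site channel.
```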

jkarlin commented 2 years ago

@michaelkleber if you're going to limit the page's topics to what the classifier comes up with anyway, then why not just fall back to the idea of a signal from the publisher that it's okay to include data from the URL or body when determining topics? Is it preferable to allow the pub to specify a subset of those topics? It seems like it'd be quite a bit more work for the pub to actually provide the topics, and it risks being slightly wrong (e.g., listing a related topic at a different level in the hierarchy than the one found by the classifier). This also helps to limit the attacks you describe.

michaelkleber commented 2 years ago

A publisher who puts effort into assigning labels will surely out-perform the classifier we ship, even if we're explicitly told we can use page content. (To take an extreme example, we're not about to start analyzing video on-device for topic extraction!) And if a pub does want to tell the browser their own labels, then it doesn't make any sense for the channel to be "modify the query params of the URL so that our ML model guesses what you want it to."

I suppose you're thinking about the publisher who doesn't want to assign topics themselves, but does want the browser to do a better job than a per-domain-name ML model? Sure, I get that, and I don't object to supporting it. But if we do, then surely it will be possible for a publisher to coerce the ML model into producing the labels they want. That forces me to conclude that (1) we need a system that is robust to abuse along those lines, and (2) we might as well make it easy for them to express their intent, instead of making it an ugly hack.

jkarlin commented 2 years ago

> A publisher who puts effort into assigning labels will surely out-perform the classifier we ship, even if we're explicitly told we can use page content. (To take an extreme example, we're not about to start analyzing video on-device for topic extraction!) And if a pub does want to tell the browser their own labels, then it doesn't make any sense for the channel to be "modify the query params of the URL so that our ML model guesses what you want it to."

I was operating under the assumption, from your earlier comment, that the pub would only be allowed to suggest topics that the browser's classifier agreed with: "I like the thinking behind both Ilya's 'allows the site to influence' and Josh's 'perhaps ensure that they're similar to the topics it itself determines'."

In that case, there isn't much value added by the pub suggesting its own topics, except to remove topics that the classifier is wrong about.

michaelkleber commented 2 years ago

Ah, got it. There are two scenarios where I think there is benefit from the combination of an on-device "plausible topics" classifier and publisher suggestions:

  1. A "plausible topics" model could be much more lax, perhaps emitting many topics with much lower confidence on each.

  2. A "plausible topics" model could emit topics that apply to the entire site, leaving the publisher-chosen topics to handle the fact that different pages on the site are about different things (sketched below).

These are in a sense both covered by "remove topics that the classifier is wrong about" — but that's a powerful change because it lets the model be wrong in ways that we don't need to worry about.
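Scenario 2, sketched (the site-wide plausible sets below are invented data):

```ts
// The model emits a loose, site-wide plausible set once; each page's
// publisher-chosen topics then narrow it to what that page is about.
const sitePlausible = new Map<string, Set<number>>([
  ["pub.com", new Set([21, 5, 3])], // sports, business, entertainment
]);

function pageTopics(site: string, asserted: number[]): number[] {
  const plausible = sitePlausible.get(site) ?? new Set<number>();
  const accepted = asserted.filter((t) => plausible.has(t));
  // A wrong assertion is harmless: it simply falls back to the
  // site-level view, so the model can afford to be loosely "wrong".
  return accepted.length > 0 ? accepted : [...plausible];
}

// A sports article on pub.com asserting [21] yields /Sports; the same
// assertion on an unrelated site would be rejected.
```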

dmarti commented 2 years ago

On a large, multi-contributor site, the contributor and the site administrator are different. The contributor chooses what topics to cover, and the site owner doesn't know in advance what topics will be on a contributor's blog, newsletter, or channel.

The publisher-chosen topics on that kind of site could be the topics that the contributor tells the site owner they're covering, but that list might be out of date, or less accurate than what the classifier could produce.