Open dmarti opened 1 year ago
This doesn't actually address the privacy concerns from #118. Further, it picks a single site (a rather arbitrary heuristic) as opposed to applying equally web wide, which doesn't seem particularly webby. Finally, due to filtering, there would be some benefit to all from this (global top topic selection being more refined) but one would still have to observe the user on some page with that topic in order to receive it.
I agree that it's suboptimal to treat a single site as a special case. But as long as there is no more general approach to the YouTube problem being pursued, this would be better than nothing. Possibly other very large sites that also cover all or most topics could be special cased as well.
I think this feature request should be interpreted as something like: "For some browser-chosen list of Special Topic Provider Sites, pages on those sites should be able to declare what Topics they are about, and those become available to everyone, as if every Topics caller had observed them. And also YouTube should be on that list." In this sense it's more like a restricted version of #1 than of #118.
I don't know that I agree with this proposal! — no idea whether YouTube would be interested in being a Special Topic Provider, no idea how we would determine what other sites should have the same special status, etc. But this version seems "tricky and subtle" rather than "impossible".
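To make the "as if every Topics caller had observed them" reading concrete, here is a minimal sketch of per-caller filtering with that one extra rule. It is only an illustration of the idea in this comment, not how any browser implements Topics; `TopicRecord`, `declared_by_stps`, and `topics_for_caller` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TopicRecord:
    """One of the user's top topics for an epoch (hypothetical structure)."""
    topic: str
    # Callers that observed the user on a page carrying this topic.
    observed_by: set[str] = field(default_factory=set)
    # Assumption: set when a page on a Special Topic Provider Site declared it.
    declared_by_stps: bool = False


def topics_for_caller(caller: str, records: list[TopicRecord]) -> list[str]:
    """Return the topics this caller may receive after filtering."""
    return [
        r.topic
        for r in records
        # STPS-declared topics pass the filter for every caller;
        # all other topics still require that the caller observed them.
        if r.declared_by_stps or caller in r.observed_by
    ]
```

Under this sketch, a caller that never observed the user on a given topic would still receive that topic if an STPS page declared it, which is the difference from ordinary filtering.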
@michaelkleber That makes a lot of sense. The list doesn't have to be browser-chosen.
I have rewritten the text of this issue to cover Special Topics Provider Sites, as @michaelkleber suggested. This seems like a possible path forward considering that #118 was closed, and that there still appears to be interest in fairly classifying content from large, multi-topic sites. See p. 7 of the CMA update report on implementation of the Privacy Sandbox commitments, April 2023.
I think you can achieve the same effect with a default () permission policy that declares that the page would like to include something other than domain in its topics rather than needing to make changes to enrollment.
Don, I see you're still hoping that the browser does the work of turning the "section or channel name" into topics, rather than letting the STPS just declare the page's topics directly. Is that distinction important to you?
It seems to me that the way to turn a YouTube channel name into a Topic could be very different from how you turn a hostname into a Topic. So it feels like this version of the proposal implicitly asks browsers to build a specialized STPS-to-Topics model for each Provider Site.
On the one hand, that seems like putting the work in the wrong place: Surely the site is in a good position to do a better job! On the other hand, you might worry that an STPS would be able to abuse this by maliciously giving out the wrong topics — but if you're letting them control the "section or channel name" input and the model is public, then surely it would be easy for them to maliciously push false topics either way.
Hi @michaelkleber -- I don't know. On one hand, it seems like the choice of whether or not to allow sites or channels to choose their own topics should apply to both sites and channels or to neither. Some hostnames provide usable Topics API information to the classifier, and others don't. Some YouTube channel names provide usable information to the classifier, and others don't. (For example, Jalopnik dot com is about cars, but it's a made-up word so it doesn't get classified, last I checked. And the YouTube channel "LazerPig" is not about lasers or pigs. Other site and channel names have better keywords in them.) You might be able to use the same classifier for hostnames and channels/sections if STPSs had to transform the channel name into something that would be a valid hostname ("My YouTube Channel" becomes "my-youtube-channel" or similar).
On the other hand, there are relatively few STPSs and it would be fairly straightforward to spot-check how accurately they were assigning topics to each channel, so it might be fine to have STPSs pass topics directly.
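The hostname-style transformation mentioned above ("My YouTube Channel" becomes "my-youtube-channel") could look roughly like the following. This is a sketch under assumptions: the function name `channel_to_hostname_label` is hypothetical, and nothing in this thread specifies how non-ASCII names or over-long names should be handled.

```python
import re
import unicodedata


def channel_to_hostname_label(name: str) -> str:
    """Turn a channel or section name into a string usable as a DNS label."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, then collapse each run of other characters into one hyphen.
    label = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    # DNS labels are at most 63 characters and may not end with a hyphen.
    return label[:63].rstrip("-")
```

Note that a name written entirely in a non-Latin script would come out empty under this sketch, so a real scheme would need some other fallback.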
@jkarlin Yes, that seems to be another workable option.
> On one hand, it seems like the choice of whether or not to allow sites or channels to choose their own topics should apply to both sites and channels or to neither.

Hmm, the two questions feel quite different to me. Changing a domain name is both much harder and much more user-visible than changing an invisible meta tag on a page, for example. Using something user-visible seems like a huge contributor to maintaining quality of input data.
But a lot of this comes around to the question of what qualifications a site would need to have to be an STPS. Besides just being large and heterogeneous, if we think it would include a site being more "reputable" in some way, then perhaps that reputation would lead us to expect a lower chance of being pushed useless/fabricated topics. (OTOH would you let Reddit onto the list? Seems all-but-guaranteed that some subreddits would claim a random absurd topic for each pageview.)
@michaelkleber Yes, I agree about the Reddit problem (one of the current best international news subreddits has a deliberately embarrassing and NSFW name in an effort to avoid ads, and they would probably pass the most embarrassing possible topics too). But there are few enough STPSs that the browser (or other STPS list maintainer) could check the privacy policy for whether it covers passing best-effort accurate topics or something else, and spot-check what the site is actually passing.
Some sites that are eligible to be STPSs will probably not see a reason to do it until some other party offers them an incentive to more accurately classify their audiences. In that case the other party will be in a position to require and check that the STPS is passing accurate topics, and the browser won't need to enforce.
> the browser (or other STPS list maintainer) could check the privacy policy for whether it covers passing best-effort accurate topics or something else, and spot-check what the site is actually passing.
This strikes me as very unappealing, and we should do whatever we can to avoid ending up in that position.
Yes, but it's less unappealing the fewer privacy policies you have to read. The number of pages and topics required for STPS status can be set high enough to keep the work on the browser (or independent evaluator) easily manageable, and not all sites eligible for STPS will apply.
If we were to go in the direction of allowing metadata, then it might make sense to do so in a page-level opt-in way to address privacy concerns. My primary concern there is that I imagine very few pages would opt in, as it's unclear what their incentive would be. And without a significant user base, it's hard to justify the costs of training the new model and having it sit on users' devices.
Hi @jkarlin, yes, that's a good point. There are at least two scenarios in which a large, multi-topic site will choose page-level opt-in or STPS.
1. Competition regulators require page-level opt-in and/or STPS when a company owns both a Topics API browser and a large, general-interest site that would otherwise benefit in an illegal or questionable way from domain-based Topics API training.
2. Adtech intermediaries compensate large, general-interest publishers for providing additional data that they can use to increase ad revenue on other sites. (In this case the intermediary is motivated to check on the site's topics, so there would be little administrative burden on the browser maintainers.)
The first scenario is the one that seems to be the immediate problem. I know that either opt-ins or STPS would represent additional development work, but realistically considering the time required for browser development tasks compared to the time required for regulator and lawyer meetings, it seems to me that it's worth the additional time to implement Topics API in a way that takes some meaningful steps toward treating niche sites and YouTube channels in a comparable way.
Added #224 to cover the opt-in suggested by @jkarlin
Check to see if the page is from a Special Topics Provider Site (STPS), one that hosts content on many topics (such as youtube.com). If so:
Special Topics Provider Sites could enroll, using the existing enrollment process, specifying that they want to be part of the STPS program. The browser or an independent party could crawl the site and check that the site has at least "n" pages that are classified as at least "m" different topics before adding the site to the STPS list.
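As a rough illustration of the eligibility check, the sketch below reads the "n pages, m topics" requirement as: at least `n` crawled pages received a classification, and those pages span at least `m` distinct topics. That reading, the `page_topics` input shape, and the name `is_stps_eligible` are all assumptions made for the example.

```python
from __future__ import annotations


def is_stps_eligible(page_topics: dict[str, set[str]], n: int, m: int) -> bool:
    """Check a crawl result against the n-pages / m-topics thresholds.

    page_topics maps each crawled page URL to the set of topics the
    crawler's classifier assigned to it (empty if unclassified).
    """
    # Only pages that received at least one topic count toward n.
    classified_pages = [topics for topics in page_topics.values() if topics]
    # Union of all topics seen across the classified pages.
    distinct_topics = set().union(*classified_pages)
    return len(classified_pages) >= n and len(distinct_topics) >= m
```

Setting `n` and `m` high keeps the list short, which matters for the spot-checking workload discussed in the comments.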
(simpler solution to achieve a large fraction of the benefits of https://github.com/patcg-individual-drafts/topics/issues/118 with less complexity and risk)