The Topics API
https://patcg-individual-drafts.github.io/topics/

Should sites be able to set their own topics via response headers? #1

Open jkarlin opened 2 years ago

jkarlin commented 2 years ago

The classifier is likely to be wrong from time to time, and sites might wish to adjust the topics returned for their site. One way to accomplish that is to allow sites to set their own topics via response headers.

The concern with this is if sites decide that some topics are more valuable than others, and decide to only list valuable topics, polluting the input to the API. How real is this risk?
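
To make the header idea concrete, here is a minimal sketch of how a browser might read such a header. It is only an illustration: the header name `Suggested-Topics`, the comma-separated ID format, the cap of three topics, and the tiny taxonomy below are all assumptions, not anything the proposal defines.

```python
# Hypothetical taxonomy subset (IDs and labels are illustrative only).
TAXONOMY = {1: "Arts & Entertainment", 2: "Autos & Vehicles", 3: "Sports"}


def parse_suggested_topics(header_value: str, max_topics: int = 3) -> list:
    """Parse the value of a hypothetical 'Suggested-Topics' header.

    Malformed tokens, unknown IDs, and duplicates are silently dropped.
    """
    topics = []
    for token in header_value.split(","):
        token = token.strip()
        # Accept plain ASCII digits only, so int() cannot raise.
        if not (token.isascii() and token.isdigit()):
            continue
        topic_id = int(token)
        if topic_id in TAXONOMY and topic_id not in topics:
            topics.append(topic_id)
    return topics[:max_topics]
```

Note that parsing is the easy part. Nothing here checks whether the listed topics actually describe the site, which is exactly the pollution risk raised above.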

jdevalk commented 2 years ago

It would be awesome if, when this happens, we could replace “response headers” with Schema.org metadata.

gui-poa commented 2 years ago

Hi, all! I may have misunderstood how the API infers topics, especially in the part where it talks about hostnames. What about news sites that have thousands of articles on different subjects, but with a single generic hostname? It seems to me that the publisher itself could match its CMS tags with the taxonomy list. That would be a great case for users: those who like to read sports articles would receive sports ads, those who read recipe articles would receive recipe ads, and so on.

The way it is proposed, the old fight between subdomains vs. directories would "come back", now not for SEO but for advertising. And there are already many publishers using directories with only one domain.

dmarti commented 2 years ago

Sites that are misclassified because they have some pages with a different or atypical topic could label those pages as a separate section, allowing for the top-level section to be more representative of the general topics on the site.

Breaking pages out into a section would be less risky than manual topics, because the classifier is still in the loop. See #17

pugzor commented 2 years ago

Seems acceptable that they might be able to set their own Topics, or at least suggest one. Not sure what the benefit to site owners would be though unless the Topics classification is repurposed (unless I'm missing something).

I'd suggest websites should have the option of opting out of Topics too (or ideally, having to opt in). Again, I'm not sure of the benefit to site owners in all but extreme cases, where customers are blindly loyal and are marketed to by competitors for the first time, but it should still be possible. There's nothing stopping classification of websites by means of text processing, so it's a circular argument. I'm sure site owners would appreciate the mechanism though.

dmarti commented 2 years ago

One of the risks of allowing sites to set their own topics is that colluding groups of deceptive or low-engagement sites will claim topics that are associated with high ad revenue. A site would be able to artificially get more lucrative ads by running some user workflows through a page on a different domain that claimed a better set of topics than the user originally had.

Requiring a minimum number of visits to pages with a given topic is another way to address this risk. See #19
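
A minimal sketch of the minimum-visits idea, assuming the browser keeps a per-visit list of topic IDs. The function name, the data shape, and the default threshold of 3 are hypothetical and chosen only for illustration.

```python
from collections import Counter


def topics_meeting_minimum(page_visit_topics, min_visits=3):
    """Return the topics that appear on at least `min_visits` page visits.

    `page_visit_topics` holds one list of topic IDs per page visit.
    """
    counts = Counter()
    for topics in page_visit_topics:
        # Count each topic at most once per visit.
        counts.update(set(topics))
    return {topic for topic, n in counts.items() if n >= min_visits}
```

With a threshold like this, a single detour through a page claiming a lucrative topic would not be enough to attach that topic to the user.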

joshuakoran commented 2 years ago

In the same vein as the risks above (over-generalization, misclassification, and misleading self-attributed classification), all of which can affect marketer effectiveness and therefore publisher revenues, this seems to bring up the unsettled question of determining "quality."

Marketers are trying to match their content to the "right" audience, which is not adequately defined by the sector of goods/services they compete within.

According to the IAB Content Taxonomy, the following URL (https://www.edmunds.com/tesla/sedan) could reasonably be classified with 6 IDs, each of which might appeal to a different characteristic of a prospective buyer:

Which is the "right" topic to assign to this page or an interest for someone who interacts with content like this "enough" to best match a given marketer's ad?

jkarlin commented 2 years ago

Is there not a risk of colluding groups of high-engagement sites playing the same game?

It does seem possible to prevent a site from directly gaining from the topics it suggests by not allowing the topics the site suggests to be returned in calls to the API on that site. But the colluding sites issue still remains.
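
As a rough sketch of that self-exclusion rule, assuming the browser records which topics each site has suggested (the function name and data shapes below are hypothetical):

```python
def topics_visible_to(caller_site, user_topics, suggested_by_site):
    """Return the user's topics, minus any topics the calling site suggested.

    `suggested_by_site` maps a site to the set of topic IDs it has suggested.
    """
    own_suggestions = suggested_by_site.get(caller_site, set())
    return user_topics - own_suggestions


suggested = {"a.example": {2}, "b.example": {3}}
user = {1, 2, 3}
# a.example cannot read back its own suggestion (topic 2), but it can
# still see topic 3, which b.example suggested.
visible_on_a = topics_visible_to("a.example", user, suggested)
```

The usage lines show why this does not solve collusion: each site is only blocked from its own suggestions, so two cooperating sites can still benefit from each other's.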

dmarti commented 2 years ago

I agree. I don't see how it would be practical to let sites assign their own topics. Too many opportunities for topic manipulation by colluding sites.

(It does make sense for users to be able to install extensions that would zap topics they have a problem with and/or add topics they are actively interested in getting ads about: #25)

pugzor commented 2 years ago

There's definitely a risk associated with that. Maybe the solution is that a site 'suggested' Topic (or Topics) isn't a guarantee of the setting? I'm not sure of the exact mechanics but maybe if there's enough of a semantic link between the site/page content and the 'suggested' Topic, then it's adopted, otherwise ignored. Or in cases where the signals for the inferred Topic are weak, there's a higher likelihood of the 'suggested' Topic being adopted.
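
One way to picture those mechanics is the sketch below. It assumes the classifier exposes a confidence for its inferred topic and that some similarity score between the page content and the suggested topic is available. Both scores, both thresholds, and the function name are hypothetical, and the "higher likelihood" idea is simplified here into a hard cutoff.

```python
def choose_topic(inferred_topic, inferred_confidence,
                 suggested_topic, semantic_similarity,
                 similarity_threshold=0.6, weak_signal_threshold=0.3):
    """Pick between the classifier's topic and a site-suggested topic."""
    if suggested_topic is None:
        return inferred_topic
    # Adopt the suggestion when it is semantically close to the content.
    if semantic_similarity >= similarity_threshold:
        return suggested_topic
    # Adopt the suggestion when the classifier's own signal is weak.
    if inferred_confidence < weak_signal_threshold:
        return suggested_topic
    # Otherwise ignore the suggestion.
    return inferred_topic
```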

igrigorik commented 2 years ago

Would love to have a well-known mechanism for sites to "suggest" a set of topics. If and how the browser factors them into the algorithm can be left as an intentional black box, to allow for anti-collusion / spam, etc., but ideally, it would serve as an input into the decision process. In particular, might be useful for sites with non-descriptive or non-obvious hostnames, etc.

In terms of the signaling method, ideally there should be a response header and an equivalent <meta http-equiv> or similar. Use of the latter can be constrained: it must appear before any script and be part of the static HTML (not dynamically created, etc.). Some sites don't have a simple way to alter headers, and vice versa.
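
A minimal sketch of reading the suggestion from either place, using Python's standard `html.parser`. The name `Suggested-Topics` is hypothetical, as is the choice to let the response header win over the meta element. Because the parser only sees the static markup and never runs scripts, dynamically created elements are ignored by construction.

```python
from html.parser import HTMLParser
from typing import Optional

HEADER_NAME = "suggested-topics"  # hypothetical name, compared in lowercase


class SuggestedTopicsMetaParser(HTMLParser):
    """Find the first matching <meta http-equiv> that precedes any <script>."""

    def __init__(self):
        super().__init__()
        self.seen_script = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.seen_script = True
        elif tag == "meta" and not self.seen_script and self.value is None:
            attr = dict(attrs)
            if (attr.get("http-equiv") or "").lower() == HEADER_NAME:
                self.value = attr.get("content")


def suggested_topics_value(response_headers, html) -> Optional[str]:
    """Return the raw suggestion from the header, else from static HTML."""
    for name, value in response_headers.items():
        if name.lower() == HEADER_NAME:
            return value
    parser = SuggestedTopicsMetaParser()
    parser.feed(html)
    return parser.value
```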

bmayd commented 2 years ago

> The concern with this is if sites decide that some topics are more valuable than others, and decide to only list valuable topics, polluting the input to the API. How real is this risk?

It is safe to assume that a meaningful subset of folks will do anything they can to make their pages as valuable as possible, and that most folks who enable the API will look for ways to "optimize" its impact; the incentive is to be valuable, not accurate. The result will presumably be that self-definitions fall somewhere between very accurate and very inaccurate, and would likely be deemed too unreliable to be trusted unless there was some sort of validation and quality rating.

It is analogous to the difficulty with publisher-supplied page signals like meta-tags and descriptions, which run the gamut from very trustworthy to totally unreliable. However, with publisher-supplied signals a buyer can check pages, develop quality scores for domains, and ignore page signals from unreliable sources. With Topics, consumers of the signal aren't allowed to know the domains on which a given browser has based its Topic assignment, and so have no means of gauging the trustworthiness of the Topics signal for that browser.