patcg-individual-drafts / topics

The Topics API
https://patcg-individual-drafts.github.io/topics/

What topic taxonomy should be used long term? #3

Open jkarlin opened 2 years ago

jkarlin commented 2 years ago
patmmccann commented 2 years ago

The taxonomy should be detailed on datalabel.org if it is new; it could be a vendor-specific taxonomy (e.g. code 600), or it could be the IAB Audience Taxonomy 1.1 (segtax 4) in OpenRTB.
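For readers unfamiliar with the segtax convention, here is a minimal Python sketch of how a taxonomy code might travel alongside segment IDs in an OpenRTB-style `user.data` object. It assumes the `ext.segtax` placement used by the OpenRTB segment-taxonomy community extension; the provider name and segment IDs are hypothetical placeholders, not real taxonomy entries.

```python
import json

# Taxonomy codes mentioned in the comment above.
SEGTAX_IAB_AUDIENCE_1_1 = 4   # IAB Audience Taxonomy 1.1
SEGTAX_VENDOR_EXAMPLE = 600   # example of a vendor-specific code


def make_data_object(provider: str, segtax: int, segment_ids: list[str]) -> dict:
    """Build one OpenRTB-style `data` object tagged with its taxonomy code."""
    return {
        "name": provider,
        "ext": {"segtax": segtax},  # assumed placement of the taxonomy code
        "segment": [{"id": sid} for sid in segment_ids],
    }


# "example.com" and the segment IDs are hypothetical placeholders.
user = {
    "data": [make_data_object("example.com", SEGTAX_VENDOR_EXAMPLE, ["1", "57"])]
}
print(json.dumps(user, indent=2))
```

The point of the code is that a receiver can tell which taxonomy the segment IDs belong to without guessing, whichever taxonomy is eventually chosen.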

jdevalk commented 2 years ago

I’d suggest using the ODP’s taxonomy: http://odp.org/

johnsabella commented 2 years ago

Thanks, Josh, for the very interesting API spec. Long term, the best place for the taxonomy is within the ad-business standards body, IAB Tech Lab. Even if the taxonomy today is not ideal, getting involved and helping to shape it in that setting will greatly benefit the industry overall. https://iabtechlab.com/standards/audience-taxonomy/

JamesFinlayson-zz commented 2 years ago

Google already has an NLP-based API for topic taxonomies through content categories. This list of 620 categories is pretty comprehensive and has existed since at least 2018 so is, we can assume, pretty battle-tested at this point. The use of NLP removes the need (or opportunity!) for manual assignment and thus the inevitable gaming that would occur around it. As a result, would it make sense, at least to begin with, to lean on that existing list?

hu0p commented 2 years ago

@jdevalk Are you suggesting this as a starting point or are you suggesting the owners of ODP should be maintainers? Their About page is interesting, but it doesn't seem to mention much beyond the origins of the taxonomy and some funding concerns. I don't have any particular objections here. I'm only curious about additional information you might have (particularly about who is behind ODP and their interest in involvement) and any other general supporting thoughts for your suggestion.

@johnsabella Agreed on all points. I spotted this in the proposal and was drawn to it. I particularly like the idea of a relevant third-party standards organization maintaining and improving the taxonomy. However, I'm very curious about the applicability of this list. The bulk of it consists of demographic and purchase-intent data, with only 496 interests listed. It sounds like the purchase-intent and demographic data couldn't be directly adapted to work within the context of the Topics API. @jkarlin, could you speak to this?

@JamesFinlayson, the list you linked appears to have a handful (not enough to completely rule it out by any means) of sensitive topics. Otherwise, it appears to be very similar to the starting list that is currently in this repo. A diff of the two lists reveals a lot of overlap with some rephrasing or careful omission. @jkarlin or someone more informed than me would have to say more, but I wouldn't be surprised if the current starting list is based in part on what you shared.

@jkarlin Apologies if this is mentioned in the proposal (I haven't had a chance to look at it again since yesterday), but would it make sense for the API to include a method to return the full list of up-to-date topics over time?

bquinn commented 2 years ago

Hi @jkarlin and all, FYI we at IPTC maintain a news- and media-specific subject taxonomy (controlled vocabulary), the IPTC Media Topics, at https://cv.iptc.org/newscodes/mediatopic/. It is available as a SKOS vocabulary in various forms of RDF (Turtle, RDF/XML and JSON-LD) in 12 languages and language variants. See the IPTC CV server guidelines for tips on how to download the vocabularies or individual terms using URL query strings or HTTP content negotiation.

I would suggest that whatever taxonomy is used in the long term, it is represented in SKOS or something like it, where each term in the vocabulary has its own URI. This allows for lookup to obtain the name in multiple human languages, lets you specify the hierarchy in machine-readable form, and allows mapping across vocabularies.

For example, we already map most Media Topics terms to Wikidata concept URIs, which then allows for mappings to other subject vocabularies.
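As a rough illustration of what a URI-per-term vocabulary makes possible, the following Python sketch models three SKOS ideas in plain data structures: preferred labels per language, a `broader` link for hierarchy, and an exact-match link to another vocabulary. All URIs here are hypothetical `example.org` placeholders, and the class and helper names are invented for this sketch; a real implementation would more likely load the published RDF with an RDF library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    uri: str                          # each term has its own URI
    labels: dict[str, str]            # language tag -> preferred label
    broader: Optional[str] = None     # URI of the parent term, if any
    exact_match: list[str] = field(default_factory=list)  # URIs in other vocabularies


SPORT = "https://example.org/topic/sport"        # hypothetical URI
FOOTBALL = "https://example.org/topic/football"  # hypothetical URI

vocab = {
    c.uri: c
    for c in [
        Concept(SPORT, {"en": "sport", "de": "Sport"}),
        Concept(
            FOOTBALL,
            {"en": "football", "de": "Fußball"},
            broader=SPORT,
            exact_match=["https://example.org/other-vocab/soccer"],
        ),
    ]
}


def label(uri: str, lang: str, fallback: str = "en") -> str:
    """Return the label in `lang`, falling back to `fallback` if missing."""
    labels = vocab[uri].labels
    return labels.get(lang, labels[fallback])


def ancestors(uri: str) -> list[str]:
    """Walk the `broader` links up to the top of the hierarchy."""
    chain = []
    parent = vocab[uri].broader
    while parent is not None:
        chain.append(parent)
        parent = vocab[parent].broader
    return chain


print(label(FOOTBALL, "de"))  # label lookup by language
print(ancestors(FOOTBALL))    # machine-readable hierarchy
```

Because every term is identified by a URI rather than by its English name, renaming or translating a label does not break references to the term.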

Good luck with your project!

JamesFinlayson-zz commented 2 years ago

@hu0p - yes, they are very similar; thanks for the useful diff link. My main point was really in favor of using a library, such as (but not necessarily) the one the Content Categories API provides, to set the topics automatically rather than allowing users to set them. This is important, I believe, (a) to address @jkarlin's transparency point by allowing anyone to check a site's declared topics, (b) to reduce the potential for topic bloat over time, and (c) to reduce the ease with which the system can otherwise be gamed.

wayne-innity commented 2 years ago

To be useful to the ads industry, we would highly appreciate it if Topics could adhere to the taxonomy list published by the IAB (https://www.iab.com/guidelines/content-taxonomy/).

avuim commented 2 years ago

@wayne-innity wrote: "To be useful to the ads industry, we would highly appreciate it if Topics could adhere to the taxonomy list published by the IAB (https://www.iab.com/guidelines/content-taxonomy/)."

Could not agree more. Aside from Chrome's Topics API, there are other browsers and other methods of signaling content taxonomies to the demand side, most of which rely on the IAB Tech Lab Content Taxonomy. Aligning with an industry standard makes sense here. The four tiers in the Content Taxonomy (currently 3.0) provide enough flexibility to decide on the depth and number of topics provided.

avuim commented 2 years ago

Other (even partly Google-driven) reasons to follow the IAB Tech Lab taxonomy, and thereby align with industry standards and initiatives:

lbdvt commented 2 years ago

To increase the utility of Topics, and for Topics to better inform marketers about users' buying habits and intents, we suggest using a granular commerce taxonomy such as the Google Product Taxonomy.

dmarti commented 2 years ago

@lbdvt Topics API is information that is provided on first visit to a new site. More specific lists tend to have a lot of topics that aren't really material that a user would be likely to volunteer when making a first impression. Some examples include:

5824 - Health & Beauty > Personal Care > Oral Care > Denture Cleaners
6562 - Health & Beauty > Personal Care > Ear Care > Ear Wax Removal Kits
7336 - Health & Beauty > Health Care > Medical Tests > HIV Tests
1695 - Cameras & Optics > Optics > Scopes > Weapon Scopes & Sights
5506 - Apparel & Accessories > Clothing > Outerwear > Chaps

(If I show up with those Topics on a job application site, am I more or less likely to get a call back from the hiring manager?)
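To make the granularity trade-off concrete, here is a small Python sketch that parses entries in the `ID - A > B > C` form shown above and truncates each path to a chosen depth. The function names and the depth limit of 2 are assumptions for illustration; nothing here reflects how Topics itself selects or filters topics.

```python
def parse_entry(line: str) -> tuple[int, list[str]]:
    """Split 'ID - A > B > C' into (ID, ['A', 'B', 'C'])."""
    raw_id, _, path = line.partition(" - ")
    return int(raw_id), [part.strip() for part in path.split(" > ")]


def coarsen(path: list[str], max_depth: int) -> list[str]:
    """Keep only the first `max_depth` levels of a category path."""
    return path[:max_depth]


# Two of the entries quoted in the comment above.
ENTRIES = [
    "5824 - Health & Beauty > Personal Care > Oral Care > Denture Cleaners",
    "1695 - Cameras & Optics > Optics > Scopes > Weapon Scopes & Sights",
]

for line in ENTRIES:
    topic_id, path = parse_entry(line)
    # The original ID refers to the full path, so only the truncated
    # path is printed once the entry has been coarsened.
    print(" > ".join(coarsen(path, max_depth=2)))
```

Truncation reduces specificity but does not by itself remove sensitive categories, so a separate review of the remaining topics would still be needed.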

dmarti commented 2 years ago

The underlying problem is that any taxonomy that's specific enough for the needs of legit advertisers and trusted ad-supported publishers is too specific for users to share with untrusted sites, or with sites that are trusted in a non-shopping context (you don't share your bass guitar playing or home biochemistry lab interests with the property management site where you're applying for an apartment). One possible way to allow more detailed taxonomies to be used would be a policy on authorized callers: https://github.com/patcg-individual-drafts/topics/issues/87