patcg-individual-drafts / topics

The Topics API
https://patcg-individual-drafts.github.io/topics/

Callers getting topics according to a priority list #42

Open lbdvt opened 2 years ago

lbdvt commented 2 years ago

A caller may not get the same signal from every topic when selecting an ad. For instance, "Auto insurance" may be more useful than "Vegan Cuisine".

Would it be possible for callers to provide a ranked priority list of topics, for example at a .well-known location, and for the API to return topics, if eligible, according to this priority list?

jkarlin commented 2 years ago

I like the idea in spirit, but in practice it runs up against a privacy concern: if different callers on a site receive different topics, then those callers can talk to each other and quickly learn far more topics per week for a user than intended.

You could then imagine that, say, the first caller on the site for the week gets to set a preference and the other callers on the page are stuck with the first caller's preference. But that doesn't seem fair either. So the plan is to choose randomly.
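To make the collusion concern concrete, here is a minimal sketch. It is not the Topics API implementation: the caller names, the site name, the priority lists, and both helper functions are hypothetical, and the shared pick is modeled simply as one seeded random choice per site and week.

```python
import random

def topic_by_priority(top_topics, priority):
    """Hypothetical per-caller selection: return the first topic in the
    caller's priority list that is among the user's top topics."""
    for topic in priority:
        if topic in top_topics:
            return topic
    return None

def topic_shared_random(top_topics, site, epoch, seed="demo"):
    """Hypothetical shared selection: one random pick per (site, epoch),
    so every caller on that site sees the same topic."""
    rng = random.Random(f"{seed}|{site}|{epoch}")
    return rng.choice(sorted(top_topics))

# Illustrative data only.
top_topics = {"Auto insurance", "Vegan Cuisine", "Online Video",
              "Search Engines", "Social Network"}
priorities = {
    "caller-a.example": ["Auto insurance", "Vegan Cuisine"],
    "caller-b.example": ["Vegan Cuisine", "Auto insurance"],
    "caller-c.example": ["Online Video"],
}

# Callers that compare notes learn the union of what each one received.
learned_with_priorities = {
    topic_by_priority(top_topics, p) for p in priorities.values()
}
learned_with_shared_pick = {
    topic_shared_random(top_topics, "news.example", epoch=1) for _ in priorities
}

print(len(learned_with_priorities), len(learned_with_shared_pick))
```

With per-caller priority lists the three colluding callers learn three distinct topics on one site in one week, whereas with a shared pick they learn only one.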

lbdvt commented 2 years ago

More generally, I'm worried about the signal that can be gained from the topics, and how useful it can be for "advertising based on generic interests".

If, for instance, YouTube, Google, and Facebook call the Topics API on their pages, a very significant portion of users may have "Online Video", "Search Engines", and "Social Network" in their top 5 topics, which I don't see as very helpful for advertising.

What are your thoughts on this?

jkarlin commented 2 years ago

I think we ought to explore this issue. As a simple idea, we could weight topics by overall frequency on the web (e.g., find the topics of pages in the HTTP Archive and weight topics inversely by frequency). This would help to overcome the issue you've described.
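As a rough illustration of that weighting idea, the sketch below divides each of a user's topic counts by the topic's share of pages on the web. The function name, the `user_counts` and `web_frequency` inputs, and all numbers are made up for illustration; nothing here comes from the HTTP Archive or from the API itself.

```python
def top_topics_inverse_frequency(user_counts, web_frequency, k=5):
    """Rank a user's topics by visit count divided by the topic's overall
    frequency on the web, and return the k highest-scoring topics."""
    scores = {
        topic: count / web_frequency[topic]
        for topic, count in user_counts.items()
        if web_frequency.get(topic, 0) > 0  # skip topics with no known frequency
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical weekly visit counts for one user.
user_counts = {
    "Online Video": 40, "Search Engines": 50, "Social Network": 30,
    "Auto insurance": 4, "Vegan Cuisine": 3,
}
# Hypothetical share of web pages classified under each topic.
web_frequency = {
    "Online Video": 0.20, "Search Engines": 0.25, "Social Network": 0.15,
    "Auto insurance": 0.004, "Vegan Cuisine": 0.002,
}

top_two = top_topics_inverse_frequency(user_counts, web_frequency, k=2)
print(top_two)
```

With these made-up numbers the rarely seen topics outrank the ubiquitous ones even though the user visited them far less often.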

There are other concerns that I have as well in picking the top topics. For instance, let's say the user frequently visits pages about two different sports, but neither individual sport has enough visits to make it a top 5 topic for the week. Combined, though, they would. Should the parent in the hierarchy, sports, then be chosen?
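One way to picture the question is to roll each topic's count up to its ancestors. This is only a sketch of one possible answer, not a proposal from the thread: the `parent_of` map, the topic names, and the counts are hypothetical.

```python
from collections import Counter

def roll_up_to_parents(counts, parent_of):
    """Return counts where each topic's visits are also added to every
    ancestor in the hierarchy."""
    rolled = Counter(counts)
    for topic, count in counts.items():
        node = parent_of.get(topic)
        while node is not None:
            rolled[node] += count
            node = parent_of.get(node)
    return rolled

# Hypothetical counts: neither sport alone beats "News" or "Arts".
counts = {"Sports/Tennis": 3, "Sports/Golf": 3, "News": 5, "Arts": 4}
# Hypothetical hierarchy: child topic -> parent topic.
parent_of = {"Sports/Tennis": "Sports", "Sports/Golf": "Sports"}

rolled = roll_up_to_parents(counts, parent_of)
print(rolled.most_common(1))
```

Here the parent "Sports" ends up with a count of 6 and becomes the top entry, which also shows how roll-up tends to favor higher-level topics.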

stguav commented 2 years ago

Since #46 was merged, restating some points here. Let me request that we clarify the current proposal on how topics are ranked, or make the uncertainty more explicit.

The main questions there that are not explicitly here:

Regarding the taxonomy hierarchy, one convenient way of handling it:

jkarlin commented 2 years ago

I agree that keeping hierarchy in mind is likely to trend toward higher-level items, which is a concern. I think the TF-IDF approach has potential. Basically, we'd want to measure the inverse frequency of topics (as opposed to documents) based on user traffic. This does require knowledge of how often users visit various sites and what their topics are. Topics can be derived via the Topics API model. But the traffic data would likely need to come from Chrome's data, which isn't public. That is, unless someone is aware of a good public dataset? I'll look into what can be done. On the bright side, the resulting list of weights would be small (~350) and each topic would be represented by a large number of users. So I think we'd have some pretty solid differential privacy properties with a little bit of noise.
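A minimal sketch of what IDF-style topic weights with a little noise could look like, under loud assumptions: the visit counts are invented, the function name and `epsilon` parameter are hypothetical, and the Laplace scale of `1 / epsilon` assumes each user changes any count by at most one, which real traffic data would not satisfy without extra bounding.

```python
import math
import random

def noisy_idf_weights(topic_visits, epsilon=1.0, rng=None):
    """Add Laplace noise to per-topic visit counts, then compute
    log(total / count) as an inverse-frequency weight per topic."""
    rng = rng or random.Random()
    noisy = {}
    for topic, visits in topic_visits.items():
        # The difference of two exponentials with rate epsilon is a
        # Laplace(0, 1/epsilon) sample.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        # Clamp so the logarithm below stays defined (illustrative choice).
        noisy[topic] = max(visits + noise, 1.0)
    total = sum(noisy.values())
    return {topic: math.log(total / v) for topic, v in noisy.items()}

# Hypothetical aggregate visit counts per topic.
topic_visits = {
    "Online Video": 1_000_000,
    "Auto insurance": 20_000,
    "Vegan Cuisine": 5_000,
}

weights = noisy_idf_weights(topic_visits, epsilon=1.0, rng=random.Random(0))
for topic, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{topic}: {weight:.3f}")
```

Because each count aggregates many users, noise at this scale barely moves the weights, which is the intuition behind the remark about large per-topic populations.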

jkarlin commented 2 years ago

I’d like to stick to the topic of initial weighting for each topic based on its value before we go into hierarchical concerns, repeated topics, etc. Those seem like optimizations that should come after we have a better idea of what a topic is actually worth.

So far we’ve discussed using inverse frequency of topics as a proxy for value, but I’d like to see if we can get a more direct idea of commercial value first. Perhaps the IAB Tech Lab can help us out here.

Hey IAB Tech Lab! (@angelinaeng, @bjd326) We're pretty sure that there is room for improvement in how the Topics API weighs the user’s top 5 topics. We'd like to utilize a notion of topic value that represents the opinions of a large body of the industry. Do you have (or might you be interested in creating) some sort of indication of value for each of the topics in your content taxonomy that we could then apply to Topics as well? Even something as simple as a 1-to-5 scale of commercial utility could be a useful foundation. Perhaps a discussion we could have in an upcoming IAB meeting.

jkarlin commented 1 year ago

Another option to get at a notion of commercial value is to use Chrome data. Chrome can determine (in many cases) when a user navigates due to an ad click. We could look at (in an aggregated, differentially private form) the topics of the ad landing pages, and note their frequency. More common landing page topics would be deemed more commercially valuable. Obviously this is imperfect (it excludes topics that may gravitate toward brand ads that don't necessarily require clicks, Chrome's heuristics miss some ads, infrequent ad categories could have huge value), but I'm confident that it would be closer to real commercial value than simple inverse topic frequency across all pages.
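The landing-page idea could be sketched as below. This assumes a hypothetical input where each ad click is already labeled with the topics of its landing page; the function name, topic names, and data are illustrative, and the aggregation and differential-privacy steps mentioned above are left out.

```python
from collections import Counter

def landing_page_topic_value(ad_click_landing_topics):
    """Return each topic's share of all topic labels seen on ad landing
    pages, used here as a rough proxy for commercial value."""
    counts = Counter(
        topic for topics in ad_click_landing_topics for topic in topics
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {topic: count / total for topic, count in counts.items()}

# Hypothetical data: one list of landing-page topics per ad click.
clicks = [
    ["Auto insurance"],
    ["Auto insurance", "Vehicle Shopping"],
    ["Vegan Cuisine"],
    ["Auto insurance"],
]

value = landing_page_topic_value(clicks)
print(max(value, key=value.get), value)
```

Topics that rarely appear on landing pages get a low share here, which is exactly the caveat about infrequent but valuable ad categories.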