patcg-individual-drafts / topics

The Topics API
https://patcg-individual-drafts.github.io/topics/

Prevent or limit selective training by callers #225

Closed dmarti closed 1 year ago

dmarti commented 1 year ago

Topics API originally required an exchange of user topics data for user topics data: if a caller wanted to see Topics API data pertaining to a user, the caller had to allow for Topics API data to be collected on the current site. However, this is no longer in effect for all parties. Since https://github.com/patcg-individual-drafts/topics/pull/80 it has been possible for callers to retrieve topics data for a user without allowing the browser to observe topics.

A third-party caller can selectively pass {observe:false} in order to optimize the Topics API data collected for a given user. Because the caller knows in advance which topics the browser would assign to a given site, the caller can choose to pass {observe:false} on sites that would yield lower-value topics. The result would be that the selective caller's audience would have disproportionately high-value sets of topics, resulting in higher revenue for that caller and pressure on other third-party callers to also observe more selectively.
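A minimal sketch of the selective decision described above, in TypeScript. It does not call the Topics API; `expectedTopicsBySite`, `lowValueTopics`, and `chooseCallOptions` are hypothetical names, the hostnames are placeholders, and which topics count as low-value is an assumption for illustration. The option is spelled `observe` as in this thread.

```typescript
// Hypothetical sketch: none of these names come from the Topics API itself.
type TopicId = number;

interface CallOptions {
  observe: boolean; // spelled as in this thread: {observe:false}
}

// Assumed lookup: the topics the caller expects the browser to assign to a site.
const expectedTopicsBySite: Record<string, TopicId[]> = {
  "poetry.example": [102], // /Books & Literature/Poetry
  "rentals.example": [622], // .../Vacation Rentals & Short-Term Stays
};

// Assumed per-caller judgment of which topics are low-value (illustrative only).
const lowValueTopics = new Set<TopicId>([102]);

// Decide whether to let the browser observe this page for the caller.
function chooseCallOptions(hostname: string): CallOptions {
  const topics = expectedTopicsBySite[hostname] ?? [];
  const hasLowValueTopic = topics.some((t) => lowValueTopics.has(t));
  return { observe: !hasLowValueTopic };
}
```

With these assumed tables, `chooseCallOptions("poetry.example")` opts out of observation while `chooseCallOptions("rentals.example")` does not, which is the asymmetry this issue is about.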

Meanwhile, a publisher site does not have the same level of control: an intermediary can selectively train on only the highest-value publisher data, but a publisher does not have the ability to optimize how its own audience data is collected and trained on.

There are probably several ways to address the imbalance. Some options:

(This issue is based on a discussion at the 31 Jul 2023 Topics API call. Notes at: https://github.com/patcg-individual-drafts/topics/tree/main/meetings )

michaelkleber commented 1 year ago

Could you explain why this behavior would be a problem?

Suppose an ad tech is anti-poetry, and feels that Taxonomy v1 topic 102, "/Books & Literature/Poetry", is useless to them. They decide to pass {observe:false} on pages about poetry.

Further suppose the user really is interested in poetry, so that it is indeed one of their top-5 interests for the week.

Then in subsequent weeks, the anti-poetry ad tech might call the Topics API and receive no topic at all, while another ad tech on the same page would receive the Poetry topic. This seems pretty much the same as if both ad techs received the Poetry topic, and one ad tech decided to ignore it and not use Poetry in picking which ad to show.
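The scenario can be sketched with a toy model in TypeScript. This is a simplification under stated assumptions, not the browser's algorithm: `ToyBrowser` and its methods are hypothetical, there is a single week with no random topic selection, and the only rule modeled is that a caller receives a topic only if that caller observed it.

```typescript
type TopicId = number;
const POETRY: TopicId = 102; // /Books & Literature/Poetry

interface Caller {
  name: string;
  observe: boolean; // false models passing {observe:false} on this page
}

// Toy model: tracks which topics each caller has observed for one user.
class ToyBrowser {
  private observedBy = new Map<string, Set<TopicId>>();
  private topTopics: TopicId[] = [];

  // Record a page visit; only callers that allow observation are credited.
  visit(siteTopic: TopicId, callers: Caller[]): void {
    for (const c of callers) {
      if (!c.observe) continue;
      const seen = this.observedBy.get(c.name) ?? new Set<TopicId>();
      seen.add(siteTopic);
      this.observedBy.set(c.name, seen);
    }
  }

  setTopTopics(topics: TopicId[]): void {
    this.topTopics = topics;
  }

  // A caller sees only the top topics it has itself observed.
  topicsFor(callerName: string): TopicId[] {
    const seen = this.observedBy.get(callerName) ?? new Set<TopicId>();
    return this.topTopics.filter((t) => seen.has(t));
  }
}

const browser = new ToyBrowser();
browser.visit(POETRY, [
  { name: "antiPoetry", observe: false },
  { name: "other", observe: true },
]);
browser.setTopTopics([POETRY]);

const antiPoetryTopics = browser.topicsFor("antiPoetry"); // empty in this model
const otherTopics = browser.topicsFor("other"); // contains POETRY

// The comparison case: a caller that receives Poetry and ignores it at targeting time.
const otherAfterIgnoring = otherTopics.filter((t) => t !== POETRY);
```

In this model `antiPoetryTopics` and `otherAfterIgnoring` are both empty, which is the equivalence being argued here.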

I hope we can agree that it's fine for an ad tech to ignore unhelpful information at targeting time. So I don't see why ignoring the information earlier in the process, as you describe, is something we should worry about.

dmarti commented 1 year ago

Is there any data that shows that every topic is revenue positive or neutral? In early testing we spotted some topics or combinations of topics that are correlated with lower-than-average ad revenue (roleplaying game topics without parenting topics were really low). A caller would presumably want to skip training on just enough sites to avoid creating any revenue-negative combinations.

Is "622 /Travel & Transportation/Hotels & Accommodations/Vacation Rentals & Short-Term Stays" + "102 /Books & Literature/Poetry" always worth the same as or more than "622 /Travel & Transportation/Hotels & Accommodations/Vacation Rentals & Short-Term Stays" alone?

michaelkleber commented 1 year ago

I have no data of the sort you ask for. But if an ad tech wants to bid less when they see topic X compared to no topic at all — if they find that bidding strategy beneficial for whatever reason — then surely they would want to see topic X precisely so that they could get whatever benefit it confers. So this doesn't seem like a use case for "ban selective calling" at all.

dmarti commented 1 year ago

Can't the party collecting the topics and the party bidding for the impression be different, though? Seems impractical for every possible bidder to check what selective training is being done by every possible caller. Unless every bidder is also a caller?

michaelkleber commented 1 year ago

Since Topics is relatively new, I would not even expect that the folks planning to experiment are sure of the long-term answer to which parties will end up being callers.

But let's guess that SSPs get topics and pass them to DSPs (which seems very reasonable). Every SSP is already on a different set of websites, and so every SSP's topics will already be affected by what they happen to observe. How is an SSP's selective use of {observe:false} any different?

dmarti commented 1 year ago

That's a good point. I think you're right on this one -- the SSP should be able to call Topics API in an optimized way, in order to collect and present the topics data for a particular user that they believe to be most likely to attract a high bid.

It looks like we can close this issue and focus on #92 -- since SSPs can retrieve topics without observing in order to optimize their results, making the same option available to publishers seems like a better direction.

If nobody has any objections I'll close this.

dmarti commented 1 year ago

Closing, will comment at #92