patcg-individual-drafts / topics

The Topics API
https://patcg-individual-drafts.github.io/topics/

Treat sites with misrepresented or obfuscated ads as opted out #61

Open dmarti opened 2 years ago

dmarti commented 2 years ago

When a site is offering misrepresented or obfuscated ad inventory, treat it as opted out of the Topics API. Common obfuscation and misrepresentation practices include:

- representing ad inventory as coming from a site other than the one the user is actually visiting
- hiding the site's domain from validator services (for example, listing the seller as "confidential" in sellers.json)

Sites using these and similar practices should be treated as opted out of Topics API in order to protect users.

Users and web site authors have a shared interest in limiting the leakage of audience data (and with it, ad revenue) from high-engagement, high-reputation sites to fraudulent, illegal, or otherwise harmful sites. Even if a user is not personally identifiable from a general audience data leak, the user has an interest in receiving the highest possible quantity and quality of ad-supported content in exchange for their investment of attention and computing resources, and so has an interest in limiting the use of data related to them.

While classifying sites by content or engagement level requires extensive language, culture, and subject matter knowledge, it is possible to automatically detect when a site is offering misrepresented or obfuscated ad inventory (represented as from a different site, or with the site domain hidden from validator services). Industry-standard ads.txt and sellers.json files already include this information. Sites that wish to begin using Topics API could enable it by making appropriate information available to users, advertisers, and the service providers who work for them.
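For readers less familiar with these files: ads.txt on the publisher's domain lists the advertising systems and account IDs authorized to sell its inventory, and each advertising system's sellers.json says who each account ID belongs to. A minimal, invented example of a matching pair of records:

```text
# https://news.example/ads.txt
adsystem.example, 12345, DIRECT

# entry for account 12345 in https://adsystem.example/sellers.json
{"seller_id": "12345", "domain": "news.example", "seller_type": "PUBLISHER"}
```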

(Previous discussion at #58)

michaelkleber commented 2 years ago

it is possible to automatically detect when a site is offering misrepresented or obfuscated ad inventory

What are you thinking of here, Don? I understand you're proposing something of the form "If the site has property X, then Topics should be disabled", but I don't know how the browser can tell property X for the site.

Also, if your focus is sites that misrepresent or obfuscate themselves, then I'm not sure why the presence or absence of the Topics API is particularly helpful: wouldn't you expect such sites to just lie about their Topics as well?

dmarti commented 2 years ago

@michaelkleber The browser can check the ads.txt file on the current site against the corresponding sellers.json file(s) to detect mismatched account IDs, or missing ("confidential") domains in sellers.json.

Property X is "missing or inconsistent data detected in this site's ads.txt, or in a corresponding sellers.json." (Today, these files are crawled and checked by intermediaries. As part of moving the advertising market into the browser, the browser will need to be the one to check them in order to avoid losing ground on transparency and brand safety.)
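Concretely (with invented records), the inconsistencies being checked for look like this:

```text
# news.example/ads.txt claims account 12345 at adsystem.example:
adsystem.example, 12345, DIRECT

# Mismatch: adsystem.example/sellers.json says that account belongs to another domain
{"seller_id": "12345", "domain": "other-site.example"}

# Obfuscation: the domain is withheld from validators entirely
{"seller_id": "12345", "is_confidential": 1}
```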

This issue isn't about a problem site changing its own topics, it's about a user denying a problem site the ability to get access to the user's own topics. (As I understand the current state of the proposal, a site can't directly change its own topics, it would need to add content to fool the classifier.)

dmarti commented 2 years ago

Another useful example here: Google Served Ads on Sanctioned Russia-Linked Websites: Report

A significant problem for Topics API is leakage of audience information from sites that advertisers would choose to sponsor to sites that they would not choose to sponsor (for brand safety, sanctions compliance, or other reasons). There are some promising industry-standard approaches to dealing with this problem, but they depend on applying some basic checks before transferring commercially valuable data.

michaelkleber commented 2 years ago

The browser doesn't have any notion of "account ID", though — that's a question about how money moves around between different parties, but the browser only knows about domains and hostnames. I could imagine trying to bridge that gap — but also, browsers are slow movers here; I don't think the IAB wants any potential future changes to ads.txt or sellers.json to be forced to proceed at the pace of browser standardization or implementation.

Maybe this is much less of a problem with Topics, because (unlike FLoC) in Topics the only domains that can get information are ones that were already present as third parties on the sites that information was sourced from?

It seems to me that an ad tech company that is in a position to usefully call the Topics API is also in a position to check that the site they are on has all the appropriate ads-ecosystem-defined files in place. And if there are companies that don't perform the ads.txt kind of check you're talking about, then reputable publishers could decline to include them as 3p's on their site.

dmarti commented 2 years ago

The browser doesn't have to understand what "account id" means -- for this purpose it is just an opaque identifier to connect entries in ads.txt and sellers.json.

Not all sites update their ads.txt and sellers.json to new versions at the same time, and consumers of the files gain the ability to parse new features at different times. So RTB participants have to be able to handle a range of versions (kind of like HTML). If a new version of either industry standard comes out, it will still work with old browsers. IAB would not have to wait on browser support.

Topics API only transfers information across domains where the same third party is present, but widely used third parties are likely to exchange topics information across many sites, including both sanctioned and non-sanctioned sites. It would be difficult for all individual non-sanctioned publishers to pull a commonly used third-party script off their sites just because that same script is in use on a sanctioned site--or for publishers to take some kind of action to stop the third party from dealing with sanctioned sites. It's less disruptive to have the browser check for known patterns of misrepresentation or obfuscation.

michaelkleber commented 2 years ago

My continued apologies, since I clearly still don't understand some deep part of what you are proposing.

There are two different things, (A) + (B), that it sounds like you want a browser to do. The combination of the two of them is intended to incentivize the adoption of some IAB-standardized ads ecosystem transparency measures. That goal seems like it may be reasonable (though I acknowledge that there could well be contrary points of view within the ads world that I'm not aware of).

Thing (A) is that you want browsers to check some internal consistency requirement involving the /ads.txt file on the domain on which ads may appear, and some /sellers.json file on other domains. It would be great if you could link to something that algorithmically described what these consistency checks would involve. At a quick glance it seems like understanding /sellers.json might involve parsing hundreds of KB of JSON from domains that the browser would not otherwise contact, so of course there is cost here, but without an algorithmic description of the checks you're proposing, I don't really understand how substantial that might be.

Thing (B) is that you want the Topics APIs to be blocked unless these consistency checks pass, but I still don't understand how "block the Topics API" would incentivize the behavior you want. A malicious site could modify its JS environment to supply a fake Topics API interface; a malicious ad tech middleman could claim the site had provided a particular set of topics when it in fact provided none at all. The one thing that blocking the Topics API would accomplish is making it impossible for the non-ads.txt-compliant site foo.example to have any impact on the user's topics observed later on bar.example! But this does not seem to be at all related to the incentives you're discussing.

Maybe part (B) would work better as a feature request for the FLEDGE API, where if the API is unavailable then certain demand (FLEDGE Interest Groups) simply cannot buy on the page at all?

If my understanding here continues to be far off base, then maybe we should admit GitHub-issue defeat, and take this up interactively in a future WebAdvBG call.

dmarti commented 2 years ago

@michaelkleber Yes, I agree there should be an agenda item about the general issue of how in-browser ads can implement the same kinds of checks, intended to prevent good ads from ending up on bad sites, that are now possible in conventional web RTB. I will go ahead and propose one before the next meeting, thank you.

(These checks are a moving target -- since the start of the ad-related business and community groups at W3C, both the features and the adoption of those industry standards have made progress. In-browser ads that merely work better than 2019 RTB ads will still not be up to the level of the higher-end RTB ads of 2022.)

A malicious site could modify its JS environment to supply a fake Topics API interface; a malicious ad tech middleman could claim the site had provided a particular set of topics when it in fact provided none at all.

Yes, a site can either put a JS function on the page to supply its own chosen topics or allow callers on the page to pass topics as returned from the browser. It is hard for an advertiser (or most intermediaries) to know in advance whether they are getting browser topics or site-supplied topics from a site. But an advertiser can choose to use a service that validates ads.txt and sellers.json. If the browser is known to validate, and validation by a service fails, the advertiser has a good idea in advance that any topics on that site are likely to be site-supplied and misrepresented (because they could not have come from the browser).
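To illustrate that risk concretely, a page can shadow the API before any ad-tech script runs. This is a hypothetical sketch; the exact shape of the objects returned by `document.browsingTopics()` is defined by the spec and implementation, and may differ from what is shown here.

```ts
// Hypothetical: a page replacing the Topics API with its own answers, so
// callers on the page receive site-chosen topics instead of the browser's.
(document as any).browsingTopics = async () => [
  { topic: 239, version: "chrome.1:1:2" }, // whatever topic the site wants to claim
];
```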

In its current form, Topics API is likely to train advertisers to try to run on more illegal/sanctioned/brand-unsafe sites because it offers tempting partial reinforcement -- a chance of a reward in the form of getting a high-value user on a low-value site. An important part of making in-browser adtech work constructively is to design the reward matrix to put as little reward as possible into this "good user"/"bad site" box.

One set of validation rules: Supply Chain Validation for Publishers

For discussion purposes, here is a simple first pass at a validation algorithm that the browser could do:

Fetch ads.txt for the domain of the current page.

For each line:
    Check that the line is a valid comma-separated record; if not, go to the next line.
    Parse the first two comma-separated fields (advertising system domain, account ID).
    Fetch the sellers.json for the advertising system domain and parse the JSON entry for the given account ID.
    If the domain in the sellers.json entry matches the domain of the current page, continue.
    Otherwise, raise a validation error and stop.

The obvious problem is that sellers.json for a widely used third party can be very large. However, all that the browser needs in order to validate a new domain is a single record, a few tens of bytes, and domains with large sellers.json files are likely to have the programming skill and infrastructure to serve individual entries, for example by populating DNS TXT records, so that the browser can look up one entry instead of fetching the whole file.
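To make the shape of that check concrete, here is a hedged TypeScript sketch of the first-pass algorithm above. It is not a normative implementation: the field names follow the published ads.txt and sellers.json specs, but real files include comments, variables, and optional fields that a production validator would have to handle, and the large-file optimization is left out because no per-record lookup is standardized.

```ts
// Sketch only: mirrors the simple first-pass algorithm above.
// validateSupplyChain("news.example") resolves to false on the first
// missing, confidential, or mismatched sellers.json record it finds.

interface SellerRecord {
  seller_id?: string;
  domain?: string;
  is_confidential?: number; // 1 means the seller's identity is withheld
}

interface SellersJson {
  sellers?: SellerRecord[];
}

async function validateSupplyChain(pageDomain: string): Promise<boolean> {
  const adsTxtResp = await fetch(`https://${pageDomain}/ads.txt`);
  if (!adsTxtResp.ok) return false; // no ads.txt to validate against
  const adsTxt = await adsTxtResp.text();

  for (const rawLine of adsTxt.split("\n")) {
    // Drop comments; skip blank lines and key=value variable lines.
    const line = rawLine.split("#")[0].trim();
    if (!line || !line.includes(",")) continue;

    // First two fields: advertising system domain, account ID.
    const [adSystemDomain, accountId] = line.split(",").map((f) => f.trim());
    if (!adSystemDomain || !accountId) continue;

    // Fetch the advertising system's sellers.json and find the account entry.
    const sellersResp = await fetch(`https://${adSystemDomain}/sellers.json`);
    if (!sellersResp.ok) return false;
    const sellersJson: SellersJson = await sellersResp.json();
    const record = (sellersJson.sellers ?? []).find(
      (s) => s.seller_id === accountId
    );

    // The obfuscation patterns this issue describes: missing entry,
    // "confidential" (hidden) domain, or a domain that does not match
    // the page the ad inventory is actually on.
    if (!record) return false;
    if (record.is_confidential === 1 || !record.domain) return false;
    if (record.domain.toLowerCase() !== pageDomain.toLowerCase()) return false;
  }
  return true;
}
```

In a real browser the sellers.json fetches would be cached and rate-limited, and, as noted above, very large files could be replaced by a per-record lookup (such as the DNS TXT idea) if something like that were ever specified.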

bmayd commented 2 years ago

Although I agree in principle that those who aren’t respecting the rules should not profit by taking advantage of those who are, I am rather uncomfortable with the notion of the user-agent taking responsibility for assessing compliance and imposing sanctions.

It concerns me in part because it appropriates the user-agent to police the ad-tech ecosystem, imposing the cost of enforcing its standards on users; I think wherever possible ad-tech should police itself independent of users.

An associated concern is that doing the sort of validation described in this issue increases, potentially by a very significant amount, the number of interactions and entities interacted with per page load, all of which increases opportunities for gathering data.

It also concerns me to put user-agents in the position of having to support potentially many standards not directly related to the primary mission of providing access to web content and equally to put standards in the position of relying on browsers. Each interest potentially adds significant complexity to the other, with all the problems complexity introduces; and, as Michael suggests, linking them means each will be forced to proceed at the pace of the other; both concerns recommend keeping interdependencies to an absolute minimum.

All that said, if we do pursue this or similar proposals, I suggest that we make better use of any work done by browsers to assess compliance by publishing a signal indicating what is out of compliance and allowing for a range of responses, not just a limited set hardwired by the browser.

dmarti commented 2 years ago

@bmayd The problem is that the user agent is taking on some of the decision-making that is currently done in other places. If functionality moves into the user agent but standards checking does not, then the average level of standards compliance experienced by a user would come down.

Google Chrome is already policing the adtech ecosystem for performance and annoyance violations. Standards enforcement would be a next step on a path that's already being followed, not a new initiative.

It will be much easier to convince users to consent to in-browser ads if browser developers can assure users that the in-browser ads are comparable to, or better than, existing ads in the areas of sanctions compliance and other sensitive issues.

dmarti commented 2 years ago

Related agenda item: https://github.com/patcg/meetings/issues/49