Scrape ads.txt, app-ads.txt and sellers.json

streitl commented 2 years ago

Background

ads.txt

ads.txt is a mechanism allowing publishers (a.k.a. website or content owners) to specify which parties are authorized to sell their inventory (ad spaces/impressions). Each publisher subdomain can have an ads.txt file, which is a plain text file listing entries of 3 to 4 comma-separated fields. For instance, here's a sneak peek at https://www.lemonde.fr/ads.txt:

#07.12.2020
# Le Monde
appnexus.com, 8253, DIRECT, f5ab79cb980f11d1
appnexus.com, 1608, RESELLER, f5ab79cb980f11d1
appnexus.com, 3500, RESELLER, f5ab79cb980f11d1
appnexus.com, 8494, RESELLER, f5ab79cb980f11d1
appnexus.com, 8499, DIRECT, f5ab79cb980f11d1
appnexus.com, 1314, RESELLER #EBTL
google.com, pub-2366164365855963, RESELLER, f08c47fec0942fa0
google.com, pub-3391936129161967, RESELLER, f08c47fec0942fa0
...

Here's a description of the different fields:

1. (mandatory) the domain name of the advertisement system
2. (mandatory) the ID of the account that the publisher uses on the advertisement system of field 1
3. (mandatory) the relationship between the publisher and the advertisement system:
   a. DIRECT: the publisher directly controls the account (field 2); there are no intermediaries, and there is likely a contract between the publisher and the advertisement system
   b. RESELLER: the publisher has authorized some third party to control the account (field 2) at the system of field 1 and to resell its ad space
4. (optional) an ID that uniquely identifies the advertisement system within a certification authority (e.g. a TAG ID issued by the Trustworthy Accountability Group)
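
To make the field layout concrete, here is a minimal Python sketch of a parser for the data records described above. The names `AdsTxtEntry` and `parse_ads_txt` are made up for illustration, and the sketch deliberately ignores anything that is not a 3- or 4-field record (comments, variable lines such as `contact=...`), so it should be read as a starting point rather than a complete implementation of the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdsTxtEntry:
    """One data record of an ads.txt file (hypothetical helper type)."""
    ad_system: str                           # field 1: domain of the advertisement system
    account_id: str                          # field 2: publisher's account ID on that system
    relationship: str                        # field 3: "DIRECT" or "RESELLER"
    cert_authority_id: Optional[str] = None  # field 4: optional certification authority ID


def parse_ads_txt(text: str) -> list[AdsTxtEntry]:
    """Parse the data records of an ads.txt file, skipping everything else."""
    entries = []
    for raw_line in text.splitlines():
        # Everything after "#" is a comment, including trailing comments.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 3 or len(fields) > 4:
            continue  # not a 3- or 4-field record (e.g. a "contact=..." line)
        relationship = fields[2].upper()
        if relationship not in ("DIRECT", "RESELLER"):
            continue  # unknown relationship: ignored in this sketch
        cert_id = fields[3] if len(fields) == 4 and fields[3] else None
        entries.append(AdsTxtEntry(fields[0].lower(), fields[1], relationship, cert_id))
    return entries
```

Calling `parse_ads_txt` on the text of a downloaded ads.txt file returns one `AdsTxtEntry` per record, which is convenient for the comparisons discussed later in this issue.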

app-ads.txt

There is a similar mechanism called app-ads.txt that allows an app owner to specify the parties authorized to sell their inventory. Basically, the app page on the store that distributes it (e.g. Google Play) points to a domain that contains an app-ads.txt file with the same format as ads.txt. So for instance, on Google Play the application Twitter specifies that its domain is twitter.com and that there is an app-ads.txt there (so at https://twitter.com/app-ads.txt).
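
As a rough illustration of the lookup described above, the following Python sketch derives the app-ads.txt location from the developer website listed on a store page. The function name `app_ads_txt_url` is hypothetical, and the sketch simply reuses the hostname as given; it does not attempt any further normalization of the domain that a real crawler might apply.

```python
from urllib.parse import urlparse


def app_ads_txt_url(developer_website: str) -> str:
    """Build the app-ads.txt URL for the developer website found on a store page."""
    # Accept both "twitter.com" and "https://twitter.com/about".
    if "://" not in developer_website:
        developer_website = "https://" + developer_website
    host = urlparse(developer_website).hostname
    if not host:
        raise ValueError(f"cannot extract a hostname from {developer_website!r}")
    return f"https://{host}/app-ads.txt"
```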

Sellers.json

Another similar mechanism is called Sellers.json, but it is published by an advertisement system and not by a publisher. Each advertisement system domain can have a Sellers.json file that lists the publishers and intermediate exchanges that are authorized to sell inventory through this system. Note that this is a JSON file, and the interesting entries are those under the key "sellers". For instance, the ad exchange Xandr has this file at https://www.xandr.com/sellers.json, and it looks like this:

{
  "contact_email": "sellers-json@xandr.com",
  "version": "1.0",
  "identifiers": [
    {
      "name": "TAG-ID",
      "value": "f5ab79cb980f11d1"
    }
  ],
  "sellers": [
    {
      "seller_id": "74",
      "seller_type": "INTERMEDIARY",
      "domain": "pubmatic.com",
      "name": "PubMatic"
    },
    {
      "seller_id": "181",
      "seller_type": "INTERMEDIARY",
      "domain": "google.com",
      "name": "Google AdExchange"
    },
    {
      "seller_id": "226",
      "seller_type": "INTERMEDIARY",
      "domain": "microsoft.com",
      "name": "Microsoft Media Network"
    },
    ...
  ]
}

So in summary, Sellers.json improves transparency for the advertisement systems and helps prevent fraud, while ads.txt protects the inventory of the publishers.
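
Since the useful part of a Sellers.json file is the list under "sellers", a small Python sketch can index it by `seller_id` for later lookups. The function name `index_sellers` is an assumption made for this example; entries without a `seller_id` are skipped here, which is a simplification.

```python
import json


def index_sellers(sellers_json_text: str) -> dict[str, dict]:
    """Return a mapping seller_id -> seller entry from the text of a sellers.json file."""
    data = json.loads(sellers_json_text)
    index = {}
    for seller in data.get("sellers", []):
        if "seller_id" in seller:
            # IDs are compared as strings, since ads.txt stores them as text.
            index[str(seller["seller_id"])] = seller
    return index
```

With the Xandr excerpt above, `index_sellers(text)["74"]["domain"]` would give `"pubmatic.com"`.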

Idea

The extension could benefit a lot from retrieving ads.txt information from the websites visited by the user, and Sellers.json information from the advertisement systems that bid on the observed ads.

First, doing this could allow us to audit whether the constraints of these specifications are respected (i.e. nobody is selling ad spaces that they are not allowed to sell).
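
One possible shape for such an audit, sketched in Python: take one ads.txt record and check it against the seller list published by the corresponding advertisement system. The function `audit_entry` and the mapping `EXPECTED_SELLER_TYPES` are assumptions made for this example (in particular the pairing of DIRECT with PUBLISHER and of RESELLER with INTERMEDIARY is a heuristic), not rules taken from this issue.

```python
# Assumed pairing between the ads.txt relationship (field 3) and the
# sellers.json "seller_type"; "BOTH" is accepted in either case.
EXPECTED_SELLER_TYPES = {
    "DIRECT": {"PUBLISHER", "BOTH"},
    "RESELLER": {"INTERMEDIARY", "BOTH"},
}


def audit_entry(account_id: str, relationship: str, sellers_by_id: dict) -> list[str]:
    """Return a list of problems found for one ads.txt record (empty if none)."""
    seller = sellers_by_id.get(account_id)
    if seller is None:
        return [f"account {account_id} is not listed in sellers.json"]
    seller_type = str(seller.get("seller_type", "")).upper()
    allowed = EXPECTED_SELLER_TYPES.get(relationship.upper(), set())
    if seller_type not in allowed:
        return [
            f"account {account_id} is declared {relationship} in ads.txt "
            f"but has seller_type {seller_type or 'missing'} in sellers.json"
        ]
    return []
```

Here `sellers_by_id` is expected to map seller IDs (as strings) to the seller objects found under the "sellers" key of the advertisement system's file.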

Also, it could be interesting to keep track of the modifications to these files, as they could reveal interesting insights, e.g. some publisher stops selling ad spaces through a specific advertisement system after it realizes that some of the served ads are from the far-right.
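
Tracking modifications could start from something as simple as comparing two snapshots of the same file. The Python sketch below assumes each snapshot has already been reduced to a set of `(ad_system, account_id, relationship)` tuples; the function name `diff_snapshots` is hypothetical.

```python
def diff_snapshots(old: set[tuple[str, str, str]],
                   new: set[tuple[str, str, str]]) -> dict[str, list[tuple[str, str, str]]]:
    """Compare two snapshots of ads.txt records and report what changed."""
    return {
        "added": sorted(new - old),    # records present only in the newer snapshot
        "removed": sorted(old - new),  # records that disappeared since the older one
    }
```

A record showing up under "removed" would be a hint that the publisher stopped selling through that account.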

Another idea is to use the "topology" information from ads.txt to better understand the relationship between ad price and user targeting.

More ideas will be added later.

loleg commented 2 years ago

Some quick feedback, without knowing too much about this system, or even industry terminology like "inventory" or "ad exchange". This issue is a collection of ideas. To better define the work, I would break it down into issues with specific tasks as checklists. You could also publish this as a concept document, and refer to sections in it inside of the issues. Otherwise the description is clear and readable. With some well-defined tasks, I could see this type of project being easily distributed among multiple developers or even crowdsourced in its execution.

The idea makes it seem like the scraping would need to be done simultaneously. However, more pragmatically you could aggregate the data from a set of publishers, then a set of advertisers, then cross-reference them after the fact. In general, scraping should avoid unnecessary repeat visits by caching the data collected, and refreshing as often as needed (which in this case is probably not very often).
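
The caching idea can be sketched in a few lines of Python. The class `CachedFetcher` is hypothetical; it wraps any function that downloads a URL and only calls it again once the cached copy is older than a chosen time-to-live. The download function and the clock are passed in as parameters, which keeps the sketch independent of any particular HTTP library.

```python
import time
from typing import Callable


class CachedFetcher:
    """Wrap a fetch function so that each URL is re-downloaded at most once per TTL."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: dict[str, tuple[float, str]] = {}  # url -> (fetch time, body)

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._cache.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no repeat visit
        body = self._fetch(url)
        self._cache[url] = (now, body)
        return body
```

Because these files probably change rarely, a TTL of a day or more would likely be enough, though the right value is a guess until some change history has been collected.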

It's interesting that you use topology to describe an idea. The web is a graph of nodes and relationships, so I could see how you might use graph-based databases here, like neo4j (medium post, moma example) or tigergraph (tor example). So, contrary to what I wrote above, it might be more effective not to distinguish too much between publishers and advertisers. Just create nodes and identify what files they serve to classify their data appropriately. You can disaggregate after the fact.
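
Before reaching for a graph database, the node-and-relationship view can be prototyped with plain Python dictionaries. In this sketch (the function `build_graph` and its input shapes are assumptions made for illustration), every domain is a node regardless of its role: ads.txt records add edges from a publisher to an advertisement system, and sellers.json entries add edges from an advertisement system to the seller's domain.

```python
from collections import defaultdict


def build_graph(ads_txt_by_publisher: dict[str, list[tuple[str, str, str]]],
                sellers_by_ad_system: dict[str, list[dict]]) -> dict[str, set[tuple[str, str]]]:
    """Return adjacency sets: node -> set of (neighbour, edge label)."""
    graph = defaultdict(set)
    # Publisher -> advertisement system, labelled with the ads.txt relationship.
    for publisher, entries in ads_txt_by_publisher.items():
        for ad_system, _account_id, relationship in entries:
            graph[publisher].add((ad_system, relationship))
    # Advertisement system -> seller domain, labelled with the seller type.
    for ad_system, sellers in sellers_by_ad_system.items():
        for seller in sellers:
            domain = seller.get("domain")
            if domain:
                graph[ad_system].add((domain, seller.get("seller_type", "UNKNOWN")))
    return dict(graph)
```

The same edge list could later be loaded into a graph database if the in-memory version becomes too limiting.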

I wouldn't say that ads.txt is a mechanism: it's at best a standard. It would be good to think about not just the need to scrape the data, but also to validate it and help improve it as a service to the community. At least, this is what a good standards body like the W3C would do. Speaking of whom, I believe this is a topic of discussion in the Credible Web Community.

Personally I like the initiative here of using open web data to raise transparency and accountability. Just try to do it in a well-organized way and keep a wider perspective to avoid "shooting yourself in the foot".

mvidonne commented 2 years ago

@streitlua something like http://corrupt.marketing/?

pdehaye commented 2 years ago

Yes but this is not open data.

mvidonne commented 2 years ago

@streitlua check the statistics for the top 400 French speaking websites done by the CNIL https://linc.cnil.fr/webpub-adstxt-sellersjson/ads_study.html @fquellec for the dataviz

https://github.com/LINCnil/Ads.txt-et-Sellers.json

under open licence 2.0

foucault-dumas commented 2 years ago

Another interesting tool: https://sellers.guide/

mvidonne commented 2 years ago

@ffsinger is there a way to save properly all the info https://sellers.guide/domain/wikistrike.com

ffsinger commented 2 years ago

I haven't been following the discussions on these issues. Would you like to save the analysis result for a specific domain or a list of domains ? Which results specifically ? In a computer-readable or human-readable format ? For what purpose ?

Depending on the answer to these questions, we could build a (more or less complex) scraper.

foucault-dumas commented 2 years ago

sellers.guide just added an ads.txt cleaner to their page. We just attended an interesting webinar by them with @mvidonne. She summarized it here.

I don't know how to answer @ffsinger's question. The idea is to use ads.txt and sellers.json to add live knowledge to what AdRadar harvests about ads. For example, display (and explain) the links between the intermediary identified as the winning bidder for an ad and other actors of the adtech ecosystem. Can ads.txt and sellers.json (and sellers.guide) be used to do so?

Edit: sellers.guide's slides

pdehaye commented 2 years ago

Looks like it was an interesting seminar!

I am thinking that this needs a strategy session involving @mvidonne, @foucault-dumas and myself, before involving developers more? Along the lines of "AdRadar needs to evolve towards helping find out during regular web browsing interesting situations for which it is worth investigating further the data angle (possibly through SARs)"

foucault-dumas commented 2 years ago

Would you send us an invitation? You have the fullest agenda of all

pdehaye commented 2 years ago

done

foucault-dumas commented 2 years ago

@mvidonne also found this very interesting tool, which is like the Markup tool but more complex (and uglier)

foucault-dumas commented 2 years ago

Isn't that something we should dig into? https://github.com/InteractiveAdvertisingBureau/openrtb/blob/master/supplychainobject.md

hestiaAI / ad-radar