privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Add crawler functionality for identifying sites' usage of GPP 1.0 vs 1.1 and write to database #110

Closed: patmmccann closed this issue 1 month ago

patmmccann commented 1 month ago

GPP 1.0 is no longer supported. If a site is broadcasting a GPP 1.0 signal, other entities on the page (e.g., Prebid.js or Google Ad Manager) generally will not understand it. You should just fail any site providing an API that no one understands. At Prebid, we're removing support for reading GPP 1.0 signals entirely, and GAM already has.

SebastianZimmeck commented 1 month ago

Thanks, @patmmccann!

@katehausladen already looked into this issue. We will further evolve our code to reflect people's move from GPP v1.0 to v1.1.

@franciscawijaya, can you take the lead on this issue and implement the functionality @katehausladen described, as outlined below, for our June crawl? @Mattm27, can you work with @franciscawijaya as needed to bounce ideas around and discuss? And, @katehausladen, it would be great if you were available for any remaining questions that @franciscawijaya and @Mattm27 have and for general observations to make sure we are not making any mistakes here 😄.

What we need before the June crawl is logic for:

  • Identifying whether a site implements GPP v1.0 or v1.1
  • Unpacking and storing both GPP v1.0 and v1.1 values in our database (let's write a new column to our crawl data after the z column "gpp_version")

@katehausladen prepared this move already in the analysis.js. So, @franciscawijaya, that file is a good starting point together with @katehausladen's description. Please go ahead and create a new issue-110 branch to start the implementation ...

Once we have a record of which site is using which version, we can interpret the results in our analysis after the crawl accordingly.

patmmccann commented 1 month ago

As an additional reference, Prebid's removal of GPP 1.0 support has been merged but not yet released: https://github.com/prebid/Prebid.js/pull/11461

Thanks!

patmmccann commented 1 month ago

You can see if GAM understood the GPP string by looking in the payload of the network requests it makes to itself. This is an example of a success on the call (filter to "gampad" in the Network tab; screenshot attached).

However, you might often see errors in this location for sites using GPP 1.0, which GAM and its recipients are treating as opt in (the same as no signal).
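For the crawler, a check like this could in principle be automated. Below is a minimal sketch, assuming a WebExtension background script with `webRequest` and host permissions; the doubleclick host pattern and the `gpp`/`gpp_sid` query parameter names are assumptions based on typical gampad ad calls, not something specified in this thread.

```js
// Sketch: watch outgoing GAM ad calls and log whether they carry a GPP string.
// Host pattern and parameter names are assumptions; adjust to what the Network tab shows.
browser.webRequest.onBeforeRequest.addListener(
  (details) => {
    const url = new URL(details.url);
    if (url.pathname.includes("gampad")) {
      const gpp = url.searchParams.get("gpp");        // GPP string GAM received, if any
      const gppSid = url.searchParams.get("gpp_sid"); // applicable section IDs, if any
      console.log("gampad call:", { gpp, gppSid });
    }
  },
  { urls: ["*://securepubads.g.doubleclick.net/*"] }
);
```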

franciscawijaya commented 1 month ago
  • Identifying whether a site implements GPP v1.0 or v1.1

After reading more about the GPP string and the CMP API, it seems that the component that was updated to version 1.1 is the CMP API, which exposes the information in the GPP string.

(screenshots attached)

I have also confirmed with Kate that our current code looks for version 1.1 first and then 1.0. It then proceeds to store only one string value. While the string value would be the same, the difference lies in how we access the value (i.e., the getGPPdata function in v1.0 versus just the ping function in v1.1). (Reference: https://github.com/privacy-tech-lab/gpc-web-crawler/issues/60#issuecomment-1664621751)

  • Unpacking and storing both GPP v1.0 and v1.1 values in our database (let's write a new column to our crawl data after the z column "gpp_version")

So the values captured by CMP API v1.0 and v1.1 should be the same, given that the only change in the new version is the removal of the getGPPdata function, whose functionality was merged into the ping function.
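For reference, a minimal sketch of the two access patterns (assuming `window.__gpp` is exposed on the page; the field names follow my reading of the CMP API specs and are not guaranteed by this thread):

```js
// CMP API v1.0: getGPPData returns the GPP data object synchronously
// (some implementations also invoke a callback).
const v10Data = window.__gpp("getGPPData");
if (v10Data) {
  console.log("v1.0 GPP string:", v10Data.gppString);
}

// CMP API v1.1: getGPPData is gone; ping delivers the same information via a callback.
window.__gpp("ping", (pingData, success) => {
  if (success) {
    console.log("v1.1 GPP string:", pingData.gppString, "version:", pingData.gppVersion);
  }
});
```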

franciscawijaya commented 1 month ago

You should just fail any site providing an API that no one understands. At Prebid, we're removing support for reading GPP 1.0 signals entirely and GAM already has.

Hypothesis as of now: I think the problem is that these entities are looking for the consolidated ping function from the v1.1 CMP API instead of the getGPPdata return value from v1.0.

(screenshot from https://github.com/privacy-tech-lab/gpc-web-crawler/issues/60#issuecomment-1664621751)

What our code has: our approach right now is to check for both getGPPdata and ping, as discussed in https://github.com/privacy-tech-lab/gpc-web-crawler/issues/60#issuecomment-1662842658

Possible solution: we could completely scrap the getGPPdata check and rely solely on the v1.1 ping CMP API call.

franciscawijaya commented 1 month ago

@patmmccann Could we clarify if, by changes in the GPP versions, you were referring to the changes in the CMP API versions? Currently, according to IAB, there is only one GPP version (1.0) but there are 2 CMP API versions (1.0 and 1.1) [CMP API captures the information of the GPP]

franciscawijaya commented 1 month ago
Action plan on our end:

  • Identifying whether a site implements GPP v1.0 or v1.1
  • Unpacking and storing both GPP v1.0 and v1.1 values in our database (let's write a new column to our crawl data after the z column "gpp_version")

patmmccann commented 1 month ago

@patmmccann Could we clarify if, by changes in the GPP versions, you were referring to the changes in the CMP API versions? Currently, according to IAB, there is only one GPP version (1.0) but there are 2 CMP API versions (1.0 and 1.1) [CMP API captures the information of the GPP]

Yes, the CMP API version 1.1, which we probably should have called 2.0, but oh well. https://github.com/InteractiveAdvertisingBureau/Global-Privacy-Platform/pull/70

cc @lamrowena

patmmccann commented 1 month ago

We will be adding a new column in the crawl data that indicates whether the site uses CMP API v1.0 (one that has a getGPPdata function) or v1.1 (one that only uses ping function)

I suggest you get the version out of the ping response instead of testing for the absence of getGPPdata. Some commercial vendors, e.g., @janwinkler, have backported getGPPdata to assist in transitions yet still conform to the 1.1 spec, and would have a signal recognized by platforms gathering the signal with the newly formatted event listener model.
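One way to follow this suggestion is sketched below, assuming `window.__gpp` is present and that the ping response exposes a `gppVersion` field (per my reading of the spec); this is an illustration, not the crawler's actual injection script:

```js
// Read the version out of the ping response itself, regardless of whether the CMP
// answers via callback (v1.1 style) or via a synchronous return value (v1.0 style).
function detectGppVersion(report) {
  let reported = false;
  const returned = window.__gpp("ping", (pingData, success) => {
    if (success && !reported) {
      reported = true;
      report(pingData.gppVersion); // e.g. "1.1"
    }
  });
  if (returned && !reported) {
    reported = true;
    report(returned.gppVersion); // e.g. "1.0"
  }
}

// Usage: detectGppVersion((v) => console.log("CMP API version:", v));
```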

franciscawijaya commented 1 month ago

Thank you @patmmccann! Your insight was very helpful in guiding the steps that I need to take to enhance the crawler functionality.

A note to self: I've confirmed that our current code does not test the version based on the absence of getGPPdata. Instead, our injection script is modeled on the update to version 1.1 (i.e., the callback takes precedence over a return value, since v1.1 removed return values in favor of callback functions).

This prioritizes the 1.1 spec, since all default GPP commands (including ping and getGPPData) that used return values in v1.0 now use callback functions in v1.1. v1.0 implementations, on the other hand, return values as expected, with some also executing the callback and some not. Hence, sites fall into three categories:

  • v1.1: callback only
  • v1.0: executes the callback and returns a value
  • v1.0: returns a value only
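A rough sketch of how that three-way classification could be probed (a hypothetical helper; it mirrors the "callback takes precedence" idea described above rather than the exact injection-script code):

```js
function classifyGppImplementation(report) {
  let callbackFired = false;
  const returned = window.__gpp("ping", () => {
    callbackFired = true;
  });
  // Give an asynchronous callback a moment to arrive before classifying.
  setTimeout(() => {
    if (callbackFired && returned === undefined) {
      report("v1.1: callback only");
    } else if (callbackFired) {
      report("v1.0: executes callback and returns value");
    } else if (returned !== undefined) {
      report("v1.0: returns value only");
    } else {
      report("no usable GPP response");
    }
  }, 500);
}
```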

(screenshot attached)

In order to add the column to the crawl data, I believe these are the steps I should take (a rough sketch of step 1 follows this list):

1) Add a new column in the rest-api (in the app.post in index.js)

2) Explicitly collect the data on the different versions (which I believe is already analyzed under the runAnalysis function; right now it is only posted to debug)

(screenshot attached)

3) Log/store the data in the extension while a site is being analyzed (this is what analysis_userend[domain] is for)

4) Populate the database with the collected information.
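A rough sketch of step 1, under the assumption that the rest-api uses Express with a MySQL connection; the route, table name, and the site_id/domain columns are placeholders (only gpp_before_gpc, gpp_after_gpc, and gpp_version come from this thread):

```js
// index.js (rest-api) -- extend the INSERT with the new gpp_version column.
app.post("/analysis", (req, res) => {
  const entry = req.body;
  const sql =
    "INSERT INTO entries (site_id, domain, gpp_before_gpc, gpp_after_gpc, gpp_version) " +
    "VALUES (?, ?, ?, ?, ?)";
  connection.query(
    sql,
    [entry.site_id, entry.domain, entry.gpp_before_gpc, entry.gpp_after_gpc, entry.gpp_version],
    (err) => (err ? res.sendStatus(500) : res.sendStatus(200))
  );
});
```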

franciscawijaya commented 1 month ago

Logs/Update on adding the new column:

  1. Added a new variable for gpp_version before and after GPC in index.js (which should add a new column to the Crawl Data)
  2. Called logData for the gpp_version under runAnalysis and haltAnalysis (which should collect the data for gpp_version)
  3. Added the two new variables under the analysisUserendSkeleton and analysisDataSkeletonFirstParties functions
  4. Under the logData function, wrote an if statement for gpp_version that parses the value and puts it into the gpp_version_before_gpc and gpp_version_after_gpc objects (a rough sketch follows this list)
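The sketch referenced in item 4, with guessed variable and key names (a "have we sent GPC yet" flag is assumed; the real analysis.js may track this differently):

```js
// Inside logData: route a detected version into the before/after objects.
if (command === "GPP_VERSION") {
  if (!analysis_userend[domain].sent_gpc) {
    analysis_userend[domain].gpp_version_before_gpc = data; // e.g. "1.1"
  } else {
    analysis_userend[domain].gpp_version_after_gpc = data;
  }
}
```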

A side note: while figuring out the code for the addition of the gpp_version column, I also encountered some questions about a few functions in analysis.js that I need to clarify; I am currently asking Kate about them.

Next step: I will repackage the gpc-analysis-extension into an .xpi file and test the extension locally before making a commit.

SebastianZimmeck commented 1 month ago

Excellent!

franciscawijaya commented 1 month ago

Update: after successfully repackaging it into an .xpi file, I ran the analysis. Unfortunately, it gave me null values for the GPP version. I also tried debugging with the debug column, and the code actually manages to detect which GPP version the site is using (e.g., in the example attached, it detected v1.1, shown above the 'empty' entry); however, it still fails to store and print it in the analysis column.

(screenshot attached)

I am currently taking another approach in the logic: instead of checking the version both before and after the GPC signal is sent, I'm writing the code to collect just one gpp_version value (regardless of whether it is captured before or after the GPC signal).
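A minimal sketch of the simplified approach, with the same caveat that the names are illustrative rather than the actual analysis.js code:

```js
// Keep one gpp_version per domain: record it the first time any version is seen,
// whether before or after the GPC signal.
if (command === "GPP_VERSION" && analysis_userend[domain].gpp_version == null) {
  analysis_userend[domain].gpp_version = data;
}
```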

franciscawijaya commented 1 month ago

I've successfully added the code to identify the GPP version a site is using, collect that data, and store it in the new column (gpp_version). The result of the crawler on a site is attached below.

(screenshot attached)

Next step: I tested on 2 sites while writing the code. I will now test on a slightly bigger sample (10-20 sites) to ensure that the GPP version each site uses is recorded properly.

SebastianZimmeck commented 1 month ago

Excellent! Well done, @franciscawijaya!

franciscawijaya commented 1 month ago

Using the April crawl data, I tested the crawl on sites that output GPP strings (as tested in April) to check gpp_version. Of the 20 sites that I picked from the data, it seems that all of them used v1.1, and that data is reflected accurately in the gpp_version column. I also tested on sites that do not output GPP strings before or after the GPC signal is sent, and, as expected, the column reflects a null value for gpp_version, since their gpp_before_gpc and gpp_after_gpc also output null values.

In my testing and debugging of 20 sites, I have yet to encounter a site (among those crawled and identified as having a GPP string in the April crawl) that uses v1.0. I'm not sure if this indicates and confirms that most sites have switched to v1.1.

While I'm thinking of continuing my manual testing of other sites from the site list that had gpp strings in April Crawl to make sure of this switch, I wonder if there is a way for me to get a hold of sites that are still using v1.0 right now and test those sites out, instead of going through our site list.

SebastianZimmeck commented 1 month ago

I wonder if there is a way for me to get a hold of sites that are still using v1.0 right now and test those sites out, instead of going through our site list.

I tried searching BuiltWith to find sites with GPP, but it does not detect GPP. Maybe there are similar lead generation sites that do, though.

Another option may be to try the Internet Archive and Archive.today to see if they store sites with all their third parties.

It is also possible to create your own site with GPP v1.0. But let's not go there unless it is absolutely necessary.

SebastianZimmeck commented 1 month ago

Other than that, Google search for GPP v1.0 code snippets may get some relevant search results.

franciscawijaya commented 1 month ago

@patmmccann Hello! Would you mind sharing any sites that were still using v1.0 when you came across this issue? The sites in my sample set seem to have switched to v1.1, but I'm still looking to test our crawler on sites that use v1.0. I would greatly appreciate any help. Thank you in advance!

franciscawijaya commented 1 month ago

I've tried using the Wayback Machine (web.archive.org) to check sites from before the release of GPP v1.1 to see if I could test the GPP version (which should be v1.0). However, I was not able to do this, as the Wayback Machine does not seem to store sites with their third parties. I confirmed this by comparing the current site and the archived version in the web console, which showed that the current site stores a GPP string while the archived version does not.

(screenshots attached)

I have also tried testing on some other sites on our crawl list, but I have yet to encounter v1.0. For now, I think we can go ahead and use this code for the June crawl.

Next step: I will soon open a request to merge this branch into main, ask Matt to help review the code on his end via a local test, and then close this issue. I will also be preparing for the crawl and plan to begin by this weekend.

SebastianZimmeck commented 1 month ago

Sounds all good, @franciscawijaya!

patmmccann commented 1 month ago

I am having trouble tracking down some of the old gpp implementations at the moment. Perhaps other outreach has been quite successful!

franciscawijaya commented 1 month ago

Merged to the main branch!