Closed: patmmccann closed this issue 1 month ago
Thanks, @patmmccann!
@katehausladen already looked into this issue. We will further evolve our code to reflect people's move from GPP v1.0 to v1.1.
@franciscawijaya, can you take the lead on this issue and implement the functionality @katehausladen described, as outlined below, for our June crawl? @Mattm27, can you work with @franciscawijaya as needed to bounce ideas off each other and discuss? And, @katehausladen, it would be great if you were available for any questions that @franciscawijaya and @Mattm27 still have, and for general observations to make sure we are not making any mistakes here 😄.
What we need before the June crawl is logic for:
- Identifying whether a site implements GPP v1.0 or v1.1
- Unpacking and storing both GPP v1.0 and v1.1 values in our database (let's write a new column to our crawl data after the z column "gpp_version")
@katehausladen already prepared this move in analysis.js. So, @franciscawijaya, that file is a good starting point together with @katehausladen's description. Please go ahead and create a new issue-110 branch to start the implementation.
Once we have a record of which site is using which version, we can interpret the results in our analysis after the crawl accordingly.
As additional reference: Prebid's removal of GPP 1.0 has been merged but not yet released: https://github.com/prebid/Prebid.js/pull/11461
Thanks!
You can see if GAM understood the GPP string by looking in the payload of the network requests it makes to itself. This is an example of a successful call (filter to gampad in the network tab).
However, you might often see errors in this location for sites using GPP 1.0, which GAM and its recipients treat as opt in (the same as no signal).
> - Identifying whether a site implements GPP v1.0 or v1.1
After reading more about the GPP string and the CMP API, it seems that what was updated to version 1.1 is the CMP API, which captures the information of the GPP string.
I have also confirmed with Kate that our current code looks for version 1.1 first and then 1.0. It then proceeds to store only one string value. While the string value would be the same, the difference lies in how we access the value (i.e., the getGPPdata function in v1.0 vs. just the ping function in v1.1). (Reference: https://github.com/privacy-tech-lab/gpc-web-crawler/issues/60#issuecomment-1664621751)
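To make the two access paths concrete, here is a hedged sketch (not the crawler's actual injection script) of reading the GPP string under either CMP API version. The field names follow the IAB spec's PingReturn/GPPData objects; the surrounding wiring and fallback order are assumptions for illustration.

```javascript
// Sketch: read the GPP string under either CMP API version.
// Pass window.__gpp as gppApi when running in a page context.
function readGppString(gppApi, callback) {
  // v1.1 merged getGPPData into ping: the PingReturn carries gppString.
  gppApi("ping", (pingData, success) => {
    if (success && pingData && pingData.gppString !== undefined) {
      callback(pingData.gppString, pingData.gppVersion);
    } else {
      // v1.0 fallback: getGPPData delivers the string via its callback.
      gppApi("getGPPData", (gppData, ok) => {
        callback(ok && gppData ? gppData.gppString : null, "1.0");
      });
    }
  });
}
```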
> - Unpacking and storing both GPP v1.0 and v1.1 values in our database (let's write a new column to our crawl data after the z column "gpp_version")
So, values captured by CMP API v1.0 and v1.1 should be the same, given that the only change in the version is the removal of the getGPPdata function and the merging of its functionality into the ping function.
> You should just fail any site providing an API that no one understands. At Prebid, we're removing support for reading GPP 1.0 signals entirely and GAM already has.
My hypothesis as of now: I think the problem is that these entities look for the newly merged ping function from the v1.1 CMP API instead of the getGPPdata value from v1.0.
from https://github.com/privacy-tech-lab/gpc-web-crawler/issues/60#issuecomment-1664621751
What our code has: our approach right now is to check for both getGPPdata and ping, as discussed in issue-110: https://github.com/privacy-tech-lab/gpc-web-crawler/issues/60#issuecomment-1662842658
Possible solution: we could completely scrap the getGPPdata function and solely use the v1.1 ping CMP API call.
@patmmccann Could we clarify if, by changes in the GPP versions, you were referring to the changes in the CMP API versions? Currently, according to IAB, there is only one GPP version (1.0) but there are 2 CMP API versions (1.0 and 1.1) [CMP API captures the information of the GPP]
> - Identifying whether a site implements GPP v1.0 or v1.1
> - Unpacking and storing both GPP v1.0 and v1.1 values in our database (let's write a new column to our crawl data after the z column "gpp_version")
Action plan on our end:
> @patmmccann Could we clarify if, by changes in the GPP versions, you were referring to the changes in the CMP API versions? Currently, according to IAB, there is only one GPP version (1.0) but there are 2 CMP API versions (1.0 and 1.1) [CMP API captures the information of the GPP]
Yes, the CMP API version 1.1, which we probably should have called 2.0, but oh well. https://github.com/InteractiveAdvertisingBureau/Global-Privacy-Platform/pull/70
cc @lamrowena
We will be adding a new column in the crawl data that indicates whether the site uses CMP API v1.0 (one that has a getGPPdata function) or v1.1 (one that only uses the ping function).
I suggest you get the version out of the ping response instead of testing for the absence of getGPPdata. Some commercial vendors, e.g. @janwinkler, have backported getGPPdata to assist in transitions yet still conform to the 1.1 spec, and would have a signal recognized by platforms gathering the signal with the newly formatted event listener model.
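A minimal sketch of that suggestion, assuming the standard `__gpp` interface: take the version from the PingReturn's `gppVersion` field rather than feature-testing for getGPPData, since a backported getGPPData on a 1.1 implementation would defeat the feature test.

```javascript
// Sketch: detect the CMP API version from the ping response.
// gppApi is window.__gpp in a page context.
function detectGppVersion(gppApi, callback) {
  if (typeof gppApi !== "function") {
    callback(null); // no CMP API stub on the page
    return;
  }
  gppApi("ping", (pingData, success) => {
    // Per the spec, PingReturn.gppVersion is "1.0" or "1.1".
    callback(success && pingData ? pingData.gppVersion : null);
  });
}
```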
Thank you @patmmccann! Your insight was very helpful in guiding the steps that I need to take to enhance the crawler functionality.
A note to self: I've confirmed that our current code does not test the version based on the absence of getGPPdata. Instead, our injection script is modeled on the update to version 1.1 (i.e., a callback takes precedence over a return value, since v1.1 removed return values in favor of callback functions).
This prioritizes the 1.1 spec, since all default GPP functions (including ping and getGPPData) that used return values in v1.0 now use callback functions in v1.1. In contrast, v1.0 returns values as expected, with some implementations also executing callbacks and some not. Hence, sites fall into 3 categories:
- v1.1: callback only
- v1.0: executes callback and returns value
- v1.0: returns value only
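The three categories could be bucketed roughly as below. This is a sketch, not the extension's actual code, and it assumes the CMP's callback (if any) fires synchronously, which is not guaranteed on real sites.

```javascript
// Sketch: classify a site's __gpp function into the three categories
// above by observing callback vs. return-value behavior on ping.
function classifyGppApi(gppApi) {
  let calledBack = false;
  const returned = gppApi("ping", () => { calledBack = true; });
  if (calledBack && returned === undefined) return "v1.1: callback only";
  if (calledBack) return "v1.0: executes callback and returns value";
  if (returned !== undefined) return "v1.0: returns value only";
  return "unknown";
}
```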
In order to add the column to the crawler data, I believe these are the steps I should take:
1) Add the new column in the rest-api (in the app.post in index.js).
2) Explicitly collect the data on the different versions (which I believe is analyzed under the runAnalysis function; it is actually posted to debug).
3) Log/store the data in the extension while a site is being analyzed (this is what analysis_userend[domain] is for).
4) Populate the database with the collected information.
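For step 1, a hypothetical sketch of what the rest-api's INSERT could look like with the extra column. The table and column names here are assumptions for illustration, not the repo's exact schema; a missing field defaults to null, matching sites with no GPP string.

```javascript
// Sketch (assumed schema): build the INSERT used by the rest-api,
// now including the new gpp_version column.
function buildAnalysisInsert(row) {
  const cols = ["domain", "gpp_before_gpc", "gpp_after_gpc", "gpp_version"];
  return {
    sql: `INSERT INTO analysis (${cols.join(", ")}) VALUES (${cols.map(() => "?").join(", ")})`,
    values: cols.map((c) => (row[c] !== undefined ? row[c] : null)),
  };
}
```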
Logs/Update on adding the new column:
A side note: while working on the code for adding the gpp_version column, I encountered some questions about functions in analysis.js that I need to clarify; I am currently asking Kate about them.
Next step: I will repackage the gpc-analysis-extension into an .xpi file and test the extension locally before making a commit.
Excellent!
Update: after successfully repackaging it into an .xpi file, I ran the analysis. Unfortunately, it gave me null values for the GPP version. I also tried to debug using the debug column, and the code actually managed to identify which GPP version the site is using (e.g., in the example attached, it detected v1.1, above the 'empty'); however, it still fails to store and print it in the analysis column.
I am currently taking another approach in the logic: instead of checking the version both before and after the GPC signal is detected, I'm writing the code to collect just one GPP version (regardless of whether it is captured before or after the GPC signal is detected).
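That collapse step could be sketched as follows, with field names assumed for illustration: take whichever probe produced a version, and flag the (unlikely) case where the two probes disagree.

```javascript
// Sketch (assumed field names): collapse the before-GPC and after-GPC
// probes into the single gpp_version value stored in the new column.
function resolveGppVersion(versionBeforeGpc, versionAfterGpc) {
  if (versionBeforeGpc && versionAfterGpc && versionBeforeGpc !== versionAfterGpc) {
    // A site changing CMP API versions mid-visit would be surprising;
    // keep the later reading but flag the row for manual review.
    return { version: versionAfterGpc, conflict: true };
  }
  return { version: versionBeforeGpc || versionAfterGpc || null, conflict: false };
}
```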
I've successfully added the code to identify the GPP version a site is using, collect that data, and store it in the new column (gpp_version). The result of the crawler on a site is attached below.
Next step: I tested on 2 sites while writing the code. I will now test on a slightly bigger sample (10-20 sites) to ensure that the GPP version each site uses is recorded properly.
Excellent! Well done, @franciscawijaya!
Using the April crawl data, I tested the crawl on sites that output GPP strings (as observed in April) to check the gpp_version. All 20 sites I picked from the data used v1.1, and that is reflected accurately in the gpp_version column. I also tested sites that do not output GPP strings before or after the GPC signal is sent; as expected, the gpp_version column shows 'null' for them, since their gpp_before_gpc and gpp_after_gpc values are also 'null'.
In my testing and debugging of 20 sites, I have yet to encounter a site (among those crawled and identified to have a GPP string in the April crawl) that uses v1.0. I'm not sure whether this confirms that most sites have switched to v1.1.
While I'm planning to continue manually testing other sites from the site list that had GPP strings in the April crawl to verify this switch, I wonder if there is a way to get hold of sites that are still using v1.0 and test those, instead of going through our site list.
> I wonder if there is a way for me to get a hold of sites that are still using v1.0 right now and test those sites out, instead of going through our site list.
I tried searching BuiltWith to find sites with GPP, but it does not detect GPP. Maybe there are similar lead-generation sites that do, though.
Another option may be to try the Internet Archive and Archive.today to see if they store sites with all their third parties.
It is also possible to create your own site with GPP v1.0. But let's not go there unless it is absolutely necessary.
Other than that, Google search for GPP v1.0 code snippets may get some relevant search results.
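If creating our own GPP v1.0 site ever becomes necessary, a minimal stub could look like the sketch below. The object shapes approximate the v1.0 spec, and the GPP string is a meaningless placeholder (not a valid encoding); this is only for exercising the crawler's version detection locally.

```javascript
// Hypothetical minimal CMP API v1.0 stub for a local test page.
function makeGppV10Stub() {
  const data = { gppVersion: 1, gppString: "PLACEHOLDER", applicableSections: [] };
  return function __gpp(command, callback) {
    if (command === "ping") {
      return { gppVersion: "1.0", cmpStatus: "loaded" };
    }
    if (command === "getGPPData") {
      if (callback) callback(data, true); // v1.0: callback AND return value
      return data;
    }
  };
}
// On the test page: window.__gpp = makeGppV10Stub();
```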
@patmmccann Hello! Would you mind sharing sites that still used v1.0 when you came across this issue? My sample set of sites seems to have switched to v1.1, but I'm still looking for v1.0 sites to test our crawler against. I would greatly appreciate any help. Thank you in advance!
I've tried using the Wayback Machine to check sites from before the release of v1.1, to see if I could test the GPP version (which should be v1.0). However, this did not work, as the Wayback Machine does not seem to store sites with their third parties. I confirmed this by comparing a current site and its archived version in the web console: the current site stores a GPP string while the archived version does not.
I have also tried testing on some other sites from our crawl list, but I have yet to encounter v1.0. For now, I think we can go ahead and use this code for the June crawl.
Next step: I will soon request to merge this branch into main, ask Matt to help review and confirm the code on his end via a local test, and then close this issue. I will also prepare for the crawl, beginning this weekend.
Sounds all good, @franciscawijaya!
I am having trouble tracking down some of the old gpp implementations at the moment. Perhaps other outreach has been quite successful!
Merged to the main branch!
GPP 1.0 is no longer supported. If a site is broadcasting a GPP 1.0 signal, other entities on the page (e.g., Prebid.js or Google Ad Manager) generally will not understand it. You should just fail any site providing an API that no one understands. At Prebid, we're removing support for reading GPP 1.0 signals entirely and GAM already has.