privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Evaluate implementation of human check circumvention functionality in the crawler #135

Open SebastianZimmeck opened 3 weeks ago

SebastianZimmeck commented 3 weeks ago

Before the next crawl (#118), we should look into what caused the divergence of crawl results from the manual analysis results in our most recent analysis. See the red-labeled fields:

[Screenshot, 2024-09-18: crawl vs. manual analysis comparison with diverging fields labeled in red]

Do increased timeouts help? Maybe some sites were not fully loaded before the data was captured. Are there other parameters we can fine-tune to improve accuracy?
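For reference, here is a minimal sketch of what raising timeouts and waiting for full page load could look like in a Selenium-based crawl step; the driver setup, timeout values, and readiness check are illustrative assumptions, not the crawler's actual code:

```python
# Illustrative sketch only: raise the page-load cap and wait for the
# document to finish loading before capturing data, so late-loading
# CMP scripts have a chance to run. Values are hypothetical.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.set_page_load_timeout(30)  # hard cap on page load, in seconds

def capture_after_full_load(url, settle_seconds=10):
    driver.get(url)
    # Wait until the browser reports the document fully loaded.
    WebDriverWait(driver, settle_seconds).until(
        lambda d: d.execute_script("return document.readyState") == "complete"
    )
    # ... capture cookies / USP string here ...
```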

@franciscawijaya will take the lead here and work with @natelevinson10 and @eakubilo before starting the next crawl.

franciscawijaya commented 2 weeks ago

Some of my findings:

1) Optanonconsent after GPC

2) Pridecounseling.com redirect

The good news is that this error is correctly captured in our error logging.

Given that this error is correctly logged and would otherwise only be discoverable through manual access, I think it is still okay to start the crawl. Should we encounter this during the crawl, we are now better informed about a possible reason why Optanonconsent cookies are 0 both before and after GPC on sites that are also flagged with human check errors. I will begin the California crawl shortly.

[Screenshot, 2024-09-23: excerpt of the logged error]
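For context, OptanonConsent is OneTrust's consent cookie, so a before/after comparison like the one above could be spot-checked with a small Selenium helper; the hook for enabling GPC and the reload strategy below are assumptions for illustration:

```python
# Illustrative sketch: read OneTrust's OptanonConsent cookie before and
# after the GPC signal is applied. enable_gpc is a hypothetical hook.
def get_optanon_consent(driver):
    cookie = driver.get_cookie("OptanonConsent")
    return cookie["value"] if cookie else None

def compare_consent(driver, url, enable_gpc):
    driver.get(url)
    before = get_optanon_consent(driver)
    enable_gpc(driver)  # assumed: turns on the GPC signal in the browser
    driver.get(url)     # reload so the site sees the signal
    after = get_optanon_consent(driver)
    return before, after
```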
SebastianZimmeck commented 2 weeks ago

@franciscawijaya, for the Optanonconsent after GPC:

  1. Were some of the sites also in our previous crawls? If so, what were the results then?
  2. Do you get the same results for Colorado, California, another VPN? In other words, is this VPN-dependent?

@natelevinson10 and @eakubilo, if you have a chance, please also look into this issue. It is the most important issue at the moment, and it would be good to have a solid understanding before we start the next crawl.

SebastianZimmeck commented 1 week ago

@natelevinson10 is saying:

I did a quick review of the manual data we collected a couple of weeks ago and targeted the instances of a mismatch (marked in red) where our ground truth differed from the crawl data. I used several VPN locations (California, multiple Colorado, Virginia, and no VPN (CT)) and gave the sites ample time to let all of their content load.

I was not able to find a single instance of our manual data changing from what we had reported, except for bumble.com's USPapi_before being "1YNN" instead of the reported "1YYN", which I would chalk up to a manual error on our end. It seems that for the crawl-to-manual mismatches, the manual data is the more accurate of the two.
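For reference, the uspapi values quoted above come from the IAB's US Privacy API, so a manual spot check in a live browser session can query it directly; this sketch assumes the site exposes `window.__uspapi` and that we are driving the page with Selenium:

```python
# Illustrative sketch: query the in-page US Privacy API (__uspapi) to
# read the uspString (e.g. "1YNN" vs. "1YYN").
def get_usp_string(driver):
    return driver.execute_async_script("""
        const done = arguments[arguments.length - 1];
        if (typeof window.__uspapi !== 'function') { done(null); return; }
        window.__uspapi('getUSPData', 1, (data, success) => {
            done(success && data ? data.uspString : null);
        });
    """)
```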

franciscawijaya commented 1 week ago
  1. Were some of the sites also in our previous crawls? If so, what were the results then?

I wanted to be precise and check the progression of the data across the different rounds of crawls we did. These are the findings:

1) Dickies.com: Dec-April (isGpcEnabled=1) [matches manual data], June crawl (isGpcEnabled=0) [matches recent crawl data] --> high priority: sudden change in the crawl output

2) Altrarunning.com: Dec-June (isGpcEnabled=1) [matches manual data] --> high priority: the crawl output has been accurate since the first crawl, so it is strange that we are getting isGpcEnabled=0 now

3) Smartwool.com: Dec-June (isGpcEnabled=1) [matches manual data] --> high priority: the crawl output has been accurate since the first crawl, so it is strange that we are getting isGpcEnabled=0 now

4) Goodrx.com: NULL for uspapiaftergpc [matches recent crawl data] --> low priority: the difference between crawl and manual data has been present since the first crawl

5) Redrobin.com: Dec-Feb (isGpcEnabled=0) [matches recent crawl data], April-June (isGpcEnabled=1) [matches manual data] --> high priority: it is strange that the crawl output isGpcEnabled=1 after previously outputting 0 but is now back to 0

Analysis: While there are only five overlapping sites, their outputs in the past crawls have mostly been correct, with the exception of Dickies.com, so it is pretty strange that we got the opposite results in our recent small crawl.

2. Do you get the same results for Colorado, California, another VPN? In other words, is this VPN-dependent?

Since I used a Colorado VPN for the recent small crawl, I have also crawled these focus sites with a California VPN. After analyzing the California results, I realized that the VPN could be a contributing factor, as it produced minor differences in the outputs. I will therefore redo these focused mini crawls with more VPN IP addresses (i.e., more than one Colorado and one California VPN) to have more data to compare and confirm before writing up the analysis here.
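One way to make these cross-round and cross-VPN comparisons repeatable is a small diff script over the per-crawl result files; the file names and column names below are assumptions about how the results are stored:

```python
# Illustrative sketch: flag sites whose isGpcEnabled value changed
# between crawl rounds. CSV layout and round names are hypothetical.
import csv

ROUNDS = ["dec", "april", "june", "recent"]

def load_round(path):
    with open(path, newline="") as f:
        return {row["site"]: row["isGpcEnabled"] for row in csv.DictReader(f)}

rounds = {name: load_round(f"{name}.csv") for name in ROUNDS}
overlap = set.intersection(*(set(r) for r in rounds.values()))
for site in sorted(overlap):
    history = [rounds[name][site] for name in ROUNDS]
    if len(set(history)) > 1:  # value flipped at some point
        print(site, history)
```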

SebastianZimmeck commented 1 week ago

Thanks, @franciscawijaya!

While there are only five overlapping sites, their outputs in the past crawls have mostly been correct, with the exception of Dickies.com, so it is pretty strange that we got the opposite results in our recent small crawl.

That is helpful to know! So, before starting the next crawl we should try to understand what the reasons for these performance drops are.

I realized that the VPN could be a contributing factor, as it produced minor differences in the outputs. I will therefore redo these focused mini crawls with more VPN IP addresses to have more data to compare

Yes, that is a good point to try.

@natelevinson10, can you coordinate with @franciscawijaya and also look into this as a team?

franciscawijaya commented 1 week ago

As mentioned during our call, I redid the crawl using six different VPNs from Colorado and California, and the manual-crawl data discrepancy was unfortunately still present. Hence, I concluded that we can rule out a VPN issue.

I also looked into the particular human check error across all these sites, as all of their human checks are powered by the same company, PerimeterX. Examining the past crawl results, I started to wonder whether the cause of this discrepancy is a recent integration of this particular human check. I have also researched the possibility of bypassing this human check and browsed for useful resources.

Over the next couple of weeks, @natelevinson10 and I will examine the possibility of bypassing the human check error. As it stands right now, our crawler correctly identifies and flags the human check error, so we will now explore the possibility of bypassing human checks with our crawler.
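On the flagging side, one way to log PerimeterX specifically (rather than a generic human check) is to look for its block-page markers; the `px-captcha` element and `_px*` cookies below are commonly reported PerimeterX artifacts, but they are assumptions to verify against real block pages:

```python
# Illustrative sketch: heuristically detect a PerimeterX block page in
# a Selenium session. The markers checked here are assumed, not verified.
from selenium.webdriver.common.by import By

def looks_like_perimeterx(driver):
    has_px_element = bool(driver.find_elements(By.ID, "px-captcha"))
    has_px_cookie = any(c["name"].startswith("_px") for c in driver.get_cookies())
    return has_px_element or has_px_cookie
```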

SebastianZimmeck commented 1 week ago

Nice work, @franciscawijaya!

Our approach is two-pronged:

  1. Check if we can add human check circumvention functionality to our crawler (at the moment, we are only logging human checks; we are not trying to circumvent them; this is true not only for PerimeterX but for any human check service)
  2. If we come to the conclusion (after, say, two weeks) that it is too difficult to circumvent human checks, make sure that the logging of human checks is accurate and captures PerimeterX so that we can exclude data from sites with human checks post-crawl in the data analysis

Since we currently have no human check circumvention functionality in the crawler, this issue is not limited to PerimeterX but extends to other human check services as well.

As a starting point for potential human check circumvention functionality, @natelevinson10 will do a mini crawl of sites with human checks known to us and observe their behavior, which can then point toward circumvention strategies.

@natelevinson10 and @franciscawijaya will coordinate who does what on this issue.

natelevinson10 commented 4 days ago

I compiled a list of sites that were tagged as "HumanCheckError" from batches 1-2 of our June 2024 crawls. The list comprises 97 sites in total. I ran an initial crawl with timeouts set to 10,000 (crawler) and 5,000 (analysis) and have uploaded the results to a new "Human_Check_Analysis" folder in the drive.

It seems that the majority (~80%) of the sites threw a human verification error. I think it would be a good idea to crawl the same sites from a few different VPN locations to see whether the errors are consistent across numerous IP addresses. This is something I could look into next week.
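For reproducibility, the ~80% figure can be recomputed directly from the crawl output; the results-file layout here is an assumption:

```python
# Illustrative sketch: compute the share of crawled sites flagged with
# HumanCheckError. The JSON layout of the results file is hypothetical.
import json

with open("human_check_analysis.json") as f:
    results = json.load(f)  # assumed: list of {"site": ..., "error": ...}

flagged = [r["site"] for r in results if r.get("error") == "HumanCheckError"]
print(f"{len(flagged)}/{len(results)} sites "
      f"({100 * len(flagged) / len(results):.0f}%) threw a human check error")
```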

SebastianZimmeck commented 3 days ago

I compiled a list of sites that were tagged as "HumanCheckError" from batches 1-2 of our June 2024 crawls. The list comprises 97 sites in total. I ran an initial crawl with timeouts set to 10,000 (crawler) and 5,000 (analysis) and have uploaded the results to a new "Human_Check_Analysis" folder in the drive.

@natelevinson10, what is the path to this folder? Can you also link it here? I see a "Human_Verification_Crawl" folder. Is that the folder you mean?

It seems that the majority (~80%) of the sites threw a human verification error.

So, these were all sites that had a human check in June 2024, and 20% of them no longer have one?

I think it would be a good idea to crawl the same sites from a few different VPN locations to see whether the errors are consistent across numerous IP addresses. This is something I could look into next week.

Let's hold off on that for the time being. It is likely that different VPN IP addresses will also be blocked.

Just to recall the point of this whole exercise: per our results, we see the following (apparently due to a human check; is the human check the cause of these discrepancies?):

[Screenshot, 2024-10-06: crawl results diverging from manual analysis results]

Are these occurrences of new human checks that we have not seen before so frequent that we need to address them by trying to add circumvention functionality to the crawler?

natelevinson10 commented 2 days ago

Here is the path to the folder: https://drive.google.com/drive/u/1/folders/1wdAvF_-Tkfux-lD0rDahZzZ3Az_NQdbM

80% of the sites had a human check error, and a handful errored out, meaning ~15 sites did not have a human check error despite having one earlier. As for the CPA GPC data that we manually reviewed, it would be a good idea to run crawls on just these sites and see whether they ever throw a human verification check, possibly crawling the few sites with data mismatches several times, changing the IP each time, and compiling the results. This is definitely something I can do over the next few days.
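Compiling those repeated runs might look like the following; the per-run file naming and fields are assumptions:

```python
# Illustrative sketch: aggregate repeated crawls of the same sites under
# different IPs and report how consistently each site shows a human check.
import csv
import glob
from collections import defaultdict

counts = defaultdict(lambda: [0, 0])  # site -> [human_check_hits, total_runs]
for path in glob.glob("run_*.csv"):   # assumed: one result file per VPN/IP run
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["site"]][1] += 1
            if row.get("error") == "HumanCheckError":
                counts[row["site"]][0] += 1

for site, (hits, runs) in sorted(counts.items()):
    print(f"{site}: human check in {hits}/{runs} runs")
```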

SebastianZimmeck commented 2 days ago

As for the CPA GPC data that we manually reviewed, it would be a good idea to run crawls on just these sites and see whether they ever throw a human verification check. ... This is definitely something I can do over the next few days.

Excellent, that is a good idea! Let's do that.